Magic of Portfolio Charts
An equity market is a market in which shares of companies are issued and traded; it is essentially a meeting point for buyers and sellers of stocks. When an investor buys stocks, they get credited to the investor's Demat account and become part of the holdings in their portfolio. Investors typically need a system that tracks their portfolio's performance over time and surfaces insights from it. To support this, we developed the Portfolio Charts system, which gives a graphical representation of an investor's total purchase value and current market value.
In this blog post, we will give an overview of the Portfolio Charts system at Paytm Money and share some of the learnings we gained along the way.
First, let's talk about how holdings information gets generated from the stocks an investor buys. Suppose the user makes the following transactions:
| Date | Stock | Type | Quantity |
| --- | --- | --- | --- |
| 03 Jan 2022 | S1 | Buy | 10 |
| 03 Jan 2022 | S2 | Buy | 5 |
| 04 Jan 2022 | S1 | Sell | 5 |
| 05 Jan 2022 | S3 | Buy | 10 |
| 05 Jan 2022 | S2 | Buy | 5 |
So at the end of 05 Jan 2022, and on every day after it, we will have four holding records, since buy transactions keep accumulating until they are sold.
| Holding Date | Stock | Type | Quantity | Buy Date |
| --- | --- | --- | --- | --- |
| 03 Jan 2022 | S1 | Buy | 10 | 03 Jan 2022 |
| 03 Jan 2022 | S2 | Buy | 5 | 03 Jan 2022 |
| 04 Jan 2022 | S1 | Buy | 5 | 03 Jan 2022 |
| 04 Jan 2022 | S2 | Buy | 5 | 03 Jan 2022 |
| 05 Jan 2022 | S1 | Buy | 5 | 03 Jan 2022 |
| 05 Jan 2022 | S2 | Buy | 5 | 03 Jan 2022 |
| 05 Jan 2022 | S2 | Buy | 5 | 05 Jan 2022 |
| 05 Jan 2022 | S3 | Buy | 10 | 05 Jan 2022 |
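To make the accumulation concrete, here is a minimal sketch of expanding buy/sell transactions into per-day holdings. The data structures are illustrative, not the actual holdings schema, and it assumes sells consume the oldest buy lots of a stock first, which matches the table above:

```python
from collections import defaultdict

def daily_holdings(transactions, days):
    """Expand (date, stock, side, qty) transactions into per-day
    holdings. Sells reduce the oldest open buy lots of that stock
    first, matching the example table above."""
    lots = []                      # open buy lots: [stock, qty, buy_date]
    txns_by_day = defaultdict(list)
    for t in transactions:
        txns_by_day[t[0]].append(t)
    snapshots = {}
    for day in days:
        for _, stock, side, qty in txns_by_day[day]:
            if side == "Buy":
                lots.append([stock, qty, day])
            else:                  # Sell: consume oldest lots first
                remaining = qty
                for lot in lots:
                    if lot[0] == stock and remaining > 0:
                        taken = min(lot[1], remaining)
                        lot[1] -= taken
                        remaining -= taken
                lots = [l for l in lots if l[1] > 0]
        snapshots[day] = [tuple(l) for l in lots]
    return snapshots

txns = [
    ("03 Jan 2022", "S1", "Buy", 10),
    ("03 Jan 2022", "S2", "Buy", 5),
    ("04 Jan 2022", "S1", "Sell", 5),
    ("05 Jan 2022", "S3", "Buy", 10),
    ("05 Jan 2022", "S2", "Buy", 5),
]
days = ["03 Jan 2022", "04 Jan 2022", "05 Jan 2022"]
snap = daily_holdings(txns, days)
# At the end of 05 Jan 2022 there are four holding records.
```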
The idea of the above table is to show how the records grow over a period of time.
At Paytm Money, we have a third-party legacy system that generates users' holdings from transactions on a daily basis and uses MS SQL Server as its persistence layer. The database tables containing holdings have more than 25 columns and billions of rows, as holdings keep accumulating. They store only the last N (15) days' holdings: the latest date's records are sufficient to know a user's current holdings, and the last N days are enough to correct the system if any inconsistencies are found. This also reduces the amount of storage used. Apart from this, the third-party system suffered from the following problems:
- It was not robust enough.
- It was not scalable.
- It could not handle high traffic volumes.
Improving the legacy system was not an option due to multiple constraints, so we decided to build our own scalable system by pulling all the holdings out of the MS SQL Server on a daily basis.
We had four major challenges:
- Processing billions of holding records in a reasonable time and building the charts' data points.
- Choosing the destination persistence layer and minimizing storage utilization.
- Serving the data with low latency.
- Reprocessing a chart's data points when incorrect holdings get pushed into the source database.
Optimizing the DB record fetching process
Let's talk about the data source, which in our case is MS SQL Server. Fetching such a huge number of records was a real challenge, and it took multiple iterations to find the sweet spot: fetching rows in batches through range queries over sequential primary keys. We followed the steps below:
- Get the minimum and maximum primary key ids.
- Run the query in batches, with a primary key range condition in the WHERE clause.
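The two steps above can be sketched as follows. An in-memory SQLite database stands in for SQL Server here, and the table and column names (`holdings`, `id`) are assumptions, not the production schema:

```python
import sqlite3

def fetch_in_batches(conn, table, batch_size):
    """Yield all rows of `table` in primary-key ranges instead of
    issuing one huge scan."""
    (lo, hi), = conn.execute(f"SELECT MIN(id), MAX(id) FROM {table}")
    if lo is None:                 # empty table: nothing to fetch
        return
    start = lo
    while start <= hi:
        end = start + batch_size
        yield conn.execute(
            f"SELECT * FROM {table} WHERE id >= ? AND id < ?",
            (start, end),
        ).fetchall()
        start = end

# Demo: 1000 rows fetched in batches of 100.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE holdings (id INTEGER PRIMARY KEY, stock TEXT)")
conn.executemany("INSERT INTO holdings VALUES (?, ?)",
                 [(i, "S1") for i in range(1, 1001)])
total = sum(len(batch) for batch in fetch_in_batches(conn, "holdings", 100))
```

Each batch touches a contiguous key range, which is exactly the access pattern the clustered index below serves well.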
Why does this logic work?
SQL Server stores a clustered primary key in a B-tree data structure, which makes retrieval faster. The tree has three levels:
Root level: This level is the starting point of the query.
Intermediate level: This level connects the root and leaf levels.
Leaf level: This is the lowest level of the clustered index; all records are stored at this level.
For example, suppose we query 130 <= id < 400. The query traverses from the root level through the intermediate level down to the leaf level. Because our ids are sequential, there is no need to go back to the root and start the search again: the engine reads the data from the current leaf, moves on to the next leaf, and so on, until the range condition is no longer met.
Choosing the persistence layer
Our second task was to choose the right destination data store: one that scales well, provides quick search capabilities, and can store huge volumes of data quickly and in near real time. We explored multiple options like MongoDB, Elasticsearch, and Cassandra, and found that Elasticsearch suits our requirements very well. It is a distributed NoSQL document store and analytics engine built on Apache Lucene. Here is why we chose it:
- It provides extensive REST APIs for storing and searching data.
- It is a document-based NoSQL database; documents are collections of key-value pairs, so we can keep all of a user's holdings in one document and search on selected keys.
- It is horizontally scalable, meaning data can be distributed among multiple nodes.
- It supports sharding and routing. Sharding divides the data volume into smaller chunks, and each shard is itself a Lucene search engine; shards are distributed uniformly among the nodes. Routing is a way to tell Elasticsearch where to keep, and where to search for, the data.
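As a rough sketch of how routing works: Elasticsearch hashes the routing value (it actually uses a murmur3 hash; crc32 stands in below purely to show the shard = hash % num_shards idea) and sends the document to that shard. The index and field names here are illustrative:

```python
import json
import zlib

def shard_for(routing_value, num_shards):
    # Stand-in for Elasticsearch's murmur3-based routing:
    # shard = hash(routing_value) % number_of_primary_shards.
    return zlib.crc32(routing_value.encode()) % num_shards

def index_request(index, doc_id, user_id, doc):
    # Build (not send) an index call routed by user id, so all of a
    # user's holdings documents land on, and are searched from,
    # the same shard.
    url = f"/{index}/_doc/{doc_id}?routing={user_id}"
    return "PUT", url, json.dumps(doc)
```

Because every request for one user carries the same routing value, a chart query only has to touch a single shard instead of fanning out to all of them.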
The next task was to optimize the Elasticsearch storage. We did the following:
- Stored all nested details in the form of an array rather than storing them in a list of objects.
- Reviewed whether the variables fit into smaller data types.
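The first of these points can be sketched as two shapes of the same document; the field names are illustrative. Serialized size is only a rough proxy (in Elasticsearch the saving also comes from avoiding per-object overhead), but it shows the shape of the change:

```python
import json

# A user's holdings as a list of objects: each key is repeated
# once per holding.
as_objects = {
    "holdings": [
        {"stock": "S1", "qty": 5, "buy_epoch": 1641168000},
        {"stock": "S2", "qty": 5, "buy_epoch": 1641168000},
    ]
}

# The same holdings as parallel arrays: each key appears once
# per document, however many holdings there are.
as_arrays = {
    "stocks": ["S1", "S2"],
    "qtys": [5, 5],
    "buy_epochs": [1641168000, 1641168000],
}

size_objects = len(json.dumps(as_objects))
size_arrays = len(json.dumps(as_arrays))
```

The gap widens as the number of holdings per document grows, since the array form amortizes the key names over all entries.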
The next, and most important, challenge was API performance optimization. We did the following:
- Indexed only those fields that are needed to perform search operations. This helped reduce memory pressure on Elasticsearch.
- Since Elasticsearch internally stores date values as epoch long values anyway, we stored epoch longs directly rather than formatted date strings. This avoids date conversions during search operations.
- Excluded unnecessary fields from the search response.
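Putting those three points together, a chart lookup might be built like the sketch below. The field names (`user_id`, `holding_epoch`, `raw_rows`) are assumptions for illustration, not our actual mapping; the shape (epoch-long range filter plus `_source` excludes) is the point:

```python
def chart_query(user_id, from_epoch, to_epoch):
    """Build an Elasticsearch search body for one user's chart
    points: dates are compared as epoch longs, and fields the chart
    does not need are excluded from the response via _source."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"user_id": user_id}},
                    {"range": {"holding_epoch": {
                        "gte": from_epoch, "lt": to_epoch}}},
                ]
            }
        },
        "_source": {"excludes": ["raw_rows"]},
    }

q = chart_query("u1", 1640995200, 1641600000)
```

Filters in a `bool` query skip scoring entirely, which fits a chart lookup where relevance ranking is irrelevant.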
Sometimes, due to wrong trade data or an incorrect corporate action submission, inconsistent holdings get pushed to the third-party legacy system, and the same gets reflected in the Elasticsearch index as well. These inconsistencies are typically resolved on the third-party system with some manual effort. The challenge is then to make the Elasticsearch index consistent again.
One easy but inefficient solution is to reprocess those records on demand. We didn't choose this approach since it requires extra manual effort on both sides. Instead, we built a system that pulls all of the last N days' holdings available in the legacy system every day. That way, we do not have to worry about the corrections happening on the legacy system side.
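One way to make such a daily re-pull self-correcting is to give each user/day document a deterministic id, so re-indexing overwrites the stale copy instead of duplicating it. This is a sketch of the idea, with an illustrative id scheme, not necessarily the one used in production:

```python
def doc_id(user_id, holding_epoch):
    # Deterministic id: the same user and holding day always map to
    # the same document, so a re-pull replaces rather than appends.
    return f"{user_id}:{holding_epoch}"

# A dict stands in for the Elasticsearch index.
index = {}
index[doc_id("u1", 1641340800)] = {"qty": 99}   # inconsistent push
index[doc_id("u1", 1641340800)] = {"qty": 10}   # next day's re-pull
```

After the second write the index holds one corrected document for that user/day, which is exactly the behavior the daily N-day pull relies on.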
As some will read this blog to learn something new, our idea is to showcase how one can put the right technology in the right place to improve the scalability of a system that integrates with an existing third-party legacy system that itself cannot be improved.
Blog contribution: Raj Kumar Saroj