Magic of Portfolio Charts
An equity market is a market in which shares of companies are issued and traded; it is essentially a meeting point for buyers and sellers of stocks. When an investor buys stocks, they get credited to the investor's Demat account and become part of the holdings in their portfolio. Investors typically need a system that tracks their portfolio's performance over time and surfaces insights from it. To support this, we developed the Portfolio Charts system, which gives a graphical representation of an investor's total purchase value and current market value.
In this blog post, we will give an overview of the Portfolio Charts system at Paytm Money and share some of the learnings we gained along the way.
First, let's talk about how holdings information gets generated from the stocks an investor buys. Suppose the user makes the following transactions:
| Date | Stock | Type | Quantity |
| --- | --- | --- | --- |
| 03 Jan 2022 | S1 | Buy | 10 |
| 03 Jan 2022 | S2 | Buy | 5 |
| 04 Jan 2022 | S1 | Sell | 5 |
| 05 Jan 2022 | S3 | Buy | 10 |
| 05 Jan 2022 | S2 | Buy | 5 |
So at the end of 05 Jan 2022, and on every day after it, we will have four holding records, since buy transactions keep accumulating until they are sold.
| Holding Date | Stock | Type | Quantity | Buy Date |
| --- | --- | --- | --- | --- |
| 03 Jan 2022 | S1 | Buy | 10 | 03 Jan 2022 |
| 03 Jan 2022 | S2 | Buy | 5 | 03 Jan 2022 |
| 04 Jan 2022 | S1 | Buy | 5 | 03 Jan 2022 |
| 04 Jan 2022 | S2 | Buy | 5 | 03 Jan 2022 |
| 05 Jan 2022 | S1 | Buy | 5 | 03 Jan 2022 |
| 05 Jan 2022 | S2 | Buy | 5 | 03 Jan 2022 |
| 05 Jan 2022 | S2 | Buy | 5 | 05 Jan 2022 |
| 05 Jan 2022 | S3 | Buy | 10 | 05 Jan 2022 |
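To make the accumulation concrete, here is a minimal sketch of expanding buy/sell transactions into per-day holdings. The data structures are illustrative, not the actual holdings schema, and it assumes sells consume the oldest buy lots of a stock first, which matches the table above:

```python
from collections import defaultdict

def daily_holdings(transactions, days):
    """Expand (date, stock, side, qty) transactions into per-day
    holdings. Sells reduce the oldest open buy lots of that stock
    first, matching the example table above."""
    lots = []                      # open buy lots: [stock, qty, buy_date]
    txns_by_day = defaultdict(list)
    for t in transactions:
        txns_by_day[t[0]].append(t)
    snapshots = {}
    for day in days:
        for _, stock, side, qty in txns_by_day[day]:
            if side == "Buy":
                lots.append([stock, qty, day])
            else:                  # Sell: consume oldest lots first
                remaining = qty
                for lot in lots:
                    if lot[0] == stock and remaining > 0:
                        taken = min(lot[1], remaining)
                        lot[1] -= taken
                        remaining -= taken
                lots = [l for l in lots if l[1] > 0]
        snapshots[day] = [tuple(l) for l in lots]
    return snapshots

txns = [
    ("03 Jan 2022", "S1", "Buy", 10),
    ("03 Jan 2022", "S2", "Buy", 5),
    ("04 Jan 2022", "S1", "Sell", 5),
    ("05 Jan 2022", "S3", "Buy", 10),
    ("05 Jan 2022", "S2", "Buy", 5),
]
days = ["03 Jan 2022", "04 Jan 2022", "05 Jan 2022"]
snap = daily_holdings(txns, days)
# At the end of 05 Jan 2022 there are four holding records.
```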
The idea of the above table is to show how the records grow over a period of time.
At Paytm Money, we have a third-party legacy system that generates users' holdings from transactions on a daily basis and uses MS SQL Server as its persistence layer. The database tables containing holdings have more than 25 columns and billions of rows, as holdings keep accumulating. They store only the last N (15) days' holdings: the latest date's records are sufficient to know a user's current holdings, and the last N days are enough to correct the system if any inconsistencies are found. This also reduces the amount of storage used. Apart from this, the third-party system suffered from the following problems:
- It was not robust enough.
- It was not scalable.
- It could not handle high traffic volumes.
Improving the legacy system was not an option due to multiple constraints, so we decided to build our own scalable system by pulling all the holdings out of the MS SQL Server on a daily basis.
We had four major challenges:
- Processing billions of holding records in a reasonable time and building the charts' data points.
- Choosing the destination persistence layer and minimizing storage utilization.
- Serving the data with low latency.
- Reprocessing a chart's data points when incorrect holdings get pushed into the source database.
Optimizing the DB record fetching process
Let's talk about the data source, which in our case is MS SQL Server. Fetching such a huge number of records was a real challenge, and it took multiple iterations to find the sweet spot: fetching rows in batches through range queries over sequential primary keys. We followed the steps below:
- Get the minimum and maximum primary key ids.
- Run the query in batches, with a primary key range condition in the WHERE clause.
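The two steps above can be sketched as follows. An in-memory SQLite database stands in for SQL Server here, and the table and column names (`holdings`, `id`) are assumptions, not the production schema:

```python
import sqlite3

def fetch_in_batches(conn, table, batch_size):
    """Yield all rows of `table` in primary-key ranges instead of
    issuing one huge scan."""
    (lo, hi), = conn.execute(f"SELECT MIN(id), MAX(id) FROM {table}")
    if lo is None:                 # empty table: nothing to fetch
        return
    start = lo
    while start <= hi:
        end = start + batch_size
        yield conn.execute(
            f"SELECT * FROM {table} WHERE id >= ? AND id < ?",
            (start, end),
        ).fetchall()
        start = end

# Demo: 1000 rows fetched in batches of 100.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE holdings (id INTEGER PRIMARY KEY, stock TEXT)")
conn.executemany("INSERT INTO holdings VALUES (?, ?)",
                 [(i, "S1") for i in range(1, 1001)])
total = sum(len(batch) for batch in fetch_in_batches(conn, "holdings", 100))
```

Each batch touches a contiguous key range, which is exactly the access pattern the clustered index below serves well.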
Why does this logic work?
SQL Server stores a clustered primary key in a B-tree data structure, which makes retrieval faster. The tree has three levels:
Root level: This level is the starting point of the query.
Intermediate level: This level connects the root and leaf levels.
Leaf level: This is the lowest level of the clustered index; all records are stored at this level.
For example, suppose we query 130 <= id < 400. The query traverses from the root level through the intermediate level down to the leaf level. Because our ids are sequential, there is no need to go back to the root and start the search again: the engine reads the data from the current leaf, moves on to the next leaf, and so on, until the range condition is no longer met.
Choosing the persistence layer
Our second task was to choose the right destination data store: one that scales well, provides quick search capabilities, and can store huge volumes of data quickly and in near real time. We explored multiple options like MongoDB, Elasticsearch, and Cassandra, and found that Elasticsearch suits our requirements very well. It is a distributed NoSQL document store and analytics engine built on Apache Lucene. Here is why we chose it:
- It provides extensive REST APIs for storing and searching data.
- It is a document-based NoSQL database; documents are collections of key-value pairs, so we can keep all of a user's holdings in one document and search on selected keys.
- It is horizontally scalable, meaning data can be distributed among multiple nodes.
- It supports sharding and routing. Sharding divides the data volume into smaller chunks, and each shard is itself a Lucene search engine; shards are distributed uniformly among the nodes. Routing is a way to tell Elasticsearch where to keep, and where to search for, the data.
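As a rough sketch of how routing works: Elasticsearch hashes the routing value (it actually uses a murmur3 hash; crc32 stands in below purely to show the shard = hash % num_shards idea) and sends the document to that shard. The index and field names here are illustrative:

```python
import json
import zlib

def shard_for(routing_value, num_shards):
    # Stand-in for Elasticsearch's murmur3-based routing:
    # shard = hash(routing_value) % number_of_primary_shards.
    return zlib.crc32(routing_value.encode()) % num_shards

def index_request(index, doc_id, user_id, doc):
    # Build (not send) an index call routed by user id, so all of a
    # user's holdings documents land on, and are searched from,
    # the same shard.
    url = f"/{index}/_doc/{doc_id}?routing={user_id}"
    return "PUT", url, json.dumps(doc)
```

Because every request for one user carries the same routing value, a chart query only has to touch a single shard instead of fanning out to all of them.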
The next task was to optimize the Elasticsearch storage. We did the following:
- Stored all nested details in the form of an array rather than storing them in a list of objects.
- Reviewed whether the variables fit into smaller data types.
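The first of these points can be sketched as two shapes of the same document; the field names are illustrative. Serialized size is only a rough proxy (in Elasticsearch the saving also comes from avoiding per-object overhead), but it shows the shape of the change:

```python
import json

# A user's holdings as a list of objects: each key is repeated
# once per holding.
as_objects = {
    "holdings": [
        {"stock": "S1", "qty": 5, "buy_epoch": 1641168000},
        {"stock": "S2", "qty": 5, "buy_epoch": 1641168000},
    ]
}

# The same holdings as parallel arrays: each key appears once
# per document, however many holdings there are.
as_arrays = {
    "stocks": ["S1", "S2"],
    "qtys": [5, 5],
    "buy_epochs": [1641168000, 1641168000],
}

size_objects = len(json.dumps(as_objects))
size_arrays = len(json.dumps(as_arrays))
```

The gap widens as the number of holdings per document grows, since the array form amortizes the key names over all entries.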
The next, and most important, challenge was API performance optimization. We did the following:
- Indexed only those fields that are needed to perform search operations. This helped reduce memory pressure on Elasticsearch.
- Since Elasticsearch internally stores date values as epoch long values anyway, we stored epoch longs directly rather than formatted date strings. This avoids date conversions during search operations.
- Excluded unnecessary fields from the search response.
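Putting those three points together, a chart lookup might be built like the sketch below. The field names (`user_id`, `holding_epoch`, `raw_rows`) are assumptions for illustration, not our actual mapping; the shape (epoch-long range filter plus `_source` excludes) is the point:

```python
def chart_query(user_id, from_epoch, to_epoch):
    """Build an Elasticsearch search body for one user's chart
    points: dates are compared as epoch longs, and fields the chart
    does not need are excluded from the response via _source."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"user_id": user_id}},
                    {"range": {"holding_epoch": {
                        "gte": from_epoch, "lt": to_epoch}}},
                ]
            }
        },
        "_source": {"excludes": ["raw_rows"]},
    }

q = chart_query("u1", 1640995200, 1641600000)
```

Filters in a `bool` query skip scoring entirely, which fits a chart lookup where relevance ranking is irrelevant.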
Sometimes, due to wrong trade data or an incorrect corporate action submission, inconsistent holdings get pushed to the third-party legacy system, and the same gets reflected in the Elasticsearch index as well. These inconsistencies are typically resolved on the third-party system with some manual effort. The challenge is then to make the Elasticsearch index consistent again.
One easy but inefficient solution is to reprocess those records on demand. We didn't choose this approach since it requires extra manual effort on both sides. Instead, we built a system that pulls all of the last N days' holdings available in the legacy system every day. That way, we do not have to worry about the corrections happening on the legacy system side.
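One way to make such a daily re-pull self-correcting is to give each user/day document a deterministic id, so re-indexing overwrites the stale copy instead of duplicating it. This is a sketch of the idea, with an illustrative id scheme, not necessarily the one used in production:

```python
def doc_id(user_id, holding_epoch):
    # Deterministic id: the same user and holding day always map to
    # the same document, so a re-pull replaces rather than appends.
    return f"{user_id}:{holding_epoch}"

# A dict stands in for the Elasticsearch index.
index = {}
index[doc_id("u1", 1641340800)] = {"qty": 99}   # inconsistent push
index[doc_id("u1", 1641340800)] = {"qty": 10}   # next day's re-pull
```

After the second write the index holds one corrected document for that user/day, which is exactly the behavior the daily N-day pull relies on.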
As some will read this blog to learn something new, our idea is to showcase how one can put the right technology in the right place to improve the scalability of a system that integrates with an existing third-party legacy system that itself cannot be improved.
Blog contribution: Raj Kumar Saroj