
How to store last n-days of historical data in a partitioned table - second attempt

In the previous post, I tried to reproduce the thinking process behind redesigning the way historical data is stored in a relational database. The goal was to speed up the daily maintenance task that ages data. Dropping the oldest portion of data would now take only milliseconds. That was a huge success: with the previous solution, the ageing task ran for minutes, causing table-level locks and leading to drastic underperformance of the whole system.

Why was that not the best idea?

It's true that the maintenance improvement was a success. However, it wasn't long before I realized that the additional column with rolling integer values, by which the table had been partitioned, was a double-edged sword. It makes moving the sliding window of data to the next day easy, but writing ad-hoc queries that take advantage of partitioning is no longer straightforward. One needs to add an extra condition on the partition index column to every query, and each time first establish what today's partition index value is.

The tradeoff

If I were to do it again, I would certainly not create an additional partitionIndex column. I would partition directly on the datetime2 column. The daily maintenance code becomes considerably more complex, but writing ad-hoc SQL queries stays as easy as possible.

Changes in the daily maintenance stored procedure

Instead of simply truncating a single partition, every night we would have to:
  • Truncate the left-most partition
  • Split the right-most partition at a new boundary (an additional day)
  • Merge away the now-empty oldest boundary
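The three steps above can be sketched as plain T-SQL statements. The object names (pfHistoricalData, psHistoricalData, HistoricalData) and the boundary dates are illustrative assumptions, not the original post's code:

```sql
-- 1. Truncate the left-most partition (meta-data operation, SQL Server 2016+).
TRUNCATE TABLE dbo.HistoricalData WITH (PARTITIONS (1));

-- 2. Split the right-most partition at a new boundary (one day ahead).
--    SPLIT requires a NEXT USED filegroup to be set on the scheme first.
ALTER PARTITION SCHEME psHistoricalData NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pfHistoricalData() SPLIT RANGE ('2020-07-06');

-- 3. Merge away the now-empty oldest boundary.
ALTER PARTITION FUNCTION pfHistoricalData() MERGE RANGE ('2020-06-01');
```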

It definitely makes the maintenance code more complex. I've just shown an example, but in order to really implement this solution we would have to automate the split/merge so that the boundaries are calculated dynamically.

The recap

Let's summarize. We have a table called HistoricalData which, among other columns, has a referenceDate datetime2 column. We want to maintain 35 partitions holding full days of past data plus the current working day.

The goal is to perform daily maintenance within milliseconds and speed up ad-hoc SQL queries.

Let's create an example table for the sake of this tutorial. Then let's create a partition function, a partition scheme, and a clustered index based on that scheme.
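A sketch of that setup is below. All names (HistoricalData, pfHistoricalData, psHistoricalData) and the 35-day boundary layout are my assumptions; the boundaries are generated relative to the current date so the script stays reusable:

```sql
-- Build 35 daily boundary values: 34 past days up to and including today.
-- With RANGE RIGHT this yields 36 partitions; partition 1 (everything older
-- than the oldest boundary) stays empty and serves as the truncation target.
DECLARE @values nvarchar(max) = N'', @day int = -34, @sql nvarchar(max);
WHILE @day <= 0
BEGIN
    SET @values += CASE WHEN @values = N'' THEN N'' ELSE N', ' END
        + QUOTENAME(CONVERT(nchar(10), DATEADD(DAY, @day, SYSDATETIME()), 23), '''');
    SET @day += 1;
END

SET @sql = N'CREATE PARTITION FUNCTION pfHistoricalData (datetime2)
             AS RANGE RIGHT FOR VALUES (' + @values + N');';
EXEC sp_executesql @sql;

-- All partitions live on a single filegroup (see the note at the end of the post).
CREATE PARTITION SCHEME psHistoricalData
AS PARTITION pfHistoricalData ALL TO ([PRIMARY]);

CREATE TABLE dbo.HistoricalData (
    id            bigint IDENTITY(1,1) NOT NULL,
    referenceDate datetime2 NOT NULL,
    payload       nvarchar(200) NULL   -- stand-in for the real columns
);

-- Clustered index on referenceDate, stored on the partition scheme,
-- so the table itself is partitioned by day.
CREATE CLUSTERED INDEX cixHistoricalData
ON dbo.HistoricalData (referenceDate)
ON psHistoricalData (referenceDate);
```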


After that, let's create a daily maintenance procedure that will be executed each night right after midnight.
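A possible shape of that procedure, assuming the hypothetical names from the setup above. The boundary values are read from the catalog views at run time, so the split/merge is fully dynamic:

```sql
CREATE OR ALTER PROCEDURE dbo.DailyMaintenance
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @oldestBoundary datetime2,
            @newBoundary    datetime2,
            @sql            nvarchar(max);

    -- Find the oldest boundary of the partition function.
    SELECT @oldestBoundary = CONVERT(datetime2, MIN(prv.value))
    FROM sys.partition_range_values AS prv
    JOIN sys.partition_functions AS pf
        ON pf.function_id = prv.function_id
    WHERE pf.name = N'pfHistoricalData';

    -- 1. Truncate the left-most partition (meta-data only, SQL Server 2016+).
    TRUNCATE TABLE dbo.HistoricalData WITH (PARTITIONS (1));

    -- 2. Merge away the now-empty oldest boundary.
    SET @sql = N'ALTER PARTITION FUNCTION pfHistoricalData() MERGE RANGE ('''
             + CONVERT(nvarchar(30), @oldestBoundary, 126) + N''');';
    EXEC sp_executesql @sql;

    -- 3. Split a new boundary for tomorrow so future rows get their own partition.
    SET @newBoundary = CONVERT(datetime2, DATEADD(DAY, 1, CONVERT(date, SYSDATETIME())));
    ALTER PARTITION SCHEME psHistoricalData NEXT USED [PRIMARY];
    SET @sql = N'ALTER PARTITION FUNCTION pfHistoricalData() SPLIT RANGE ('''
             + CONVERT(nvarchar(30), @newBoundary, 126) + N''');';
    EXEC sp_executesql @sql;
END
```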


Let's test the solution. First, let's create a helper procedure to query partition details.
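Such a helper could look like this (again, the names are my placeholders). With RANGE RIGHT, boundary N is the exclusive upper bound of partition N, so the last partition shows NULL:

```sql
CREATE OR ALTER PROCEDURE dbo.ShowPartitions
AS
BEGIN
    SET NOCOUNT ON;

    SELECT p.partition_number,
           prv.value AS upper_boundary,   -- exclusive upper bound (RANGE RIGHT)
           p.rows
    FROM sys.partitions AS p
    LEFT JOIN sys.partition_range_values AS prv
        ON prv.boundary_id = p.partition_number
       AND prv.function_id = (SELECT function_id
                              FROM sys.partition_functions
                              WHERE name = N'pfHistoricalData')
    WHERE p.object_id = OBJECT_ID(N'dbo.HistoricalData')
      AND p.index_id IN (0, 1)            -- heap or clustered index only
    ORDER BY p.partition_number;
END
```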
When we execute the procedure, we can see the partitions as expected.

It is the 5th of July at the time I write this sentence. A perfect time to execute our daily maintenance procedure. The effect is as expected: the oldest day has been truncated, its boundary has been merged away, and a new partition has been created for future data.
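A minimal smoke test, using the hypothetical procedure names from the sketches above:

```sql
EXEC dbo.DailyMaintenance;  -- truncate the oldest day, merge its boundary, split a new one
EXEC dbo.ShowPartitions;    -- inspect the resulting partition layout
```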


That's pretty much it. There is one thing I haven't mentioned yet: all our partitions live in a single filegroup, which is not the most effective way of partitioning data. I didn't worry about that, as my DB runs on AWS RDS, where I'm not sure it is even possible to define multiple filegroups.
