
How to store last n-days of historical data in a partitioned table - second attempt

In the previous post, I tried to reproduce the thinking process behind redesigning the way historical data is stored in the relational database. The goal was to speed up the daily maintenance task that ages data: dropping the oldest day of data would now take only milliseconds to execute. That was a huge success, since with the previous solution the ageing task was running for minutes, causing table-level locks and leading to drastic underperformance of the whole system.

Why wasn't that the best idea?

It's true that the maintenance improvement was a success. However, it didn't take long to realize that the additional column with rolling integer values, by which the table had been partitioned, came at a price. It made moving the sliding window of data to the next day easy, but writing ad-hoc queries in a way that takes advantage of partitioning is no longer straightforward. One needs to add an extra condition on the partition index column to every query, and each time first establish what today's partition index value is.
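
To illustrate the pain point, here is a sketch of what an ad-hoc query had to look like with the partitionIndex design. The formula for computing today's index is an assumption made up for this example, not necessarily the one from the original solution:
-- Sketch only: the rolling-index formula below (day number modulo 35) is an assumption
-- for illustration; the real design may have computed it differently.
DECLARE @todayIndex int = DATEDIFF(day, '2020-01-01', CAST(SYSUTCDATETIME() AS date)) % 35;

SELECT [name], [value], [referenceDate]
FROM [dbo].[HistoricalData]
WHERE [partitionIndex] = @todayIndex                       -- needed only to hit the right partition
  AND [referenceDate] >= CAST(SYSUTCDATETIME() AS date);   -- the condition we actually care about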

The tradeoff

If I were to do it again, I would certainly not create the additional partitionIndex column. I would rely directly on the datetime2 column. The daily maintenance code would be way more complex, but writing ad-hoc SQL queries would be as easy as possible.
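
For comparison, with the table partitioned directly on referenceDate (as created below), an ad-hoc query for a given day needs nothing more than the natural date predicate; the literal dates here are just placeholders:
-- The date range alone lets SQL Server eliminate partitions - no helper column, no index arithmetic.
SELECT [name], [value], [referenceDate]
FROM [dbo].[HistoricalData]
WHERE [referenceDate] >= '2020-07-05'
  AND [referenceDate] <  '2020-07-06';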

Changes in daily maintenance stored procedure

Instead of simply truncating a single partition, every night we would have to:
  • Truncate the left-most partition
    TRUNCATE TABLE [HistoricalData] WITH (PARTITIONS (1));
  • Split the right-most partition with a new boundary (an additional day)
    ALTER PARTITION SCHEME HistoricalDataScheme NEXT USED [PRIMARY];
    ALTER PARTITION FUNCTION HistoricalDataPF () SPLIT RANGE ('2020-07-03');
  • Merge away the oldest boundary so the emptied partition disappears
    ALTER PARTITION FUNCTION HistoricalDataPF () MERGE RANGE ('2020-05-30');

It definitely makes the maintenance code a bit more complex. The above is just an example with hard-coded dates; to really implement the solution we have to automate the split/merge so that the boundaries are calculated dynamically, which is exactly what the DailyMaintenance procedure below does.

The recap

Let's summarize. We have a table called HistoricalData which, among other columns, has a referenceDate datetime2 column. We want to maintain 35 partitions holding full days of past data plus the current working day.

The goal is to perform daily maintenance within milliseconds and speed up ad-hoc SQL queries.

This time, for the sake of the tutorial, let's create an example table. Then let's create a partition function, a partition scheme based on it, and finally a clustered index placed on that scheme.

CREATE TABLE [dbo].[HistoricalData](
    [id] [int] NOT NULL,
    [name] [nvarchar](50) NOT NULL,
    [value] [int] NOT NULL,
    [referenceDate] [datetime2](7) NOT NULL
) ON [PRIMARY]
GO
CREATE PARTITION FUNCTION HistoricalDataPF (datetime2)
AS RANGE LEFT FOR VALUES (
'2020-06-01', '2020-06-02', '2020-06-03', '2020-06-04', '2020-06-05', '2020-06-06'
,'2020-06-07', '2020-06-08', '2020-06-09', '2020-06-10', '2020-06-11', '2020-06-12', '2020-06-13'
,'2020-06-14', '2020-06-15', '2020-06-16', '2020-06-17', '2020-06-18', '2020-06-19', '2020-06-20'
,'2020-06-21', '2020-06-22', '2020-06-23', '2020-06-24', '2020-06-25', '2020-06-26', '2020-06-27'
,'2020-06-28', '2020-06-29', '2020-06-30', '2020-07-01', '2020-07-02', '2020-07-03', '2020-07-04'
)
GO
CREATE PARTITION SCHEME HistoricalDataScheme
AS PARTITION HistoricalDataPF ALL TO ([PRIMARY])
GO
CREATE CLUSTERED INDEX [cidx] ON [dbo].[HistoricalData]
(
[id] ASC,
[referenceDate] ASC
)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON HistoricalDataScheme(referenceDate)
GO
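
Before wiring up the maintenance, it can be handy to check which partition a given row lands in. The rows below are made up for this tutorial, and the built-in $PARTITION function applies our partition function to a value:
-- A few hypothetical rows - the values are made up for this tutorial.
INSERT INTO [dbo].[HistoricalData] ([id], [name], [value], [referenceDate])
VALUES (1, N'sensor-a', 10, '2020-06-01T10:00:00'),
       (2, N'sensor-a', 20, '2020-06-15T10:00:00'),
       (3, N'sensor-a', 30, '2020-07-04T10:00:00');
GO
-- $PARTITION.HistoricalDataPF(...) returns the partition number a value maps to.
SELECT [id], [referenceDate],
       $PARTITION.HistoricalDataPF([referenceDate]) AS partitionNumber
FROM [dbo].[HistoricalData];
GO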

After that, let's create the daily maintenance procedure that will be executed every night right after midnight.
create procedure DailyMaintenance
as
begin
    declare @dayToMerge as datetime2
    declare @nextDay as datetime2

    -- today's date (UTC) becomes the new right-most boundary
    select @nextDay = cast(cast(sysutcdatetime() as date) as datetime2);
    -- the oldest boundary, 34 days back, will be merged away
    select @dayToMerge = dateadd(day, -34, @nextDay);

    -- empty the left-most partition (the oldest day) - a metadata-only operation
    truncate table [HistoricalData] with (partitions (1));
    -- designate the filegroup for the new partition and add a boundary for the new day
    alter partition scheme HistoricalDataScheme next used [PRIMARY];
    alter partition function HistoricalDataPF () split range (@nextDay);
    -- remove the oldest boundary so the emptied partition disappears
    alter partition function HistoricalDataPF() merge range (@dayToMerge);
end
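
The procedure is meant to run every night right after midnight. One possible way to schedule it, assuming SQL Server Agent is available, is sketched below; the job name, database name and 00:05 UTC start time are assumptions for illustration:
-- Hedged sketch: schedule DailyMaintenance as a nightly SQL Server Agent job.
USE msdb;
GO
EXEC dbo.sp_add_job @job_name = N'HistoricalData daily maintenance';
EXEC dbo.sp_add_jobstep
    @job_name = N'HistoricalData daily maintenance',
    @step_name = N'Run DailyMaintenance',
    @subsystem = N'TSQL',
    @database_name = N'YourDatabase',            -- hypothetical database name
    @command = N'EXEC dbo.DailyMaintenance;';
EXEC dbo.sp_add_jobschedule
    @job_name = N'HistoricalData daily maintenance',
    @name = N'Nightly shortly after midnight',
    @freq_type = 4,                              -- daily
    @freq_interval = 1,
    @active_start_time = 000500;                 -- 00:05:00 (hhmmss)
EXEC dbo.sp_add_jobserver @job_name = N'HistoricalData daily maintenance';
GO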


Let's test the solution. First, let's create a helper procedure to query partition details.
CREATE PROCEDURE PartitionDetails
AS
SELECT OBJECT_NAME(pstats.object_id) AS TableName
     , ps.name AS PartitionSchemeName
     , ds.name AS PartitionFilegroupName
     , pf.name AS PartitionFunctionName
     , prv.value AS PartitionBoundaryValue
     , c.name AS PartitionKey
     , CASE
           WHEN pf.boundary_value_on_right = 0
               THEN c.name + ' > ' +
                    CAST(ISNULL(LAG(prv.value) OVER (PARTITION BY pstats.object_id
                                                     ORDER BY pstats.object_id, pstats.partition_number),
                                'Infinity') AS VARCHAR(100)) +
                    ' and ' + c.name + ' <= ' + CAST(ISNULL(prv.value, 'Infinity') AS VARCHAR(100))
           ELSE c.name + ' >= ' + CAST(ISNULL(prv.value, 'Infinity') AS VARCHAR(100)) +
                ' and ' + c.name + ' < ' +
                CAST(ISNULL(LEAD(prv.value) OVER (PARTITION BY pstats.object_id
                                                  ORDER BY pstats.object_id, pstats.partition_number),
                            'Infinity') AS VARCHAR(100))
       END AS PartitionRange
     , pstats.partition_number AS PartitionNumber
     , pstats.row_count AS PartitionRowCount
FROM sys.dm_db_partition_stats AS pstats
INNER JOIN sys.partitions AS p ON pstats.partition_id = p.partition_id
INNER JOIN sys.destination_data_spaces AS dds ON pstats.partition_number = dds.destination_id
INNER JOIN sys.data_spaces AS ds ON dds.data_space_id = ds.data_space_id
INNER JOIN sys.partition_schemes AS ps ON dds.partition_scheme_id = ps.data_space_id
INNER JOIN sys.partition_functions AS pf ON ps.function_id = pf.function_id
INNER JOIN sys.indexes AS i ON pstats.object_id = i.object_id
                           AND pstats.index_id = i.index_id
                           AND dds.partition_scheme_id = i.data_space_id
                           AND i.type <= 1
INNER JOIN sys.index_columns AS ic ON i.index_id = ic.index_id
                                  AND i.object_id = ic.object_id
                                  AND ic.partition_ordinal > 0
INNER JOIN sys.columns AS c ON pstats.object_id = c.object_id
                           AND ic.column_id = c.column_id
LEFT JOIN sys.partition_range_values AS prv ON pf.function_id = prv.function_id
                                           AND pstats.partition_number = (CASE pf.boundary_value_on_right
                                                                              WHEN 0 THEN prv.boundary_id
                                                                              ELSE (prv.boundary_id + 1)
                                                                          END)
WHERE ps.name = 'HistoricalDataScheme'
ORDER BY TableName, PartitionNumber;
When we execute the procedure, we can see the partitions laid out as expected.

It is the 5th of July as I write this sentence, which makes it a perfect time to execute our daily maintenance procedure. The effect is as expected: the oldest day has been truncated and its partition merged away, and a new partition has been created for future data.
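
For reference, the whole test boils down to these three calls:
EXEC PartitionDetails;   -- initial layout: 35 partitions with the boundaries defined above
EXEC DailyMaintenance;   -- truncate the oldest day, add a boundary for the new day, merge the oldest one away
EXEC PartitionDetails;   -- verify: the oldest partition is gone and an empty one has appeared on the right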


That's pretty much it. There is one thing I haven't mentioned yet: all of our partitions live in a single filegroup, which is not the most effective way of partitioning the data. I didn't pursue that, as my database runs on the AWS RDS service, where I'm not sure whether it is even possible to define multiple filegroups.
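
For completeness, if multiple filegroups were available, the partition scheme could map partitions to them instead of ALL TO ([PRIMARY]). The filegroup names below are hypothetical, and the TO list would need exactly one entry per partition:
-- Hypothetical sketch: spread partitions over dedicated filegroups FG_01 .. FG_35
-- (each filegroup must already exist in the database).
CREATE PARTITION SCHEME HistoricalDataScheme
AS PARTITION HistoricalDataPF
TO ([FG_01], [FG_02], [FG_03], /* ... one per partition ... */ [FG_35]);
-- The nightly maintenance would then point NEXT USED at the filegroup intended for the new day,
-- e.g. ALTER PARTITION SCHEME HistoricalDataScheme NEXT USED [FG_01];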
