In the previous post, I tried to retrace the thinking process behind redesigning the way historical data is stored in the relational database. The goal was to speed up the daily maintenance task that ages data: dropping the oldest portion of the data would now take only milliseconds to execute. That was a huge success, since with the previous solution the ageing task ran for minutes, causing table-level locks and leading to drastic underperformance of the whole system.
Why was that not the best idea?
It's true that the maintenance improvement was a success. But it wasn't long before I realized that the additional column of rolling integer values, by which the table had been partitioned, was a mixed blessing. It made moving the sliding window of data forward by a day easy, but writing ad-hoc queries so that they take advantage of partitioning is no longer straightforward. One needs to add an extra condition on the partition index column to every query, and each time first establish what today's partition index value is.
The tradeoff
If I were to do it again, I would certainly not create an additional partitionIndex column. I would partition directly on the datetime2 column. The daily maintenance code becomes noticeably more complex, but writing ad-hoc SQL queries stays as easy as possible.
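To illustrate the tradeoff, here is a hypothetical sketch; the column names and values are my illustration of the two approaches, not the original schema:

```sql
-- With the extra partitionIndex column, every ad-hoc query needs the index
-- predicate, and "today's" index value has to be established first:
SELECT *
FROM dbo.HistoricalData
WHERE partitionIndex = 17            -- which value is "today"? must be computed
  AND referenceDate >= '2024-07-05';

-- Partitioned directly on the datetime2 column, the natural date predicate
-- alone lets the optimizer eliminate partitions:
SELECT *
FROM dbo.HistoricalData
WHERE referenceDate >= '2024-07-05'
  AND referenceDate <  '2024-07-06';
```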
Changes in daily maintenance stored procedure
Instead of simply truncating a single partition, every night we would have to:
- Truncate the left-most partition
- Split the right-most partition with a new boundary (an additional day)
- Merge away the oldest boundary, collapsing the now-empty partition into its neighbour
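Assuming a partition function named pfHistoricalData and a scheme named psHistoricalData (names and boundary dates are my illustration), the three steps could look roughly like this on SQL Server 2016 or later:

```sql
-- Step 1: truncate the left-most (oldest) partition.
-- The WITH (PARTITIONS ...) clause requires SQL Server 2016+.
TRUNCATE TABLE dbo.HistoricalData WITH (PARTITIONS (1));

-- Step 2: add a boundary for the next day on the right-most side.
ALTER PARTITION SCHEME psHistoricalData NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pfHistoricalData() SPLIT RANGE ('2024-07-06');

-- Step 3: remove the oldest boundary, merging the emptied partition
-- into its neighbour.
ALTER PARTITION FUNCTION pfHistoricalData() MERGE RANGE ('2024-07-01');
```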
It definitely makes the maintenance code more complex. The steps above are only an outline; to really implement this solution we would have to automate the split/merge so that the boundaries are calculated dynamically.
The recap
Let's summarize. We have a table called HistoricalData which, among other columns, has a referenceDate datetime2 column. We want to maintain 35 partitions: some full days of past data plus the current working day.
The goal is to perform daily maintenance within milliseconds and speed up ad-hoc SQL queries.
For the sake of this tutorial, let's create an example table. Then let's create a partition function, a partition scheme, and a clustered index based on that scheme.
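A minimal sketch of that setup; all object names (pfHistoricalData, psHistoricalData, cixHistoricalData), columns and boundary dates are illustrative assumptions:

```sql
-- RANGE RIGHT function directly on the datetime2 column: each boundary
-- is the first instant of a day and belongs to the partition on its right.
CREATE PARTITION FUNCTION pfHistoricalData (datetime2)
AS RANGE RIGHT FOR VALUES ('2024-07-01', '2024-07-02', '2024-07-03', '2024-07-04');

-- All partitions in one filegroup, as discussed later in the post.
CREATE PARTITION SCHEME psHistoricalData
AS PARTITION pfHistoricalData ALL TO ([PRIMARY]);

CREATE TABLE dbo.HistoricalData
(
    id            bigint IDENTITY(1, 1) NOT NULL,
    referenceDate datetime2             NOT NULL,
    payload       nvarchar(200)         NULL
);

-- Building the clustered index on the partition scheme is what
-- partitions the table itself.
CREATE CLUSTERED INDEX cixHistoricalData
    ON dbo.HistoricalData (referenceDate)
    ON psHistoricalData (referenceDate);
```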
After that, let's create a daily maintenance procedure to be executed each night right after midnight.
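One possible shape of that procedure, with the boundaries calculated dynamically. This is a sketch under the naming assumptions above, not the exact original code; it assumes a RANGE RIGHT function on referenceDate:

```sql
CREATE OR ALTER PROCEDURE dbo.HistoricalDataDailyMaintenance
AS
BEGIN
    SET NOCOUNT ON;

    -- Oldest boundary of the partition function: the one we will merge away.
    DECLARE @oldestBoundary datetime2;
    SELECT @oldestBoundary = MIN(CAST(prv.value AS datetime2))
    FROM sys.partition_range_values AS prv
    JOIN sys.partition_functions    AS pf ON pf.function_id = prv.function_id
    WHERE pf.name = N'pfHistoricalData';

    -- New boundary: midnight at the start of tomorrow.
    DECLARE @newBoundary datetime2 =
        DATEADD(DAY, 1, CAST(CAST(SYSDATETIME() AS date) AS datetime2));

    -- Truncate the left-most partition (SQL Server 2016+ syntax).
    TRUNCATE TABLE dbo.HistoricalData WITH (PARTITIONS (1));

    -- Split on the right, merge on the left.
    ALTER PARTITION SCHEME psHistoricalData NEXT USED [PRIMARY];
    ALTER PARTITION FUNCTION pfHistoricalData() SPLIT RANGE (@newBoundary);
    ALTER PARTITION FUNCTION pfHistoricalData() MERGE RANGE (@oldestBoundary);
END;
```

In production this would also want error handling and a guard against running twice in one day, but the core split/merge automation is the point here.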
Let's test the solution. First, let's create a helper procedure to query partition details.
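A sketch of such a helper, built on the system catalog views; the procedure name is my invention, and the boundary join assumes RANGE RIGHT (boundary N is the lower bound of partition N + 1):

```sql
CREATE OR ALTER PROCEDURE dbo.GetHistoricalDataPartitions
AS
BEGIN
    SET NOCOUNT ON;

    SELECT p.partition_number,
           p.rows,
           prv.value AS lower_boundary   -- NULL for the left-most partition
    FROM sys.partitions AS p
    JOIN sys.indexes    AS i
      ON i.object_id = p.object_id AND i.index_id = p.index_id
    JOIN sys.partition_schemes AS ps
      ON ps.data_space_id = i.data_space_id
    LEFT JOIN sys.partition_range_values AS prv
      ON prv.function_id = ps.function_id
     AND prv.boundary_id = p.partition_number - 1
    WHERE p.object_id = OBJECT_ID(N'dbo.HistoricalData')
      AND i.index_id IN (0, 1)           -- heap or clustered index only
    ORDER BY p.partition_number;
END;
```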
It is the 5th of July as I write this sentence, a perfect time to execute our daily maintenance procedure. The effect is as expected: the oldest day has been truncated and its partition merged away, and a new partition has been created for future data.
That's pretty much it. There is one thing I've never mentioned: our partitions all live in a single filegroup, which is not the most effective way of partitioning data. I didn't worry about that because my database runs on AWS RDS, where I'm not sure it is even possible to define multiple filegroups.