
How to store last n-days of historical data in a partitioned table - second attempt

In the previous post, I tried to reproduce the thinking process behind redesigning the way historical data is stored in the relational database. The goal was to speed up the daily maintenance task that ages data: dropping the oldest day of data would now take only milliseconds to execute. That was a huge success, since with the previous solution the ageing task was running for minutes, causing table-level locks and leading to drastic underperformance of the whole system.

Why wasn't that the best idea?

It's true that the maintenance improvement was a success. However, it didn't take long to realize that the additional column with rolling integer values, by which the table had been partitioned, came at a price. It made moving the sliding window of data to the next day easy, but writing ad-hoc queries in a way that takes advantage of partitioning is no longer straightforward. One needs to add an extra condition on the partition index column to every query, and each time first establish what today's partition index value is.
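
To illustrate the pain point, here is a sketch of what an ad-hoc query had to look like with the partitionIndex design. The formula for computing today's index is an assumption made up for this example, not necessarily the one from the original solution:
-- Sketch only: the rolling-index formula below (day number modulo 35) is an assumption
-- for illustration; the real design may have computed it differently.
DECLARE @todayIndex int = DATEDIFF(day, '2020-01-01', CAST(SYSUTCDATETIME() AS date)) % 35;

SELECT [name], [value], [referenceDate]
FROM [dbo].[HistoricalData]
WHERE [partitionIndex] = @todayIndex                       -- needed only to hit the right partition
  AND [referenceDate] >= CAST(SYSUTCDATETIME() AS date);   -- the condition we actually care about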

The tradeoff

If I were to do it again, I would certainly not create the additional partitionIndex column. I would rely directly on the datetime2 column. The daily maintenance code would be way more complex, but writing ad-hoc SQL queries would be as easy as possible.
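
For comparison, with the table partitioned directly on referenceDate (as created below), an ad-hoc query for a given day needs nothing more than the natural date predicate; the literal dates here are just placeholders:
-- The date range alone lets SQL Server eliminate partitions - no helper column, no index arithmetic.
SELECT [name], [value], [referenceDate]
FROM [dbo].[HistoricalData]
WHERE [referenceDate] >= '2020-07-05'
  AND [referenceDate] <  '2020-07-06';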

Changes in daily maintenance stored procedure

Instead of simply truncating a single partition, every night we would have to:
  • Truncate the left-most partition
    TRUNCATE TABLE [HistoricalData] WITH (PARTITIONS (1));
  • Split the right-most partition with a new boundary (an additional day)
    ALTER PARTITION SCHEME HistoricalDataScheme NEXT USED [PRIMARY];
    ALTER PARTITION FUNCTION HistoricalDataPF () SPLIT RANGE ('2020-07-03');
  • Merge away the oldest boundary so the emptied partition disappears
    ALTER PARTITION FUNCTION HistoricalDataPF () MERGE RANGE ('2020-05-30');

It definitely makes the maintenance code a bit more complex. The above is just an example with hard-coded dates; to really implement the solution we have to automate the split/merge so that the boundaries are calculated dynamically, which is exactly what the DailyMaintenance procedure below does.

The recap

Let's summarize. We have a table called HistoricalData which, among other columns, has a referenceDate datetime2 column. We want to maintain 35 partitions holding full days of past data plus the current working day.

The goal is to perform daily maintenance within milliseconds and speed up ad-hoc SQL queries.

This time, for the sake of the tutorial, let's create an example table. Then let's create a partition function, a partition scheme based on it, and finally a clustered index placed on that scheme.

CREATE TABLE [dbo].[HistoricalData](
    [id] [int] NOT NULL,
    [name] [nvarchar](50) NOT NULL,
    [value] [int] NOT NULL,
    [referenceDate] [datetime2](7) NOT NULL
) ON [PRIMARY]
GO
CREATE PARTITION FUNCTION HistoricalDataPF (datetime2)
AS RANGE LEFT FOR VALUES (
'2020-06-01', '2020-06-02', '2020-06-03', '2020-06-04', '2020-06-05', '2020-06-06'
,'2020-06-07', '2020-06-08', '2020-06-09', '2020-06-10', '2020-06-11', '2020-06-12', '2020-06-13'
,'2020-06-14', '2020-06-15', '2020-06-16', '2020-06-17', '2020-06-18', '2020-06-19', '2020-06-20'
,'2020-06-21', '2020-06-22', '2020-06-23', '2020-06-24', '2020-06-25', '2020-06-26', '2020-06-27'
,'2020-06-28', '2020-06-29', '2020-06-30', '2020-07-01', '2020-07-02', '2020-07-03', '2020-07-04'
)
GO
CREATE PARTITION SCHEME HistoricalDataScheme
AS PARTITION HistoricalDataPF ALL TO ([PRIMARY])
GO
CREATE CLUSTERED INDEX [cidx] ON [dbo].[HistoricalData]
(
[id] ASC,
[referenceDate] ASC
)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON HistoricalDataScheme(referenceDate)
GO
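
Before wiring up the maintenance, it can be handy to check which partition a given row lands in. The rows below are made up for this tutorial, and the built-in $PARTITION function applies our partition function to a value:
-- A few hypothetical rows - the values are made up for this tutorial.
INSERT INTO [dbo].[HistoricalData] ([id], [name], [value], [referenceDate])
VALUES (1, N'sensor-a', 10, '2020-06-01T10:00:00'),
       (2, N'sensor-a', 20, '2020-06-15T10:00:00'),
       (3, N'sensor-a', 30, '2020-07-04T10:00:00');
GO
-- $PARTITION.HistoricalDataPF(...) returns the partition number a value maps to.
SELECT [id], [referenceDate],
       $PARTITION.HistoricalDataPF([referenceDate]) AS partitionNumber
FROM [dbo].[HistoricalData];
GO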

After that, let's create the daily maintenance procedure that will be executed every night right after midnight.
create procedure DailyMaintenance
as
begin
    declare @dayToMerge as datetime2
    declare @nextDay as datetime2

    -- today's date (UTC) becomes the new right-most boundary
    select @nextDay = cast(cast(sysutcdatetime() as date) as datetime2);
    -- the oldest boundary, 34 days back, will be merged away
    select @dayToMerge = dateadd(day, -34, @nextDay);

    -- empty the left-most partition (the oldest day) - a metadata-only operation
    truncate table [HistoricalData] with (partitions (1));
    -- designate the filegroup for the new partition and add a boundary for the new day
    alter partition scheme HistoricalDataScheme next used [PRIMARY];
    alter partition function HistoricalDataPF () split range (@nextDay);
    -- remove the oldest boundary so the emptied partition disappears
    alter partition function HistoricalDataPF() merge range (@dayToMerge);
end
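
The procedure is meant to run every night right after midnight. One possible way to schedule it, assuming SQL Server Agent is available, is sketched below; the job name, database name and 00:05 UTC start time are assumptions for illustration:
-- Hedged sketch: schedule DailyMaintenance as a nightly SQL Server Agent job.
USE msdb;
GO
EXEC dbo.sp_add_job @job_name = N'HistoricalData daily maintenance';
EXEC dbo.sp_add_jobstep
    @job_name = N'HistoricalData daily maintenance',
    @step_name = N'Run DailyMaintenance',
    @subsystem = N'TSQL',
    @database_name = N'YourDatabase',            -- hypothetical database name
    @command = N'EXEC dbo.DailyMaintenance;';
EXEC dbo.sp_add_jobschedule
    @job_name = N'HistoricalData daily maintenance',
    @name = N'Nightly shortly after midnight',
    @freq_type = 4,                              -- daily
    @freq_interval = 1,
    @active_start_time = 000500;                 -- 00:05:00 (hhmmss)
EXEC dbo.sp_add_jobserver @job_name = N'HistoricalData daily maintenance';
GO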


Let's test the solution. First, let's create a helper procedure to query partition details.
CREATE PROCEDURE PartitionDetails
AS
SELECT OBJECT_NAME(pstats.object_id) AS TableName
     , ps.name AS PartitionSchemeName
     , ds.name AS PartitionFilegroupName
     , pf.name AS PartitionFunctionName
     , prv.value AS PartitionBoundaryValue
     , c.name AS PartitionKey
     , CASE
           WHEN pf.boundary_value_on_right = 0
               THEN c.name + ' > ' +
                    CAST(ISNULL(LAG(prv.value) OVER (PARTITION BY pstats.object_id
                                                     ORDER BY pstats.object_id, pstats.partition_number),
                                'Infinity') AS VARCHAR(100)) +
                    ' and ' + c.name + ' <= ' + CAST(ISNULL(prv.value, 'Infinity') AS VARCHAR(100))
           ELSE c.name + ' >= ' + CAST(ISNULL(prv.value, 'Infinity') AS VARCHAR(100)) +
                ' and ' + c.name + ' < ' +
                CAST(ISNULL(LEAD(prv.value) OVER (PARTITION BY pstats.object_id
                                                  ORDER BY pstats.object_id, pstats.partition_number),
                            'Infinity') AS VARCHAR(100))
       END AS PartitionRange
     , pstats.partition_number AS PartitionNumber
     , pstats.row_count AS PartitionRowCount
FROM sys.dm_db_partition_stats AS pstats
INNER JOIN sys.partitions AS p ON pstats.partition_id = p.partition_id
INNER JOIN sys.destination_data_spaces AS dds ON pstats.partition_number = dds.destination_id
INNER JOIN sys.data_spaces AS ds ON dds.data_space_id = ds.data_space_id
INNER JOIN sys.partition_schemes AS ps ON dds.partition_scheme_id = ps.data_space_id
INNER JOIN sys.partition_functions AS pf ON ps.function_id = pf.function_id
INNER JOIN sys.indexes AS i ON pstats.object_id = i.object_id
                           AND pstats.index_id = i.index_id
                           AND dds.partition_scheme_id = i.data_space_id
                           AND i.type <= 1
INNER JOIN sys.index_columns AS ic ON i.index_id = ic.index_id
                                  AND i.object_id = ic.object_id
                                  AND ic.partition_ordinal > 0
INNER JOIN sys.columns AS c ON pstats.object_id = c.object_id
                           AND ic.column_id = c.column_id
LEFT JOIN sys.partition_range_values AS prv ON pf.function_id = prv.function_id
                                           AND pstats.partition_number = (CASE pf.boundary_value_on_right
                                                                              WHEN 0 THEN prv.boundary_id
                                                                              ELSE (prv.boundary_id + 1)
                                                                          END)
WHERE ps.name = 'HistoricalDataScheme'
ORDER BY TableName, PartitionNumber;
When we execute the procedure, we can see the partitions laid out as expected.

It is the 5th of July as I write this sentence, which makes it a perfect time to execute our daily maintenance procedure. The effect is as expected: the oldest day has been truncated and its partition merged away, and a new partition has been created for future data.
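
For reference, the whole test boils down to these three calls:
EXEC PartitionDetails;   -- initial layout: 35 partitions with the boundaries defined above
EXEC DailyMaintenance;   -- truncate the oldest day, add a boundary for the new day, merge the oldest one away
EXEC PartitionDetails;   -- verify: the oldest partition is gone and an empty one has appeared on the right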


That's pretty much it. There is one thing I haven't mentioned yet: all of our partitions live in a single filegroup, which is not the most effective way of partitioning the data. I didn't pursue that, as my database runs on the AWS RDS service, where I'm not sure whether it is even possible to define multiple filegroups.
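
For completeness, if multiple filegroups were available, the partition scheme could map partitions to them instead of ALL TO ([PRIMARY]). The filegroup names below are hypothetical, and the TO list would need exactly one entry per partition:
-- Hypothetical sketch: spread partitions over dedicated filegroups FG_01 .. FG_35
-- (each filegroup must already exist in the database).
CREATE PARTITION SCHEME HistoricalDataScheme
AS PARTITION HistoricalDataPF
TO ([FG_01], [FG_02], [FG_03], /* ... one per partition ... */ [FG_35]);
-- The nightly maintenance would then point NEXT USED at the filegroup intended for the new day,
-- e.g. ALTER PARTITION SCHEME HistoricalDataScheme NEXT USED [FG_01];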
