
How (not) to store last n-days of historical data in a partitioned table

Some time ago I was asked to improve the way historical data is kept and maintained in an MS SQL Server DB. The requirement was to keep 35 full days of data; every day after midnight the data from the oldest day should be removed. I was asked for help because the data ageing was taking too long and was causing locks on the problematic table, which made the whole system underperform drastically during that time.

At that time, the data was being stored in a single SQL table. The volume was around 160 million records. Data ageing was being done by a simple DELETE FROM statement with a condition on the indexed date column.

The solution seemed pretty obvious to me - I could partition the table by days and then, during the daily maintenance, simply truncate the partition containing the oldest data. Since it's a meta-data operation, it should take only milliseconds.

Because the whole operation had to be done without any downtime, I decided to back up the historical data and create a brand new, empty table with the same column definitions. The historical data could be retrieved on demand from the backup; I wasn't planning to ever restore it. Making an existing table partitioned would require rebuilding the indexes (at least the clustered one). I've checked on the test environment that it would take more than 2 hours. With the Enterprise edition of MS SQL Server, the ONLINE option is available when recreating indexes, but that was not an option here, as the DB was a Standard edition.

I decided not to partition directly over the column with the reference date, which was of the datetime2 type. Instead, I've added an additional column of the tinyint type that would serve as a partition index. I couldn't make it a computed column because you cannot build a partition function on such columns.

I've come up with a function that would map a given reference date to the partitionIndex column value. It is a pretty simple epoch-like function: take the number of days since 29/05/1982, modulo the number of partitions - the 35 required days plus two (one for the current working day and one kept empty for the next day). Although that approach makes the daily maintenance really easy, I realized later that it wasn't the best idea after all. I will explain why in the next article.
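A minimal sketch of what such a function can look like, assuming 35 + 2 = 37 partitions and the epoch date mentioned above (the function name is the real one used later in this post; the body is my reconstruction):

-- Maps a reference date to a rolling partition index in the range 0..36.
-- 37 = 35 days of history + 1 current working day + 1 partition kept empty for the next day.
CREATE FUNCTION dbo.f_getHistoricalDataPartitionIndex (@referenceDate datetime2)
RETURNS tinyint
AS
BEGIN
    RETURN DATEDIFF(DAY, '19820529', @referenceDate) % 37;
END;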

Next, I had to update the stored procedure through which all the inserts are made. I've added a call to f_getHistoricalDataPartitionIndex(referenceDate) as the value of the additional column.
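The change boils down to computing the extra column on the way in. A hypothetical sketch (the procedure name and the payload column are made up; only the function name is real):

-- Hypothetical insert procedure; every row gets its partition index computed from the reference date.
ALTER PROCEDURE dbo.p_insertHistoricalData
    @referenceDate datetime2,
    @payload nvarchar(max)
AS
BEGIN
    INSERT INTO dbo.HistoricalData (referenceDate, payload, partitionIndex)
    VALUES (@referenceDate, @payload, dbo.f_getHistoricalDataPartitionIndex(@referenceDate));
END;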

Now we can introduce a partition function and a partition scheme. For us, it was enough to partition the table over a single filegroup (PRIMARY); the DB is actually an AWS RDS instance and I'm not sure whether it's even possible to add additional files there. Below is a sketch of the definitions, followed by a query you can use to check whether they have been created.
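The DDL could look as follows, assuming 37 partitions (the names pf_historicalData and ps_historicalData are illustrative):

-- RANGE LEFT with boundary values 0..35 produces 37 partitions,
-- one for each partitionIndex value in the range 0..36.
CREATE PARTITION FUNCTION pf_historicalData (tinyint)
AS RANGE LEFT FOR VALUES (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
                          18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35);

-- ALL TO places every partition on the same filegroup.
CREATE PARTITION SCHEME ps_historicalData
AS PARTITION pf_historicalData ALL TO ([PRIMARY]);

And one possible way to verify that both objects exist:

-- Lists the partition function and scheme together with the partition count.
SELECT pf.name AS partition_function,
       ps.name AS partition_scheme,
       pf.fanout AS partition_count
FROM sys.partition_functions pf
JOIN sys.partition_schemes ps ON ps.function_id = pf.function_id;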

We're ready to make the HistoricalData table partitioned. I've done it by creating a clustered index based on the partition scheme. I've also recreated the indexes that existed in the original table, this time based on the partition scheme, and added the partitionIndex column to all of them.
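A sketch of what this can look like (the index names and the extra key columns are illustrative):

-- Creating the clustered index on the partition scheme is what actually partitions the table.
CREATE CLUSTERED INDEX cix_HistoricalData
ON dbo.HistoricalData (referenceDate, partitionIndex)
ON ps_historicalData (partitionIndex);

-- Secondary indexes are kept aligned by building them on the same scheme
-- and adding the partitioning column to the key.
CREATE NONCLUSTERED INDEX ix_HistoricalData_someColumn
ON dbo.HistoricalData (someColumn, partitionIndex)
ON ps_historicalData (partitionIndex);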
You can verify whether everything has been created properly with the following query. Among other details, it shows how many records there are in each partition, which I also found very useful during the tests of the daily maintenance procedure.
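A query along these lines should do (reconstructed from the system views, so treat it as a sketch):

-- Shows row counts and boundary values per partition of HistoricalData.
SELECT p.partition_number,
       p.rows,
       prv.value AS boundary_value
FROM sys.partitions p
JOIN sys.indexes i
    ON i.object_id = p.object_id AND i.index_id = p.index_id
LEFT JOIN sys.partition_schemes ps
    ON ps.data_space_id = i.data_space_id
LEFT JOIN sys.partition_range_values prv
    ON prv.function_id = ps.function_id AND prv.boundary_id = p.partition_number
WHERE p.object_id = OBJECT_ID('dbo.HistoricalData')
  AND p.index_id IN (0, 1)
ORDER BY p.partition_number;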

Finally, we can take advantage of all this effort and implement the daily maintenance procedure, which simply truncates the partition that will be used the next day. In our environment, the procedure is called by a DB job executed every day at 2:00 AM.
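A sketch of such a procedure, assuming the objects defined above (the procedure name is made up; note that TRUNCATE ... WITH (PARTITIONS ...) needs SQL Server 2016 or newer):

-- Empties the partition that tomorrow's data will be written into.
CREATE PROCEDURE dbo.p_ageHistoricalData
AS
BEGIN
    DECLARE @tomorrowIndex tinyint =
        dbo.f_getHistoricalDataPartitionIndex(DATEADD(DAY, 1, SYSDATETIME()));

    -- Partition numbers are 1-based while partitionIndex starts at 0, hence the + 1.
    -- Built dynamically so the partition number can be computed at run time.
    DECLARE @sql nvarchar(max) =
        N'TRUNCATE TABLE dbo.HistoricalData WITH (PARTITIONS ('
        + CAST(@tomorrowIndex + 1 AS nvarchar(10)) + N'));';
    EXEC sp_executesql @sql;
END;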

All the steps that were taken to partition the existing non-partitioned table:

  1. Backup - rename the existing table and use a SELECT INTO query to create a twin empty table without any indexes (a sketch of steps 1 and 2 follows this list)
  2. Add the additional column over which the table will be partitioned
  3. Create a function to retrieve the partition index value based on a given date
  4. Alter the procedure that inserts historical data
  5. Create a partition function and a partition scheme
  6. Create a clustered index over the partition scheme (and recreate all the indexes you previously had)
  7. Alter/Create a procedure to execute daily maintenance - truncate the partition that will be used on the following day.
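A sketch of the first two steps, assuming the backup table name (the WHERE 1 = 0 trick copies the column definitions but no rows, and SELECT INTO never copies indexes or constraints):

-- Step 1: keep the old data under a new name; the application keeps using the original name.
EXEC sp_rename 'dbo.HistoricalData', 'HistoricalData_backup';

SELECT *
INTO dbo.HistoricalData
FROM dbo.HistoricalData_backup
WHERE 1 = 0;

-- Step 2: the new partition-index column; NOT NULL is fine here because the table is empty.
ALTER TABLE dbo.HistoricalData ADD partitionIndex tinyint NOT NULL;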

What to do when you want to keep the data and can afford downtime

  1. Add the additional column over which the table will be partitioned
  2. Create a function to retrieve the partition index value based on a given date
  3. Alter the procedure that inserts historical data
  4. Create a partition function and a partition scheme
  5. Recreate the indexes with the DROP_EXISTING option set to ON, this time based on the partition scheme. You can also try the ONLINE option if you have the Enterprise edition (a sketch follows this list)
  6. Alter/Create a procedure to execute daily maintenance - truncate the partition that will be used on the following day.
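A sketch of the rebuild from step 5, with illustrative index and column names:

-- Rebuilds the existing clustered index in place, moving the table onto the partition scheme.
CREATE CLUSTERED INDEX cix_HistoricalData
ON dbo.HistoricalData (referenceDate, partitionIndex)
WITH (DROP_EXISTING = ON)    -- add ONLINE = ON if you're on the Enterprise edition
ON ps_historicalData (partitionIndex);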

What to do when you don't want to have a clustered index

It is still possible to partition the table. I haven't tried it on any production system, but if you have a unique column (or a unique combination of columns), you could ALTER the table to add a constraint and then DROP it while placing the table over the partition scheme.
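A sketch of that trick, assuming a unique id column (untested, as noted above):

-- Temporarily give the heap a clustered index via a constraint...
ALTER TABLE dbo.HistoricalData
ADD CONSTRAINT uq_HistoricalData_id UNIQUE CLUSTERED (id);

-- ...then drop it and let MOVE TO rebuild the table as a heap on the partition scheme.
ALTER TABLE dbo.HistoricalData
DROP CONSTRAINT uq_HistoricalData_id
WITH (MOVE TO ps_historicalData (partitionIndex));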

What to do when you have an older version of MS SQL Server

Truncating chosen partitions is a feature available since SQL Server 2016. If you have an older version, you can use the SWITCH PARTITION option of the ALTER TABLE statement to achieve immediate data ageing. The idea is simple: create two twin tables on the same partition scheme, with the same columns and indexes, in the same filegroup. Insert your data into one of them and keep the second one empty. During the daily maintenance, just switch the corresponding partitions between the tables and truncate the second table.
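A sketch of a single maintenance run, assuming a twin table named HistoricalData_shadow:

-- Both tables sit on the same partition scheme, so the switch is a metadata-only operation.
DECLARE @oldestPartition int =
    dbo.f_getHistoricalDataPartitionIndex(DATEADD(DAY, 1, SYSDATETIME())) + 1;

ALTER TABLE dbo.HistoricalData
SWITCH PARTITION @oldestPartition
TO dbo.HistoricalData_shadow PARTITION @oldestPartition;

-- The aged rows now live in the shadow table; a plain TRUNCATE removes them instantly.
TRUNCATE TABLE dbo.HistoricalData_shadow;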
