System.exit() not working?

Until recently, I lived in a world where calling System.exit() was a not-very-elegant but 100% effective, bulletproof and ultimate way to shut down a multi-threaded Java application. Over the years, in many Java projects, I've seen a very similar approach to shutting down an application: first, we try to shut down gracefully by finishing or interrupting all the threads, which is by the way not a trivial task in a large application; then, after a timeout is reached, we give up and call System.exit(), because we want to be sure the JVM is stopped.

The exact same approach was applied in the application I was recently refactoring. I had been doing a massive redesign, touching about three hundred classes and changing many of them drastically. A day or two after one of the huge merges, a bug was discovered: shutting down the application took ages.

I checked the logs and compared them with the source code. The whole flow was pretty straightforward and I was sure System.exit() had been called, yet it didn't stop the JVM. I could see that after the call there were still lines being logged by one of the periodic workers. The app kept running for 10 more minutes, not doing much, until it was finally killed by an outside wrapper process (thank god).
I started to google for possible reasons why a System.exit() call might not do its job. I didn't find the answer, which later turned out to be the reason I decided to start a blog and write an article about it.

I had to take a closer look at the process of shutting down the application. The trouble is, the application is sometimes deployed as a Windows service and sometimes as a Linux systemd service, so there are a few shutdown paths used in production:
  • Shutting down the Windows service
  • Killing the Linux process
  • Using a REST endpoint
  • The application shutting itself down when its self-diagnostics fail
Because of the different paths, and because some very important cleanup needs to be done before the shutdown, we introduced a terminating service that can be used by a variety of components. During the recent refactoring, I spread the usage of the service all over the code: I removed all the similar ad-hoc code and replaced it with a call to the terminating service.

The algorithm there was pretty simple:
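The original snippet isn't reproduced in this extract, but judging by the description it boiled down to something like the sketch below. All class, method and log-message names here are my assumptions (java.util.logging stands in for whatever logging API the application used); the important parts are the three log lines around the cleanup, the final System.exit() call, and the shutdown hook that routes the service-stop and process-kill paths into the same method.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical reconstruction - class, method and message names are assumed.
public final class TerminatingService {

    private static final Logger LOG = Logger.getLogger(TerminatingService.class.getName());
    private final AtomicBoolean terminating = new AtomicBoolean();

    /** Called once at startup. A Windows service stop or a Linux kill (SIGTERM)
     *  makes the JVM run its shutdown hooks, so those paths end up here as well. */
    public void registerShutdownHook() {
        Runtime.getRuntime().addShutdownHook(
                new Thread(this::terminate, "terminating-service-hook"));
    }

    /** Also called directly by the REST endpoint and the self-diagnostics check. */
    public void terminate() {
        if (!terminating.compareAndSet(false, true)) {
            return;                                  // cleanup already done elsewhere
        }
        LOG.info("Shutdown requested");              // log line 1
        try {
            performCleanup();                        // the "very important cleanup"
            LOG.info("Cleanup finished");            // log line 2
        } catch (Exception e) {
            LOG.log(Level.SEVERE, "Cleanup failed", e);
        }
        LOG.info("Calling System.exit()");           // log line 3
        System.exit(0);                              // make sure the JVM really stops
    }

    private void performCleanup() {
        // application-specific cleanup: flush state, close connections, stop workers, ...
    }
}
```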

I could see all three lines in the log file, and I couldn't understand why the JVM was still running after the System.exit() call.

I came up with an answer when I tried to reproduce the issue. I realized that in this case the person who had found the bug was restarting the Windows service, which caused a shutdown hook to be called, which in turn called the terminating service. It turns out that System.exit() actually delegates to Runtime.exit(), which first runs all the shutdown hooks and the finalizers and only then terminates the JVM, unlike its sibling, Runtime.halt(), which just forcibly stops the JVM.
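The difference is easy to see in isolation. In the toy example below (my own code, not from the application), the hook's message is printed when the JVM stops via System.exit(); if you swap in the commented-out Runtime.getRuntime().halt(0) call instead, the process dies immediately and the hook never runs.

```java
public class ExitVsHalt {

    public static void main(String[] args) {
        Runtime.getRuntime().addShutdownHook(
                new Thread(() -> System.out.println("shutdown hook ran")));

        System.exit(0);                   // runs the hook above, then stops the JVM
        // Runtime.getRuntime().halt(0); // stops the JVM at once; the hook never runs
    }
}
```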

Although I knew that was it, I still didn't understand why, instead of recurring shutdown hook calls, I could see only a single System.exit() call in the logs. It all became clear when I checked the code of java.lang.ApplicationShutdownHooks#runHooks and java.lang.Shutdown. Adding a shutdown hook really means registering a java.lang.Thread instance that is started when the hooks are run; the shutdown sequence then holds a lock for its whole duration while it waits for all the hook threads to finish. A System.exit() call made from one of those hooks blocks forever on that very lock, waiting for a shutdown that will never complete, which is why it showed up in the log exactly once. Now everything was clear. The lesson learned here is to never call System.exit() from a shutdown hook.
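The hang itself is just as easy to reproduce. In the standalone sketch below (again my own code), the first System.exit() starts the shutdown sequence and waits for the hook to finish, while the hook's own System.exit() blocks forever on the shutdown that is already in progress, so the hook's message appears exactly once and the process never exits until it is killed from the outside.

```java
public class ExitFromShutdownHook {

    public static void main(String[] args) {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            System.out.println("hook: calling System.exit()");
            System.exit(1);                    // blocks forever: a shutdown is already in progress
            System.out.println("hook: never reached");
        }, "exiting-hook"));

        System.out.println("main: calling System.exit()");
        System.exit(0);                        // runs the hook above, then waits for it... forever
    }
}
```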

TL;DR

If a System.exit() call does not seem to work for you, make sure you are not calling it from a shutdown hook thread.
