Until recently, I was living in a world where calling the System.exit() method was an inelegant but 100% effective, bulletproof and ultimate way to shut down a multi-threaded Java application. Over the years, in many Java projects, I've seen a quite similar approach to shutting down the application. First, we try to shut down gracefully by finishing or interrupting all the threads, which is, by the way, not a trivial task in a large application; then, after a timeout is reached, we give up and call the System.exit() method, because we want to be sure that the JVM is stopped.
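A minimal sketch of that pattern, assuming a plain ExecutorService and a made-up 30-second timeout (none of this is code from the application described below), might look like this:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// A sketch of the "graceful first, System.exit() as the last resort" pattern.
public class GracefulShutdownSketch {

    private final ExecutorService workers = Executors.newFixedThreadPool(4);

    public void shutdown() {
        workers.shutdown(); // stop accepting new tasks, let running ones finish
        try {
            if (!workers.awaitTermination(30, TimeUnit.SECONDS)) {
                workers.shutdownNow(); // timeout reached: interrupt the stragglers
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        // We give up on elegance and make sure the JVM is really stopped.
        System.exit(0);
    }
}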
The exact same approach was applied in the application I was recently refactoring. I had been doing a massive redesign: I touched about three hundred classes and changed many of them drastically. A day or two after one of the huge merges, a bug was discovered: shutting down the application took ages.
I checked the logs and compared them with the source code. The whole flow was pretty straightforward, and I was sure System.exit() had been called:
2019-07-25T08:09:56,840+0000 [Thread-17] INFO TerminatingServiceImpl: Shutting down the engine..
2019-07-25T08:09:56,840+0000 [Thread-17] INFO HealthCheckServiceImpl: Shutting down health check service.
2019-07-25T08:09:56,840+0000 [Thread-17] INFO WatchdogPingServiceImpl: Shutting down watchdog ping service.
2019-07-25T08:09:56,856+0000 [pool-17-thread-1] INFO BrowserMonitoringModule: Shutting down BrowserMonitoringModule issued.
2019-07-25T08:09:56,856+0000 [pool-17-thread-1] INFO TestFetcher: FetchExecutor was interrupted.
2019-07-25T08:09:56,856+0000 [Thread-3] INFO PoolManager: Queue.take Interrupted.
2019-07-25T08:09:56,856+0000 [pool-17-thread-1] INFO AwsS3CloudFileManager: shutting down AWS S3 uploader
2019-07-25T08:09:56,856+0000 [Thread-17] INFO TerminatingServiceImpl: Shutting down Jetty..
2019-07-25T08:09:56,856+0000 [Thread-17] INFO TerminatingServiceImpl: Calling system exiting service..
2019-07-25T08:09:56,856+0000 [Thread-17] INFO SystemExitingServiceImpl: Shutting down common services..
2019-07-25T08:09:56,856+0000 [Thread-17] INFO SystemExitingServiceImpl: Shutting down beacon availability service..
2019-07-25T08:09:56,856+0000 [Thread-17] INFO SystemExitingServiceImpl: Shutting down http client pooling factory..
2019-07-25T08:09:56,856+0000 [Thread-17] INFO SystemExitingServiceImpl: Shutting down chrome dump dir watcher..
2019-07-25T08:09:56,856+0000 [Thread-17] INFO SystemExitingServiceImpl: Calling system exit with code 23..
It didn't cause the JVM to stop. I could see that after the call there were still lines being logged by one of the periodic workers. The app kept running for ten more minutes, not doing much, until it was finally killed by an outside wrapper process (thank god):
2019-07-25T08:09:56,871+0000 [StatusService-Thread-1] WARN ApacheHttpRequestRetryHandler: Retrying request after exception Socket operation on nonsocket: configureBlocking, retry count: 1
2019-07-25T08:10:06,590+0000 [pool-3-thread-1] INFO PropertiesRestLoader: Loading properties from REST source.
2019-07-25T08:10:06,590+0000 [pool-3-thread-1] WARN ApacheHttpRequestRetryHandler: Retrying request after exception Software caused connection abort: recv failed, retry count: 1
2019-07-25T08:10:10,684+0000 [pool-3-thread-1] ERROR PropertiesRestLoader: Failed to get configuration from cluster due to error: Connect to localhost:9999 [localhost/127.0.0.1, localhost/0:0:0:0:0:0:0:1] failed: Connection refused: connect
(...)
2019-07-25T08:10:14,779+0000 [pool-3-thread-1] ERROR PropertiesRestLoader: Failed to get configuration from cluster due to error: Connect to localhost:9999 [localhost/127.0.0.1, localhost/0:0:0:0:0:0:0:1] failed: Connection refused: connect
(...)
2019-07-25T08:16:10,185+0000 [pool-3-thread-1] ERROR PropertiesRestLoader: Failed to get configuration from cluster due to error: Connect to localhost:9999 [localhost/127.0.0.1, localhost/0:0:0:0:0:0:0:1] failed: Connection refused: connect
I started to google for possible reasons why a System.exit() call might not be doing its job. I didn't find the answer, which later turned out to be the reason I decided to start a blog and write an article about it.
I had to take a closer look at the process of shutting down the application. The trouble is, the application is sometimes deployed as a Windows service and sometimes as a Linux systemd service, so there are a few paths used in production to shut down the application:
- Shutting down the Windows service
- Killing the Linux process
- Using a REST endpoint
- Letting the application shut itself down when its self-diagnostics fail
Because of the different paths, and the fact that some very important cleanup needs to be done before the shutdown, we had introduced a terminating service that could be used by a variety of components. During the recent refactoring, I spread the usage of that service all over the code: I removed all the similar ad-hoc code and replaced it with a call to the terminating service.
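For the first two paths the terminating service is reached through a JVM shutdown hook (as it later turned out, this detail matters). A hypothetical sketch of such wiring, with names of my own invention rather than the application's, could look like this:

// Hypothetical wiring sketch: the service-stop and kill paths reach the
// terminating service through a JVM shutdown hook. TerminatingService and
// the hook's thread name are my own inventions, not the application's code.
interface TerminatingService {
    void terminate();
}

public class ShutdownHookRegistration {

    public static void register(TerminatingService terminatingService) {
        // Runs when the JVM begins its shutdown sequence, e.g. on
        // SIGTERM (Linux) or a Windows service stop/restart.
        Runtime.getRuntime().addShutdownHook(new Thread(
                terminatingService::terminate,
                "terminating-service-hook"));
    }
}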
The algorithm there was pretty simple:
public void terminate() {
    log.info("Shutting down the engine..");
    doSomethingReallyImportant();
    log.info("Shutting down common services..");
    shutdownCommonServices();
    log.info("Calling system exit with code {}..", WrapperCodes.RESTART_CODE.getExitCode());
    System.exit(WrapperCodes.RESTART_CODE.getExitCode());
}
I had seen all three log lines in the log file, and I couldn't understand why the JVM was still running after the System.exit() call.
I came up with an answer when I tried to reproduce the issue. I realized that in this case the person who had found the bug had been restarting the Windows service, which caused a shutdown hook to be called, which in turn called the terminating service. It turns out that System.exit() actually delegates the call to Runtime.exit(), which first runs all the shutdown hooks (and the finalizers, if finalization-on-exit is enabled) and only then terminates the JVM, unlike its sibling, the Runtime.halt() method, which just forcibly stops the JVM.
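The difference between the two is easy to see in a toy program (my own illustration, not code from the application):

// Toy illustration of System.exit() vs Runtime.halt(); not application code.
public class ExitVsHalt {

    public static void main(String[] args) {
        Runtime.getRuntime().addShutdownHook(
                new Thread(() -> System.out.println("shutdown hook ran")));

        System.exit(0);                  // runs the hook first: "shutdown hook ran" is printed
        // Runtime.getRuntime().halt(0); // would stop the JVM immediately, skipping the hook
    }
}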
Although I knew that was it, I still didn't understand why, instead of recursive shutdown hook calls, I could see only a single System.exit() call in the logs. It all became clear when I checked the code of java.lang.ApplicationShutdownHooks#runHooks. Adding a shutdown hook really means adding a java.lang.Thread instance that is started when the hooks are run, and the start method in the Thread class is marked as synchronized. That's why there was only a single entry in the log file. Now everything was clear. The lesson learned here: never call the System.exit() method from a shutdown hook.
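One way to apply that lesson, sketched here under my own assumptions rather than taken from the application, is to run the cleanup on every path but guard the explicit exit so it never happens inside the JVM's shutdown sequence:

import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical fix sketch: the cleanup runs on every shutdown path, but
// System.exit() is only ever called outside a shutdown hook. All names
// here are my own, not the application's.
public class SafeTerminatingService {

    private final AtomicBoolean cleanedUp = new AtomicBoolean(false);

    public SafeTerminatingService() {
        // The hook performs cleanup only; the JVM is already going down,
        // so calling System.exit() from here would hang the shutdown,
        // as described above.
        Runtime.getRuntime().addShutdownHook(
                new Thread(this::cleanup, "terminating-service-hook"));
    }

    // Called from the REST endpoint or the self-diagnostics path.
    public void terminateAndExit(int exitCode) {
        cleanup();
        System.exit(exitCode); // safe here: we are not inside a shutdown hook
    }

    private void cleanup() {
        if (!cleanedUp.compareAndSet(false, true)) {
            return; // cleanup already performed on another path
        }
        // doSomethingReallyImportant();
        // shutdownCommonServices();
    }
}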