Under certain circumstances, services within your application can go completely offline. This can be caused by errors within code or a service’s resources being overwhelmed. This doc outlines how to diagnose what took a service offline and what can be done to fix it. After reading it, you should be familiar with:
- Where to look to diagnose service issues
- How to get services up and running after they go offline
Check Your App’s Logs & Service Stats
The first places to look when a service goes offline are your app’s logs and service stats. These can provide valuable clues as to why a service has gone offline.
Application logs are accessed in your dashboard by clicking on “Logs” in the top nav. Here, you’ll see all errors output by your app. Anything from missing dependencies to maxed-out instance connections could cause a service to go completely offline. In any case, errors should show up in your logs and provide clues as to what happened and what should be done to fix it. The Log Management doc has more information about logs.
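As a rough illustration of triaging log output, the sketch below groups log lines by likely cause. The log line format, the sample lines, and the keyword patterns are all assumptions made for this example; they are not the platform's actual log format, so adjust them to match what your app emits.

```python
import re
from collections import defaultdict

# Hypothetical patterns; tune these to the messages your own app logs.
PATTERNS = {
    "missing dependency": re.compile(r"no such file|ModuleNotFoundError|cannot load", re.I),
    "connection limit": re.compile(r"too many connections|max(imum)? connections", re.I),
    "out of memory": re.compile(r"out of memory|memory limit|\bOOM\b", re.I),
    "fatal exception": re.compile(r"fatal|uncaught exception", re.I),
}


def triage(lines):
    """Group log lines under the first pattern they match."""
    findings = defaultdict(list)
    for line in lines:
        for cause, pattern in PATTERNS.items():
            if pattern.search(line):
                findings[cause].append(line)
                break  # one cause per line keeps the summary simple
    return dict(findings)


if __name__ == "__main__":
    # Made-up sample lines, for illustration only.
    sample = [
        "web1 [error] Fatal error: uncaught exception in worker",
        "db1 [error] Too many connections",
        "web1 [info] GET /health 200",
    ]
    for cause, hits in triage(sample).items():
        print(f"{cause}: {len(hits)} line(s)")
```

Lines that match no pattern are ignored, so the summary only shows entries that point at one of the listed causes.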
Service stats provide insight into what resources your service is using and has used. If any of the displayed metrics gets overwhelmed, it will likely cause the service to go offline. If you see that one or more of the metrics were overwhelmed before the service went offline, the cause could be any number of things, such as increased traffic or a memory leak. Your logs may provide additional clues about the cause.
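To make the "overwhelmed before the service went offline" check concrete, here is a minimal sketch. It assumes you have copied stat readings out of the dashboard into a list of samples; the `Sample` structure, the metric names, the 90% threshold, and the 15-minute window are hypothetical choices for illustration, not values the platform defines.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    minute: int    # minutes since the start of the observation window
    metric: str    # e.g. "ram", "cpu" (illustrative names)
    used: float    # amount in use at that moment
    limit: float   # the service's limit for that metric


def saturated_before(samples, offline_minute, threshold=0.9, window=15):
    """Return the metrics that reached `threshold` of their limit during
    the `window` minutes leading up to `offline_minute`."""
    hot = set()
    for s in samples:
        in_window = offline_minute - window <= s.minute < offline_minute
        if in_window and s.limit > 0 and s.used / s.limit >= threshold:
            hot.add(s.metric)
    return sorted(hot)


if __name__ == "__main__":
    # Made-up readings; the service is assumed to go offline at minute 60.
    readings = [
        Sample(50, "ram", 480, 512),
        Sample(55, "ram", 510, 512),
        Sample(55, "cpu", 35, 100),
        Sample(61, "ram", 0, 512),  # after the outage, usage reads as zero
    ]
    print(saturated_before(readings, offline_minute=60))
```

Only readings taken before the outage are considered. The near-zero reading after the service goes down says nothing about what caused the outage.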
Low Resource Usage for Offline Services
When a service goes offline, its resource usage drops and will appear to be “in the green” in your dashboard. This does not mean the service is healthy. The service is not running, and action needs to be taken to bring it back online.
Immediate Steps for Recovery
When a service goes offline, there are three things you can do immediately to try to bring it back online: Restart, Reboot, and Repair. Each can be accessed by clicking the service in your dashboard to expand its details. They should be attempted in the following order. If one doesn’t bring the service back online, proceed to the next.
Keep in mind that, depending on why the service went down, these actions may not bring it back online, and even if they do, the service may be forced offline again. For example, a web service that is being overwhelmed by traffic can be brought back online, but unless it is scaled, it will keep going offline.
"Restart" restarts all of the running processes on a service's instance(s).
"Reboot" shuts down and reboots a service's instance(s), recovering each of the service's processes.
"Repair" provisions new instances based on settings in your Boxfile. For code services, your currently deployed commit is used to create the new instance(s). For data services, all data is migrated to the new instance.
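The escalation order described above (Restart, then Reboot, then Repair) can be sketched as a simple loop. In this doc the three actions are triggered from the dashboard, so the callables and the `is_online` check below are hypothetical stand-ins that only illustrate the order of operations; they do not correspond to any real API.

```python
from typing import Callable, Optional, Sequence, Tuple

Action = Tuple[str, Callable[[], None]]


def recover(actions: Sequence[Action], is_online: Callable[[], bool]) -> Optional[str]:
    """Try each action in order. Return the name of the first action after
    which the service is online, or None if none of them worked."""
    for name, run in actions:
        run()
        if is_online():
            return name
    return None


if __name__ == "__main__":
    # Stand-ins for the dashboard actions; in this demo only "repair" succeeds.
    state = {"online": False}

    def restart() -> None:
        pass

    def reboot() -> None:
        pass

    def repair() -> None:
        state["online"] = True

    steps = [("restart", restart), ("reboot", reboot), ("repair", repair)]
    print(recover(steps, lambda: state["online"]))
```

Stopping at the first action that works keeps the recovery as light as possible: a new instance is only provisioned when restarting and rebooting have both failed.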
Other Actions You May Need to Take
Depending on the cause of the service going offline, there may be other things you need to do to make sure the service stays online.
You May Need to Change Your Code
Your app’s logs and stats will tell you if you need to adjust your code in any way. Common issues that require a code change are missing dependencies, fatal exceptions, and memory leaks.
You May Need to Scale
If your app is getting hit hard and specific services are being overwhelmed and going offline, you need to scale. The How & When to Scale doc outlines different scaling strategies and explains which are the most effective.
If you have any questions, suggestions, or corrections, let us know.