Hunting Down Stale Devices

If you run a logging server with dozens (if not hundreds or thousands) of devices logging to it, you may run into an issue with stale devices. Stale devices are just that: devices that have been configured to log to your server and, for whatever reason, have stopped logging altogether. A stale device should be logging, but isn't.

It's not a big problem with a few dozen devices. If you're responsible for the care and feeding of hundreds of logging endpoints, though, tracking them down turns into a pain.

I wrote a script for my work Graylog servers which automates identifying stale devices and alerting me about them on a weekly basis. The script runs on the local Graylog server via cron. I also have a Stream set up which emails me whenever it receives new entries. Here's how I put it all together.

Setting Up The Script

Like my previous scripts, this one is written in Python. You'll need a couple of extra libraries to make it work. I used Python 2.7.6 on OS X and it runs without any issue. The extra libraries are gelfclient and requests.

Installing the Libraries

Run the following on the device where you'll be running the Python script. Installing the libraries correctly requires root / Administrator access.

pip install gelfclient
pip install requests
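
If you want to confirm the install worked before going any further, a quick check along these lines should do. This is only a sketch and not part of the script itself; the helper name missing_modules is made up for illustration.

```python
import importlib


def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            # The import failed, so the library is not available here.
            missing.append(name)
    return missing


if __name__ == "__main__":
    # An empty list means both libraries are importable.
    print(missing_modules(["gelfclient", "requests"]))
```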

Configuring the Script

I've copied the script up as a Gistfile. It should be relatively self-explanatory, but I'll go through the highlights, just in case.

Running the Script

  1. When the script runs, it queries the Graylog server for a list of all devices which have been logging to it in the past 24 hours.
  2. Next, it queries the Graylog server for a list of all devices which have been logging to it over the past week.
  3. If there is no difference between the two, it quietly exits and waits for its next run.
  4. If there is a difference, a list of the sources that logged during the past week but not during the past 24 hours is generated.
  5. That list is sent to the Graylog server with a source named stale_devices.
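
The comparison at the heart of these steps boils down to a set difference. Here's a minimal sketch, assuming the two source lists have already been pulled out of Graylog. The function name find_stale_sources and the sample hostnames are made up for illustration and are not taken from the script.

```python
def find_stale_sources(week_sources, day_sources):
    """Return sources seen in the past week but not in the past 24 hours."""
    # Anything in the weekly list that is missing from the daily list
    # went quiet at some point during the week.
    stale = set(week_sources) - set(day_sources)
    return sorted(stale)


if __name__ == "__main__":
    # Hypothetical hostnames, for illustration only.
    week = ["fw01", "sw-core", "web01", "web02"]
    day = ["fw01", "web01"]
    print(find_stale_sources(week, day))
```

An empty result corresponds to the "no difference" case, where there is nothing to report.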

Graylog Alerting

From here, it's pretty easy to create a new stream that matches source:stale_devices. You can then configure alerting on the stream as you see fit and receive alerts for new entries.
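
For the stream rule to match, the message has to arrive with its source set to stale_devices. Graylog takes a GELF message's host field as the source. Here's a rough sketch of building and sending such a message using only the standard library rather than gelfclient. The UDP port 12201 (the usual GELF default, but check your input), the _stale_sources field name, and the function names are my assumptions, and chunking of large messages is not handled.

```python
import json
import socket
import time


def build_stale_message(stale_sources):
    """Build a GELF 1.1 payload whose source is 'stale_devices'."""
    return {
        "version": "1.1",
        "host": "stale_devices",  # Graylog shows this field as the source
        "short_message": "%d stale device(s) found" % len(stale_sources),
        "full_message": "\n".join(stale_sources),
        "timestamp": time.time(),
        "level": 4,  # syslog severity: warning
        "_stale_sources": ", ".join(stale_sources),  # hypothetical extra field
    }


def send_gelf_udp(payload, server, port=12201):
    """Send one uncompressed GELF message over UDP (no chunking)."""
    data = json.dumps(payload).encode("utf-8")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(data, (server, port))
    finally:
        sock.close()
```

You would call it with something like send_gelf_udp(build_stale_message(["sw-core", "web02"]), "graylog.example.com"), where the hostnames are placeholders for your own.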

In my setup, I have a small cron job which runs every Friday morning, so I get to review and remediate any iffy devices before the weekend happens.