Hunting Down Stale Devices

If you run a logging server with dozens (if not hundreds or thousands) of devices logging to it, you may run into an issue with stale devices. Stale devices are just that: devices that have been configured to log to your server and, for whatever reason, have stopped logging altogether. A stale device should be logging, but isn't.

It's not a big problem with a few dozen devices. If you're responsible for the care and feeding of hundreds of logging endpoints, though, tracking them down turns into a pain.

I wrote a script for my work Graylog servers which automates identifying stale devices and alerting me about them on a weekly basis. The script runs on the local Graylog server via cron. I also have a Stream set up which emails me whenever it receives new entries. Here's how I put it all together.

Setting Up The Script

Like my previous scripts, this one is written in Python. You'll need a couple of extra libraries to make it work. I used Python 2.7.6 on OS X and it runs without any issue. The extra libraries are:

Gelfclient - Simple GELF library for Python used to transmit messages to Graylog
Requests - We'll be using the superb Requests library to query the Graylog API for information

Installing the Libraries

You'll need to run the following on the machine that will run the Python script. Installing the libraries correctly requires root / Administrator access.

pip install gelfclient
pip install requests

Configuring the Script

I've posted the script as a Gist. It should be relatively self-explanatory, but I'll go through the highlights, just in case.

myGraylogServer - IP address or FQDN of the Graylog server you will want to query
myGraylogUsername - Username of the account you want to perform queries with
myGraylogPassword - So far, so good. You got this
nowTime - Think of this as the recent window: how close to now do I want to see sources from? A device that has logged within this window counts as healthy. I run my script on a weekly basis and set nowTime to the past 24 hours (86400 seconds).
previousTime - Same idea, but for the earliest period I want to see sources from. I went with one week here, so previousTime is 604800 seconds.
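
As a rough illustration, the settings above might look something like this near the top of the script. The variable names come from the list above; the server name and credentials are placeholders I made up, so substitute your own.

```python
# Placeholder connection details (hypothetical values, not from the real script).
myGraylogServer = "graylog.example.com"
myGraylogUsername = "api_user"
myGraylogPassword = "change_me"

# Both windows are expressed in seconds, counted back from "now".
nowTime = 24 * 60 * 60            # 86400: devices seen in the past 24 hours
previousTime = 7 * 24 * 60 * 60   # 604800: devices seen in the past week
```

Spelling the values out as multiplications makes it easier to change the windows later, for example to compare the past hour against the past day.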

Running the Script

When the script runs, it queries the Graylog server for a list of all devices which have been logging to it in the past 24 hours.
Next, it queries the Graylog server for a list of all devices which have been logging to it over the past week.
If there is no difference between the two, it exits quietly and waits to be run again.
If there is a difference, then a list of the sources missing from the past 24 hours is generated.
That list is sent to the Graylog server with a source named stale_devices.
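
The steps above boil down to a set difference. Here's a minimal sketch of that logic, not the Gist itself. The function names are mine, and I'm assuming the Graylog API exposes a sources endpoint that takes a range in seconds and returns a "sources" mapping of name to message count; check the API browser on your own server, since the path and port vary between Graylog versions. The GELF port (12201/UDP) is also an assumption about how your input is configured.

```python
def find_stale(weekly_sources, recent_sources):
    """Return sources seen over the week but absent from the recent window."""
    return sorted(set(weekly_sources) - set(recent_sources))


def fetch_sources(base_url, username, password, range_seconds):
    """Ask Graylog which sources logged in the last range_seconds.

    Assumption: base_url + "/sources" accepts a "range" parameter and returns
    JSON containing a "sources" mapping of source name to message count.
    """
    import requests  # imported here so the comparison logic has no dependencies

    response = requests.get(
        base_url + "/sources",
        params={"range": range_seconds},
        auth=(username, password),
        headers={"Accept": "application/json"},
    )
    response.raise_for_status()
    return response.json()["sources"]


def send_stale_report(server, stale, port=12201):
    """Send the stale list to a GELF UDP input with source "stale_devices"."""
    from gelfclient import UdpClient

    gelf = UdpClient(server, port=port, source="stale_devices")
    gelf.log("Stale devices: " + ", ".join(stale))


# Offline demonstration with made-up data shaped like the assumed response.
weekly = {"fw01": 1200, "sw02": 40, "ap03": 7}
recent = {"fw01": 180, "sw02": 5}
stale = find_stale(weekly, recent)  # ap03 logged this week but not in the last day
```

In a real run you would call fetch_sources twice, once with previousTime and once with nowTime, and only call send_stale_report when find_stale returns a non-empty list.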

Graylog Alerting

From here, it's pretty easy to create a new stream that matches source:stale_devices. You can configure alerting on the stream as you see fit and receive alerts whenever new entries arrive.

In my setup, I have a small cron job which runs every Friday morning, so I get to review and remediate any iffy devices before the weekend.
