Recently, I received a request from management to provide raw system logs to one of our clients for their internal auditing purposes. They will need logs on an ongoing basis and have no logging infrastructure on their end that could receive parsed Graylog events. Effectively, they were looking for 'big text files' of logs.

For security reasons (as well as network configuration constraints), allowing them access to our Graylog instance is a non-starter. So, my journey to get logs out of Graylog began in earnest. I attempted several methods and ended up writing my own GELF listener for events. But, let's see why.

Logstash To The Rescue

Love Logstash. She's one-stop shopping when it comes to receiving, transmuting, filtering, and forwarding logs to a variety of destinations. I use an older version of Logstash to receive SSL-encoded JSON streams, filter them, convert them to GELF, and forward them off to one of my Graylog servers.

I set up a demo Logstash instance, created a Graylog Stream Rule and fired things off. Thus began two days of troubleshooting why the Logstash UDP listener kept dying. As it turns out:

  1. UDP output from Graylog seems to be terminated by an extra NULL character (\0), which Logstash does not expect (and which causes the input to crash in quite a spectacular fashion). There seems to be no way around it, unfortunately. I stopped debugging after I located the bug report on GitHub.
  2. As of this writing, Logstash does not support TCP GELF events. Graylog can send via TCP GELF, but there's nothing that can listen to it. Same story, really: I located the feature request on GitHub after confirming it doesn't work.

Let's Query Graylog Directly

I think I spend more of my time using Graylog's superb API than I do in the web interface itself. I've developed several PHP pages which query the Graylog API directly, and I'm extremely comfortable using it and handling its limitations.

Actually, I'm kidding. There are really no limitations to the API, unless you're querying for all the logs from a noisy server. Once the data returned hops over 25,000 records, you're face-to-face with an unresponsive Graylog server, completely out of memory and flopping around until you restart it and Elasticsearch to regain control.

But if you look at it, you're basically querying one Java instance (Graylog), which just queries Java on another server (Elasticsearch). Graylog formats the results and returns them to you. What happens when you query Elasticsearch directly?

What Happens When You Query Elasticsearch Directly

I've got a love-hate relationship with ES. It's a superb database for high-volume searching, and Graylog hides a lot of the complexity behind the API and wonderful web interface. That being said, talking to ES directly forces one to write monstrous queries like this to get data back:

    # main es search query. _here be dragons_.
    res = es.search(
        body={
            "from": 0,
            "size": myLimit,
            "query": {
                "filtered": {
                    "query": {
                        "query_string": {
                            "query": deviceInsert,
                            "allow_leading_wildcard": "false",
                        }
                    },
                    "filter": {
                        "bool": {
                            "must": {
                                "range": {
                                    "timestamp": {
                                        "from": earlierInsert,
                                        "to": nowInsert,
                                        "include_lower": "true",
                                        "include_upper": "true",
                                    }
                                }
                            }
                        }
                    },
                }
            },
            "sort": [{"timestamp": {"order": "desc"}}],
        },
        request_timeout=dynamicTimeoutValue,
    )

While I was able to make the query configurable by adding a bunch of variables in the relevant places and increasing the timeout values, I was still stuck with the same issue I had with Graylog: slow responses to massive queries.

It's not useful to query an hour's worth of logs from 5 systems if the querying and parsing take a long time to run. Querying also takes memory and CPU from a system that's already accepting a large number of logs.
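
One common way to take some of the sting out of a big range query is to slice the time range into smaller windows and run one search per window. This is only a sketch of the slicing itself, assuming Python datetimes; `time_windows` is a hypothetical helper, not something from my script, and it would not reduce the total load on the cluster, only the size of each response.

```python
from datetime import datetime, timedelta


def time_windows(start, end, step):
    """Yield (window_start, window_end) pairs that cover start..end in step-sized slices."""
    cursor = start
    while cursor < end:
        # The last window is clipped so it never runs past `end`.
        upper = min(cursor + step, end)
        yield cursor, upper
        cursor = upper


# Example: split one hour into ten-minute windows.
hour_start = datetime(2020, 1, 1, 12, 0)
windows = list(time_windows(hour_start, hour_start + timedelta(hours=1), timedelta(minutes=10)))
```

Each pair could then stand in for earlierInsert and nowInsert in the query above. Because that query includes both bounds, an event stamped exactly on a window boundary could come back twice unless one of the bounds is made exclusive.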

My Failures, In Review

So far, I've attempted (and failed) to:

  • Send UDP GELF messages from Graylog (Logstash's UDP input crashes on them due to a software bug)
  • Send TCP GELF messages from Graylog (Logstash has no TCP GELF input to receive them)
  • Bulk query the Graylog API for messages (dismal performance)
  • Bulk query Elasticsearch for messages (slightly less dismal performance, but still poor when you'll be querying for 10k logs an hour from multiple devices)

So, I Wrote My Own Listener

The internet is replete with GELF-sending libraries. I've personally used gelfclient on a number of occasions and find it to be perfect for the task of sending out GELF messages, but there's really nothing written that receives GELF messages.

Thankfully, the GELF format is pretty straightforward to follow:

  • UDP datagrams are usually limited to a size of 8192 bytes.
  • GELF messages can be sent uncompressed, GZIP’d or ZLIB’d
  • It's a crummy JSON string
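
To make those three points concrete, here is a minimal sketch of decoding a single datagram. It assumes unchunked messages only, and `decode_gelf` is a name made up for illustration. The sketch tells the encodings apart by their leading bytes: gzip streams start with 0x1f 0x8b, zlib streams usually start with 0x78, and chunked GELF datagrams start with 0x1e 0x0f.

```python
import gzip
import json
import zlib

GZIP_MAGIC = b"\x1f\x8b"   # first two bytes of any gzip stream
CHUNK_MAGIC = b"\x1e\x0f"  # first two bytes of a chunked GELF datagram


def decode_gelf(datagram: bytes) -> dict:
    """Turn one unchunked GELF datagram into a dict (illustrative helper)."""
    if datagram[:2] == CHUNK_MAGIC:
        # Reassembling chunks is out of scope for this sketch.
        raise ValueError("chunked GELF datagrams are not handled here")
    if datagram[:2] == GZIP_MAGIC:
        payload = gzip.decompress(datagram)
    elif datagram[:1] == b"\x78":
        # 0x78 is the usual first byte of a zlib stream.
        payload = zlib.decompress(datagram)
    else:
        payload = datagram
    # Drop trailing NUL bytes before parsing, since some senders append one.
    return json.loads(payload.rstrip(b"\x00").decode("utf-8"))
```

An uncompressed message begins with `{`, so it falls through to the last branch untouched.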

I wrote up a quick Gist containing a ready-to-run Python script that listens on UDP 12201 and processes any incoming GELF messages. By default, it sends all received events to a fileWriter function, which just writes plain events to disk.
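
For a sense of the shape such a script takes, here is a stripped-down sketch. It is not the Gist itself: the function names `file_writer`, `handle_datagram`, and `listen`, and the output file name `gelf_events.log`, are assumptions made for illustration, and it only handles uncompressed, unchunked messages.

```python
import json
import socket


def file_writer(event: dict, path: str = "gelf_events.log") -> None:
    """Append one event to a text file, one JSON document per line."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(event, sort_keys=True) + "\n")


def handle_datagram(datagram: bytes, sink=file_writer) -> bool:
    """Parse one uncompressed datagram and pass it to sink; return False if unparseable."""
    try:
        # Strip any trailing NUL bytes before parsing the JSON payload.
        event = json.loads(datagram.rstrip(b"\x00").decode("utf-8"))
    except (UnicodeDecodeError, ValueError):
        return False
    sink(event)
    return True


def listen(host: str = "0.0.0.0", port: int = 12201, sink=file_writer) -> None:
    """Block forever, handing every datagram received on host:port to sink."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((host, port))
        while True:
            datagram, _sender = sock.recvfrom(65535)
            handle_datagram(datagram, sink)
```

Calling `listen()` blocks and appends one JSON document per line to the file. Passing a different `sink` is the obvious place to hang any further processing.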

Super basic, but it scratches one particular itch that I had on a work project. There's plenty of room to expand it, so feel free to hack away. Thought I'd pass it along in case anyone else was looking for something similar.

I have some ideas for processing the input further, but I'll leave that one for a future post.