Webserver Log Parsing

We’ll represent loglines with the following model definition:

link:code/serverlogs/models/logline--fields.rb[role=include]

Since most of our questions are about what visitors do, we’ll mainly use visitor_id (to identify common requests for a visitor), uri_str (what they requested), requested_at (when they requested it) and referer (the page they clicked on). Don’t worry if you’re not deeply familiar with the rest of the fields in our model — they’ll become clear in context.

Two notes, though. In these explorations, we’re going to use the ip_address field for the visitor_id. That’s good enough if you don’t mind artifacts, like every visitor from the same coffee shop being treated identically. For serious use, though, many web applications assign an identifying "cookie" to each visitor and add it as a custom logline field. Following good practice, we’ve built this model with a visitor_id method that decouples the semantics ("visitor") from the data ("the IP address they happen to have visited from"). Also please note that, though the dictionary blesses the term 'referrer', the early authors of the web used the spelling 'referer' and we’re now stuck with it in this context. ///Here is a great example of supporting with real-world analogies, above where you wrote, "like every visitor from the same coffee shop being treated identically…" Yes! That is the kind of simple, sophisticated tying-together type of connective tissue needed throughout, to greater and lesser degrees. Amy////

Simple Log Parsing

////Help the reader in by writing something quick, short, like, "The core nugget that you should know about simple log parsing is…" Amy////

Webserver logs typically follow some variant of the "Apache Common Log" format — a series of lines describing each web request:

154.5.248.92 - - [30/Apr/2003:13:17:04 -0700] "GET /random/video/Star_Wars_Kid.wmv HTTP/1.0" 206 176250 "-" "Mozilla/4.77 (Macintosh; U; PPC)"

Our first task is to leave that arcane format behind and extract healthy structured models. Since every line stands alone, the parse script is simple as can be: a transform-only script that passes each line to the Logline.parse method and emits the model object it returns.

link:code/serverlogs/parser--processor.rb[role=include]

Star Wars Kid serverlogs

For sample data, we’ll use the webserver logs released by blogger Andy Baio. In 2003, he posted the famous "Star Wars Kid" video, which for several years ranked as the biggest viral video of all time. (It augments a teenager’s awkward Jedi-style fighting moves with the special effects of a real lightsaber.) Here’s his description:

I’ve decided to release the first six months of server logs from the meme’s spread into the public domain — with dates, times, IP addresses, user agents, and referer information. … On April 29 at 4:49pm, I posted the video, renamed to "Star_Wars_Kid.wmv" — inadvertently giving the meme its permanent name. (Yes, I coined the term "Star Wars Kid." It’s strange to think it would’ve been "Star Wars Guy" if I was any lazier.) From there, for the first week, it spread quickly through news site, blogs and message boards, mostly oriented around technology, gaming, and movies. …

This file is a subset of the Apache server logs from April 10 to November 26, 2003. It contains every request for my homepage, the original video, the remix video, the mirror redirector script, the donations spreadsheet, and the seven blog entries I made related to Star Wars Kid. I included a couple weeks of activity before I posted the videos so you can determine the baseline traffic I normally received to my homepage. The data is public domain. If you use it for anything, please drop me a note!

The details of parsing are mostly straightforward — we use a regular expression to pick apart the fields in each line. That regular expression, however, is another story:

link:code/serverlogs/models/logline--parse.rb[role=include]

It may look terrifying, but taken piece-by-piece it’s not actually that bad. Regexp-fu is an essential skill for data science in practice — you’re well advised to walk through it. Let’s do so.

The meat of each line describe the contents to match — \S+ for "a sequence of non-whitespace", \d+ for "a sequence of digits", and so forth. If you’re not already familiar with regular expressions at that level, consult the excellent tutorial at regular-expressions.info.
This is an 'extended-form' regular expression, as requested by the x at the end of it. An extended-form regexp ignores whitespace and treats # as a comment delimiter — constructing a regexp this complicated would be madness otherwise. Be careful to backslash-escape spaces and hash marks.
The \A and \z anchor the regexp to the absolute start and end of the string respectively.
Fields are selected using 'named capture group' syntax: (?<ip_address>\S+). You can retrieve its contents using match[:ip_address], or get all of them at once using captures_hash as we do in the parse method.
Build your regular expressions to be 'good brittle'. If you only expect HTTP request methods to be uppercase strings, make your program reject records that are otherwise. When you’re processing billions of events, those one-in-a-million deviants start occurring thousands of times.

That regular expression does almost all the heavy lifting, but isn’t sufficient to properly extract the requested_at time. Wukong models provide a "security gate" for each field in the form of the receive_(field name) method. The setter method (requested_at=) applies a new value directly, but the receive_requested_at method is expected to appropriately validate and transform the given value. The default method performs simple 'do the right thing'-level type conversion, sufficient to (for example) faithfully load an object from a plain JSON hash. But for complicated cases you’re invited to override it as we do here.

link:/Users/flip/ics/book/big_data_for_chimps/code/serverlogs/models/logline--requested_at.rb[role=include]

There’s a general lesson here for data-parsing scripts. Don’t try to be a hero and get everything done in one giant method. The giant regexp just coarsely separates the values; any further special handling happens in isolated methods.

Test the script in local mode:

link:code/serverlogs/output-parser-map.log[role=include]

Then run it on the full dataset to produce the starting point for the rest of our work:

TODO

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

07a-serverlog_parsing.asciidoc

07a-serverlog_parsing.asciidoc

Webserver Log Parsing

Simple Log Parsing

Files

07a-serverlog_parsing.asciidoc

Latest commit

History

07a-serverlog_parsing.asciidoc

File metadata and controls

Webserver Log Parsing

Simple Log Parsing