Let’s process some real data and, with Santa’s modernized gift-handling system in mind, see how Hadoop orchestrates the flow of data from mappers to reducers.
While Santa is busy year-round, his reindeer spend their multi-month break between holiday seasons pursuing their favorite hobby: UFOlogy (the study of Unidentified Flying Objects and the search for extraterrestrial civilization). So you can imagine how excited they were to learn about the National UFO Reporting Center data set: 60,000 documented UFO sightings for a sharp-nosed reindeer to investigate. They’d like to send a team to investigate each sighting based on the category of spacecraft: one team investigates multiple-craft formations, one investigates fireballs, and so on. So the first thing to do is assign each sighting to the right team. Since sixty thousand sightings is far more than a reindeer can count (only four hooves!), let’s help them organize the data. (Sixty thousand records is much too small to justify using Hadoop, but it’s the perfect size to learn with.)
The data model for a UFO sighting includes the data fields taken directly from the National UFO Reporting Center eyewitness reports: date of sighting and of report, location, duration, shape of craft, and eyewitness description. Your authors have additionally run the free-text locations, such as "Merrimac, WI" or "Newark, NJ (south of Garden State Pkwy)", through a geolocation service to produce, where possible, structured location records with an actual longitude, latitude and so forth.
    class SimpleUfoSighting
      include Wu::Model
      field :sighted_at,   Time
      field :reported_at,  Time
      field :shape,        Symbol
      field :duration_str, String
      field :location_str, String
      field :place,        Wu::Geo::Place
      field :description,  String
      #
      field :longitude,    Float
      field :latitude,     Float
      field :quadkey,      String
    end
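If you don't have Wukong at hand, the idea behind the model can be sketched in plain Ruby. The following is only an illustration: `SketchSighting` and its `from_json` method are hypothetical names, it covers just a subset of the fields, and it assumes each input line is a JSON object whose keys match the field names (the real file layout may differ). The sample values are borrowed from the Dallas sighting shown later in this section.

```ruby
require 'json'
require 'time'

# Hypothetical stand-in for the Wukong model: a plain Struct carrying a
# subset of the same field names. Not the Wukong API.
SketchSighting = Struct.new(
  :sighted_at, :reported_at, :shape, :duration_str,
  :location_str, :description, keyword_init: true
) do
  # Build a record from one line of JSON, coercing types by hand.
  def self.from_json(line)
    raw = JSON.parse(line)
    new(
      sighted_at:   Time.iso8601(raw.fetch('sighted_at')),
      reported_at:  Time.iso8601(raw.fetch('reported_at')),
      shape:        raw.fetch('shape').to_sym,
      duration_str: raw['duration_str'],
      location_str: raw['location_str'],
      description:  raw['description']
    )
  end
end

# Assumed input layout: one JSON object per line, keys named after the fields.
sample = '{"sighted_at":"1999-03-02T06:00:00Z",' \
         '"reported_at":"1999-03-02T06:00:00Z",' \
         '"shape":"sphere","duration_str":"60 seconds",' \
         '"location_str":"Dallas, TX","description":"Whi..."}'

sighting = SketchSighting.from_json(sample)
puts sighting.shape.inspect    # => :sphere
puts sighting.sighted_at.year  # => 1999
```

The point is simply that a raw line of text becomes a typed record (times as `Time`, shape as a `Symbol`) before any further processing touches it.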
The first request from the reindeer team is to organize the sightings into groups by the shape of craft, and to record how many sightings there are for each shape.
In the Chimpanzee&Elephant world, a chimp had the following role:
- reads and understands each letter
- creates a new intermediate item having a label (the type of toy) and information about the toy (the work order)
- hands it to the elephants for delivery to the elf responsible for making that toy type
We’re going to write a Hadoop "mapper" that serves a similar purpose:
- reads the raw data and parses it into a structured record
- creates a new intermediate item having a label (the shape of craft) and information about the sighting (the original record)
- hands it to Hadoop for delivery to the reducer responsible for that group
The program looks like this:
    mapper(:count_ufo_shapes) do
      consumes UfoSighting, from: json
      #
      process do |ufo_sighting|        # for each record
        record = 1                     # create a dummy payload,
        label  = ufo_sighting.shape    # label with the shape,
        yield [label, record]          # and send it downstream for processing
      end
    end
You can test the mapper on the command line:
    $ cat ./data/geo/ufo_sightings/ufo_sightings-sample.json | ./examples/geo/ufo_sightings/count_ufo_shapes.rb --map | head -n25 | wu-lign
    disk      1972-06-16T05:00:00Z  1999-03-02T06:00:00Z  Provo (south of), UT       disk      several min.  Str...
    sphere    1999-03-02T06:00:00Z  1999-03-02T06:00:00Z  Dallas, TX                 sphere    60 seconds    Whi...
    triangle  1997-07-03T05:00:00Z  1999-03-09T06:00:00Z  Bochum (Germany),          triangle  ca. 2min      Tri...
    light     1998-11-19T06:00:00Z  1998-11-19T06:00:00Z  Phoenix (west valley), AZ  light     15mim         Whi...
    triangle  1999-02-27T06:00:00Z  1999-02-27T06:00:00Z  San Diego, CA              triangle  10 minutes    cha...
    triangle  1997-09-15T05:00:00Z  1999-02-17T06:00:00Z  Wedgefield, SC             triangle  15 min        Tra...
    ...
The output is simply the partitioning label (UFO shape), followed by the attributes of the sighting, separated by tabs. By default, the framework groups and sorts on the first field; the rest is cargo.
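To make the "label first, cargo after" convention concrete, here is a minimal plain-Ruby sketch that does not use Wukong or Hadoop. The helper `map_line` is a hypothetical name, sightings are represented as simple hashes, and an in-memory sort on the first field stands in for what the framework does between the two stages. The values are borrowed from the sample output above.

```ruby
# Hypothetical helper: turn one sighting (a Hash here) into a mapper output
# line. The label goes first; everything after the first tab is cargo.
def map_line(sighting)
  label = sighting[:shape]
  cargo = sighting.values_at(:sighted_at, :location_str, :duration_str)
  [label, *cargo].join("\t")
end

sightings = [
  { shape: 'sphere',   sighted_at: '1999-03-02T06:00:00Z',
    location_str: 'Dallas, TX',           duration_str: '60 seconds' },
  { shape: 'triangle', sighted_at: '1999-02-27T06:00:00Z',
    location_str: 'San Diego, CA',        duration_str: '10 minutes' },
  { shape: 'disk',     sighted_at: '1972-06-16T05:00:00Z',
    location_str: 'Provo (south of), UT', duration_str: 'several min.' },
  { shape: 'triangle', sighted_at: '1997-09-15T05:00:00Z',
    location_str: 'Wedgefield, SC',       duration_str: '15 min' }
]

mapped = sightings.map { |s| map_line(s) }

# Stand-in for the shuffle: order lines by their first field only, so that
# records sharing a label end up next to each other. The cargo is ignored.
shuffled = mapped.sort_by { |line| line.split("\t", 2).first }

puts shuffled
```

After the sort, the two triangle sightings sit side by side even though they were far apart in the input, which is all the next stage needs in order to count them.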
Just as the pygmy elephants transported work orders to elves' workbenches, Hadoop delivers each record to the 'reducer', the second stage of our job.
    reducer(:count_sightings) do
      def process_group(label, group)
        count = 0
        group.each do |record|               # on each record,
          count += 1                         # increment the count
          yield record                       # re-output the record
        end                                  #
        yield ['# count:', label, count]     # at end of group, summarize
      end
    end
The elf at each workbench saw a series of work orders, with the guarantee that a) work orders for each toy type were delivered together and in order; and b) this was the only workbench to receive work orders for that toy type.
Similarly, the reducer receives a series of records, grouped by label, with a guarantee that it is the only reducer to receive records with that label. All we have to do here is re-emit records as they come in, then add a line following each group with its count. We’ve put a '#' at the start of the summary lines, which lets you easily filter them.
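The same logic can be sketched without the Wukong DSL, as a loop over lines that have already been sorted by label. This is an illustration under stated assumptions, not the library's implementation: `reduce_lines` is a hypothetical name, and the input is assumed to be tab-separated text with the label in the first field.

```ruby
# Hypothetical streaming reducer: walks label-sorted, tab-separated lines,
# re-emits the cargo of each one, and appends a '# count:' summary line
# whenever the label changes (and once more at the very end).
def reduce_lines(sorted_lines)
  out = []
  current_label = nil
  count = 0
  sorted_lines.each do |line|
    label, cargo = line.chomp.split("\t", 2)
    if label != current_label
      # A new group begins: close out the previous one, if there was one.
      out << ['# count:', current_label, count].join("\t") unless current_label.nil?
      current_label = label
      count = 0
    end
    count += 1
    out << cargo
  end
  out << ['# count:', current_label, count].join("\t") unless current_label.nil?
  out
end

sorted = [
  "chevron\tNorth Tonawanda, NY\t1 hr",
  "chevron\tOlney, IL\t10 seconds",
  "chevron\tLubbock, TX\t3 minutes",
  "cigar\tDeptford, NJ\t2 Hours"
]

result = reduce_lines(sorted)
puts result
```

Notice that the loop only ever remembers the current label and a running count. That works precisely because of the grouping guarantee: once the label changes, no more records for the old label will arrive.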
Test the full mapper-sort-reducer stack from the command line:
$ cat ./data/geo/ufo_sightings/ufo_sightings-sample.json | ./examples/geo/ufo_sightings/count_ufo_shapes.rb --map | sort | ./examples/geo/ufo_sightings/count_ufo_shapes.rb --reduce | wu-lign
    1985-06-01T05:00:00Z  1999-01-14T06:00:00Z  North Tonawanda, NY  chevron  1 hr        7 lights in a chevron shape not sure it was one object lighted or 7 s
    1999-01-20T06:00:00Z  1999-01-31T06:00:00Z  Olney, IL            chevron  10 seconds  Stargazing, saw a dimly lit V-shape coming overhaed from west t east,
    1998-12-16T06:00:00Z  1998-12-16T06:00:00Z  Lubbock, TX          chevron  3 minutes   Object southbound, displaying three white lights, slowed, hovered, qu
    # count: chevron 3
    1999-01-16T06:00:00Z  1999-01-16T06:00:00Z  Deptford, NJ         cigar    2 Hours     An aircraft of some type was seen in the sky with approximately five
    # count: cigar 1
    1947-10-15T06:00:00Z  1999-02-25T06:00:00Z  Palmira,             circle   1 hour      After a concert given in the small town of Palmira, Colombia, a grou
    1999-01-10T06:00:00Z  1999-01-11T06:00:00Z  Tyson's Corner, VA   circle   1 to 2 sec  Bright green circularly shaped light moved downward and easterly thro
    ...
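Because the summary lines begin with '#', separating them from the records takes only a few lines. The sketch below is hypothetical: `shape_counts` is an illustrative name, and it assumes you are working with the raw tab-separated reducer output (before `wu-lign` pads the columns with spaces). The sample lines are abbreviated from the output above.

```ruby
# Hypothetical post-processing: pull the '# count:' lines out of reducer
# output and build a shape => count table. Assumes tab-separated fields.
def shape_counts(lines)
  lines.grep(/\A# count:/).each_with_object({}) do |line, counts|
    _marker, shape, count = line.chomp.split("\t")
    counts[shape] = Integer(count)
  end
end

reducer_output = [
  "1998-12-16T06:00:00Z\t1998-12-16T06:00:00Z\tLubbock, TX\tchevron\t3 minutes",
  "# count:\tchevron\t3",
  "1999-01-16T06:00:00Z\t1999-01-16T06:00:00Z\tDeptford, NJ\tcigar\t2 Hours",
  "# count:\tcigar\t1"
]

counts  = shape_counts(reducer_output)
records = reducer_output.reject { |line| line.start_with?('#') }

puts counts.inspect   # => {"chevron"=>3, "cigar"=>1}
puts records.length   # => 2
```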
Great! That’s enough for the reindeer to start their research. We’ll come back in a bit and help them plan their summer tour.