Skip to content

Latest commit

 

History

History
107 lines (81 loc) · 7.29 KB

03c-simple_reshape.asciidoc

File metadata and controls

107 lines (81 loc) · 7.29 KB

Close Encounters of the Reindeer Kind (pt 1)

Let’s process some real data and, with Santa’s modernized gift-handling system in mind, see how Hadoop orchestrates the flow of data from mappers to reducers.

While Santa is busy year-round, his Reindeer spend their multi-month break between holiday seasons pursuing their favorite hobby: UFOlogy (the study of Unidentified Flying Objects and the search for extraterrestrial civilization). So you can imagine how excited they were to learn about the National UFO Reporting Center data set: 60,000 documented UFO sightings for a sharp-nosed reindeer to investigate. They’d like to send a team to investigate each sighting based on the category of spacecraft: one team investigates multiple-craft formations, one investigates fireballs, and so on. So the first thing to do is assign each sighting to the right team. Since sixty thousand sightings is much higher than a reindeer can count (only four hooves!), let’s help them organize the data. (Sixty thousand records is much too small for Hadoop to be justified, but it’s the perfect size to learn with.) //// Tie in. "…​so, similarly, if you were Toyota, and you wanted to record signtings of electric cars (or electric car charging stations in the most populated cities in the United States…​" Amy////

UFO Sighting Data Model

The data model for a UFO sighting includes the data fields taken directly from the National UFO Reporting Center eyewitness reports: date of sighting and of report, location, duration, shape of craft and eye-witness description. Your authors have additionally run the free-text locations — "Merrimac, WI" or "Newark, NJ (south of Garden State Pkwy)" — through a geolocation service to (where possible) produce structured location records with an actual longitude, latitude and so forth.

class SimpleUfoSighting
  include Wu::Model
  field :sighted_at,   Time
  field :reported_at,  Time
  field :shape,        Symbol
  field :duration_str, String
  field :location_str, String
  field :place,        Wu::Geo::Place
  field :description,  String
  #
  field :longitude,  Float
  field :latitude,   Float
  field :quadkey,    String
end

Group the UFOs by Shape

The first request from the reindeer team is to organize the sightings into groups by the shape of craft, and to record how many sightings there are for each shape.

Mapper

In the Chimpanzee&Elephant world, a chimp had the following role:

  • reads and understand each letter

  • creates a new intermediate item having a label (the type of toy) and information about the toy (the work order)

  • hands it to the elephants for delivery to the elf responsible for making that toy type.

We’re going to write a Hadoop "mapper" that performs a similar purpose:

  • reads the raw data and parses it into a structured record

  • creates a new intermediate item having a label (the shape of craft) and information about the sighting (the original record).

  • hands it to Hadoop for delivery to the reducer responsible for that group

The program looks like this:

 	mapper(:count_ufo_shapes) do
  consumes UfoSighting, from: json
  #
  process do |ufo_sighting|      # for each record
    record = 1                   # create a dummy payload,
    label  = ufo_sighting.shape  # label with the shape,
           yield [label, record]        # and send it downstream for processing
  end
end

You can test the mapper on the commandline:

       $ cat ./data/geo/ufo_sightings/ufo_sightings-sample.json   |
    ./examples/geo/ufo_sightings/count_ufo_shapes.rb --map |
    head -n25 | wu-lign
 disk	   1972-06-16T05:00:00Z	1999-03-02T06:00:00Z	Provo (south of), UT     	disk     	several min.   	Str...
 sphere	   1999-03-02T06:00:00Z	1999-03-02T06:00:00Z	Dallas, TX              	sphere  	60 seconds     	Whi...
 triangle  1997-07-03T05:00:00Z	1999-03-09T06:00:00Z	Bochum (Germany),       	triangle	ca. 2min       	Tri...
 light	   1998-11-19T06:00:00Z	1998-11-19T06:00:00Z	Phoenix (west valley), AZ	light   	15mim          	Whi...
 triangle  1999-02-27T06:00:00Z	1999-02-27T06:00:00Z	San Diego, CA            	triangle	10 minutes     	cha...
 triangle  1997-09-15T05:00:00Z	1999-02-17T06:00:00Z	Wedgefield, SC             	triangle	15 min         	Tra...
...

The output is simply the partitioning label (UFO shape), followed by the attributes of the signing, separated by tabs. The framework uses the first field to group/sort by default; the rest is cargo.

Reducer

Just as the pygmy elephants transported work orders to elves' workbenches, Hadoop delivers each record to the 'reducer', the second stage of our job.

   reducer(:count_sightings) do
     def process_group(label, group)
count = 0
group.each do |record|           # on each record,
  count += 1                     #   increment the count
  yield record                   #   re-output the record
end                              #
yield ['# count:', label, count] # at end of group, summarize
     end
   end

The elf at each workbench saw a series of work orders, with the guarantee that a) work orders for each toy type are delivered together and in order; and b) this was the only workbench to receive work orders for that toy type.

Similarly, the reducer receives a series of records, grouped by label, with a guarantee that it is the unique processor for such records. All we have to do here is re-emit records as they come in, then add a line following each group with its count. We’ve put a '#' at the start of the summary lines, which lets you easily filter them.

Test the full mapper-sort-reducer stack from the commandline:

$ cat ./data/geo/ufo_sightings/ufo_sightings-sample.json      |
    ./examples/geo/ufo_sightings/count_ufo_shapes.rb --map    | sort |
    ./examples/geo/ufo_sightings/count_ufo_shapes.rb --reduce | wu-lign
1985-06-01T05:00:00Z	1999-01-14T06:00:00Z	North Tonawanda, NY  	chevron  	1 hr 	7 lights in a chevron shape not sure it was one object lighted or 7 s
1999-01-20T06:00:00Z	1999-01-31T06:00:00Z	Olney, IL            	chevron  	10 seconds        	Stargazing, saw a dimly lit V-shape coming overhaed from west t east,
1998-12-16T06:00:00Z	1998-12-16T06:00:00Z	Lubbock, TX          	chevron  	3 minutes         	Object southbound, displaying three white lights, slowed, hovered, qu
# count:	chevron	3
1999-01-16T06:00:00Z	1999-01-16T06:00:00Z	Deptford, NJ         	cigar    	2 Hours           	An aircraft of some type was seen in the sky with approximately five
# count:	cigar	1
1947-10-15T06:00:00Z	1999-02-25T06:00:00Z	Palmira,             	circle   	1 hour            	After a concert given in the small town of Palmira, Colombia,  a grou
1999-01-10T06:00:00Z	1999-01-11T06:00:00Z	Tyson's Corner, VA   	circle   	1 to 2 sec        	Bright green circularly shaped light moved downward and easterly thro
...

Great! That’s enough for the reindeer to start their research. We’ll come back in a bit and help them plan their summer tour. /// It’s fine to say you’re going to come back to it, but say just a bit more rather than totally leaving this dangling. Give a bit more here in your preview of what’s to come. Or, you could just omit any mention of coming back to this. But I suggest the former. Amy////