You can learn a lot about your site’s audience in aggregate by mapping IP addresses to geolocation. Not just in itself, but joined against other datasets, like census data, store locations, weather and time. [1]
Maxmind makes their GeoLite IP-to-geo database available under an open license (CC-BY-SA)[2]. Out of the box, its columns are beg_ip
, end_ip
, location_id
, where the first two columns show the low and high ends (inclusive) of a range that maps to that location. Every address lies in at most one range; locations may have multiple ranges.
This arrangement caters to range queries in a relational database, but isn’t suitable for our needs. A single IP-geo block can span thousands of addresses.
To get the right locality, take each range and break it at some block level. Instead of having 1.2.3.4
to 1.2.5.6
on one line, let’s use the first three quads (first 24 bits) and emit rows for 1.2.3.4
to 1.2.3.255
, 1.2.4.0
to 1.2.4.255
, and 1.2.5.0
to 1.2.5.6
. This lets us use the first segment as the partition key, and the full ip address as the sort key.
lines bytes description file 15_288_766 1_094_541_688 24-bit partition key maxmind-geolite_city-20121002.tsv 2_288_690 183_223_435 16-bit partition key maxmind-geolite_city-20121002-16.tsv 2_256_627 75_729_432 original (not denormalized) GeoLiteCity-Blocks.csv
////Gently introduce the concept. "So, here’s what range queries are all about, in a nutshell…" Amy////
This is a generally-applicable approach for doing range queries.
-
Choose a regular interval, fine enough to avoid skew but coarse enough to avoid ballooning the dataset size.
-
Whereever a range crosses an interval boundary, split it into multiple records, each filling or lying within a single interval.
-
Emit a compound key of
[interval, join_handle, beg, end]
, where-
interval
is -
join_handle
identifies the originating table, so that records are grouped for a join (this is what ensures If the interval is transparently a prefix of the index (as it is here), you can instead just ship the remainder:[interval, join_handle, beg_suffix, end_suffix]
.
-
-
Use the
In the geodata section, the "quadtile" scheme is (if you bend your brain right) something of an extension on this idea — instead of splitting ranges on regular intervals, we’ll split regions on a regular grid scheme.