Materialized view for play. Query demolition.
ATTN: This introduces a breaking change. The `team` field can no longer
be used in the `play` method. Instead, you should use the new
`play_player` method to select individual player statistics belonging to
a specific team.
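
As a rough before/after sketch of that migration (assuming the usual
`nfldb.connect`/`nfldb.Query` entry points and nfldb's result helpers;
the criteria here are made up):

    import nfldb

    db = nfldb.connect()

    # Before this commit, a team filter could go through Query.play:
    #
    #     q = nfldb.Query(db)
    #     q.game(season_year=2013, season_type='Regular')
    #     q.play(team='NE')          # no longer allowed
    #
    # After this commit, team (and player_id) criteria belong to the
    # new play_player method instead:
    q = nfldb.Query(db)
    q.game(season_year=2013, season_type='Regular')
    q.play_player(team='NE')
    for pp in q.as_play_players():
        print(pp)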

Otherwise, there are very few public-facing changes, but the entire
guts of `nfldb.Query` have been ripped out and replaced with more
robust SQL generation code. Moreover, several idiosyncrasies have been
fixed and some unit tests have finally been added.

1. Previously, the `Query` class did some very clever things to perform
   parts of a JOIN in Python code. The general flow was that filtering
   was applied to find primary keys---never using any JOINs---and once
   all criteria had been applied, those ids were used in a simple SELECT
   to fetch the actual rows.

   Now all of that cruft has been removed and replaced with intelligent
   SQL generation that constructs one query with all the proper JOINs.
   For whatever reason, the JOIN approach seemed slower when I
   experimented with it when I first started nfldb. Perhaps my indexes
   weren't configured properly then. In any case, I can't see much of a
   performance difference now.

2. The SQL generation code is very smart. Although it is not part of
   nfldb's public API, I imagine it would be very useful if you had some
   special needs. See the unexported but documented `nfldb.sql` module.

3. Many idiosyncrasies resulting from doing a join in Python are now
   completely gone. For example, if you applied a `sort` with a `limit`
   to complex search criteria, you were bound to get wrong answers. Say
   you sorted by both a column on the `play` table (like `down`) and a
   column on `play_player` (like `passing_tds`), and then applied a
   limit: the results would be completely wonky, because the pure
   Python join couldn't cope with that efficiently. A regular SQL join?
   Piece of cake. (See the first sketch after this list.)

4. I have added a materialized view, `agg_play`. That is a fancy name
   for "a table that automatically keeps itself up to date." In
   essence, whenever a new row is added to `play_player`, the aggregate
   statistics for that play are re-computed. This makes adding data
   slower (something that doesn't happen very often), but it makes
   querying data much faster and easier. For example, plays can be
   queried by `passing_yds` without ever joining with `play_player`
   (a join that is awkward because of the one-to-many relationship).

   To reflect this clearer separation of concerns, the `Query.play`
   method no longer accepts criteria that hit the `play_player` table.
   Instead, if you really want the `play_player` table, use the new
   `play_player` method (see the second sketch after this list). The
   only fields that `play` used to accept but no longer allows are
   `team` and `player_id`, because there is no sensible way to
   aggregate those values into a single play.

   To the best of my knowledge, that is the only possible breaking
   change here.
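
To make (1) and (3) concrete, here is a sketch of a query that used to
produce wrong answers and should now behave. It leans on nfldb's usual
query helpers (`sort`, `limit`, `as_plays`) and the `__ge` field
suffix as I recall them; the particular criteria are arbitrary.
Everything below is compiled into a single SQL statement with the
appropriate JOINs rather than being stitched together in Python.

    import nfldb

    db = nfldb.connect()

    # Sort on a play column (down) and a statistical category
    # (passing_tds) at the same time, with a limit. The old
    # Python-side join produced wonky results for this; a single
    # generated SQL query does not.
    q = nfldb.Query(db)
    q.game(season_year=2013, season_type='Regular')
    q.play(passing_tds__ge=1)
    q.sort([('down', 'asc'), ('passing_tds', 'desc')])
    q.limit(10)
    for play in q.as_plays():
        print(play)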
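
And a sketch of what (4) buys you in practice: play-level filtering on
aggregated statistics, with `play_player` reserved for per-player (or
per-team) detail. The statistical category names are real nfldb
fields; the values are made up and the result helpers are used as I
recall the query API.

    import nfldb

    db = nfldb.connect()

    # Plays can now be filtered on aggregate statistics directly;
    # under the hood this reads the new agg_play table rather than
    # joining with play_player.
    q = nfldb.Query(db)
    q.game(season_year=2013, week=1)
    q.play(passing_yds__ge=40)       # 40+ yard passing plays
    for play in q.as_plays():
        print(play)

    # When you need to know which player (or team) produced the stat,
    # use the new play_player method.
    q = nfldb.Query(db)
    q.game(season_year=2013, week=1)
    q.play_player(receiving_yds__ge=40)
    for pp in q.as_play_players():
        print(pp)
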
BurntSushi committed Jul 22, 2014
1 parent 895ec91 commit 124f659
Showing 8 changed files with 1,211 additions and 1,038 deletions.
2 changes: 1 addition & 1 deletion nfldb/__init__.py
@@ -66,7 +66,7 @@
from nfldb.query import Query, QueryOR
from nfldb.team import standard_team
from nfldb.types import __pdoc__ as __types_pdoc__
from nfldb.types import select_columns, stat_categories
from nfldb.types import stat_categories
from nfldb.types import Category, Clock, Enums, Drive, FieldPosition, Game
from nfldb.types import Play, Player, PlayPlayer, PossessionTime, Team
from nfldb.version import __pdoc__ as __version_pdoc__
124 changes: 120 additions & 4 deletions nfldb/db.py
@@ -15,12 +15,9 @@

import nfldb.team

_SHOW_QUERIES = False
"""When set, all queries will be printed to stderr."""

__pdoc__ = {}

api_version = 6
api_version = 7
__pdoc__['api_version'] = \
"""
The schema version that this library corresponds to. When the schema
@@ -29,6 +26,15 @@
anything else.
"""

_SHOW_QUERIES = False
"""When set, all queries will be printed to stderr."""

_NUM_QUERIES = 0
"""
The total number of queries executed. Only updated when _SHOW_QUERIES
is true.
"""

_config_home = os.getenv('XDG_CONFIG_HOME')
if not _config_home:
home = os.getenv('HOME')
@@ -294,6 +300,9 @@ def __enter__(self):
if _SHOW_QUERIES:
class _ (object):
def execute(self, *args, **kwargs):
global _NUM_QUERIES

_NUM_QUERIES += 1
c.execute(*args, **kwargs)
print(c.query, file=sys.stderr, end='\n\n')

@@ -786,3 +795,110 @@ def _migrate_6(c):
ALTER TABLE game ADD CONSTRAINT game_week_check
CHECK (week >= 0 AND week <= 25);
''')

def _migrate_7(c):
from nfldb.types import _player_categories

print('''
MIGRATING DATABASE... PLEASE WAIT
THIS WILL ONLY HAPPEN ONCE.
This is currently adding a play aggregation table (a materialized view) derived
from the `play` and `play_player` tables. Depending on your machine, this
should take less than two minutes (this includes aggregating the data and
adding indexes).
This aggregation table will automatically update itself when data is added or
changed.
''', file=sys.stderr)

c.execute('''
CREATE TABLE agg_play (
gsis_id gameid NOT NULL,
drive_id usmallint NOT NULL,
play_id usmallint NOT NULL,
%s,
PRIMARY KEY (gsis_id, drive_id, play_id),
FOREIGN KEY (gsis_id, drive_id, play_id)
REFERENCES play (gsis_id, drive_id, play_id)
ON DELETE CASCADE,
FOREIGN KEY (gsis_id, drive_id)
REFERENCES drive (gsis_id, drive_id)
ON DELETE CASCADE,
FOREIGN KEY (gsis_id)
REFERENCES game (gsis_id)
ON DELETE CASCADE
)
''' % ', '.join(cat._sql_field for cat in _player_categories.values()))
select = ['play.gsis_id', 'play.drive_id', 'play.play_id'] \
+ ['COALESCE(SUM(play_player.%s), 0)' % cat.category_id
for cat in _player_categories.values()]
c.execute('''
INSERT INTO agg_play
SELECT {select}
FROM play
LEFT JOIN play_player
ON (play.gsis_id, play.drive_id, play.play_id)
= (play_player.gsis_id, play_player.drive_id, play_player.play_id)
GROUP BY play.gsis_id, play.drive_id, play.play_id
'''.format(select=', '.join(select)))

print('Aggregation complete. Adding indexes...', file=sys.stderr)
c.execute('''
CREATE INDEX agg_play_in_gsis_id
ON agg_play (gsis_id ASC);
CREATE INDEX agg_play_in_gsis_drive_id
ON agg_play (gsis_id ASC, drive_id ASC);
''')
for cat in _player_categories.values():
c.execute('CREATE INDEX agg_play_in_%s ON agg_play (%s ASC)'
% (cat, cat))

print('Indexing complete. Adding triggers...', file=sys.stderr)
c.execute('''
CREATE FUNCTION agg_play_insert() RETURNS trigger AS $$
BEGIN
INSERT INTO
agg_play (gsis_id, drive_id, play_id)
VALUES (NEW.gsis_id, NEW.drive_id, NEW.play_id);
RETURN NULL;
END;
$$ LANGUAGE 'plpgsql';
''')
c.execute('''
CREATE TRIGGER agg_play_sync_insert
AFTER INSERT ON play
FOR EACH ROW EXECUTE PROCEDURE agg_play_insert();
''')

def make_sum(field):
return 'COALESCE(SUM(play_player.{f}), 0) AS {f}'.format(f=field)
select = [make_sum(f.category_id) for f in _player_categories.values()]
set_columns = ['{f} = s.{f}'.format(f=f.category_id)
for f in _player_categories.values()]
c.execute('''
CREATE FUNCTION agg_play_update() RETURNS trigger AS $$
BEGIN
UPDATE agg_play SET {set_columns}
FROM (
SELECT {select}
FROM play
LEFT JOIN play_player
ON (play.gsis_id, play.drive_id, play.play_id)
= (play_player.gsis_id, play_player.drive_id,
play_player.play_id)
WHERE (play.gsis_id, play.drive_id, play.play_id)
= (NEW.gsis_id, NEW.drive_id, NEW.play_id)
) s
WHERE (agg_play.gsis_id, agg_play.drive_id, agg_play.play_id)
= (NEW.gsis_id, NEW.drive_id, NEW.play_id);
RETURN NULL;
END;
$$ LANGUAGE 'plpgsql';
'''.format(set_columns=', '.join(set_columns), select=', '.join(select)))
c.execute('''
CREATE TRIGGER agg_play_sync_update
AFTER INSERT OR UPDATE ON play_player
FOR EACH ROW EXECUTE PROCEDURE agg_play_update();
''')
