Materialized view for play. Query demolition.
ATTN: This introduces a breaking change. The `team` field can no longer
be used in the `play` method. Instead, you should use the new
`play_player` method to select individual player statistics belonging to
a specific team.
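
As a rough before/after sketch of that migration (assuming the usual
`nfldb.connect`/`nfldb.Query` entry points and nfldb's result helpers;
the criteria here are made up):

    import nfldb

    db = nfldb.connect()

    # Before this commit, a team filter could go through Query.play:
    #
    #     q = nfldb.Query(db)
    #     q.game(season_year=2013, season_type='Regular')
    #     q.play(team='NE')          # no longer allowed
    #
    # After this commit, team (and player_id) criteria belong to the
    # new play_player method instead:
    q = nfldb.Query(db)
    q.game(season_year=2013, season_type='Regular')
    q.play_player(team='NE')
    for pp in q.as_play_players():
        print(pp)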

Otherwise, there are very few public-facing changes, but the entire
guts of `nfldb.Query` have been ripped out and replaced with more
robust SQL generation code. Moreover, several idiosyncrasies have been
fixed and some unit tests have finally been added.

1. Previously, the `Query` class did some very clever things to perform
   parts of a JOIN in Python code. The general flow was that filtering
   was applied to find primary keys---never using any JOINs---and once
   all criteria had been applied, those ids were used in a simple SELECT
   to fetch the actual rows.

   Now all of that cruft has been removed and replaced with intelligent
   SQL generation that constructs one query with all the proper JOINs.
   For whatever reason, the JOIN approach seemed slower when I
   experimented with it when I first started nfldb. Perhaps my indexes
   weren't configured properly then. In any case, I can't see much of a
   performance difference now.

2. The SQL generation code is very smart. Although it is not part of
   nfldb's public API, I imagine it would be very useful if you had some
   special needs. See the unexported but documented `nfldb.sql` module.

3. Many idiosyncrasies resulting from doing a join in Python are now
   completely gone. For example, if you applied a `sort` with a `limit`
   to complex search criteria, you were bound to get wrong answers. Say
   you sorted by both a column on the `play` table (like `down`) and a
   column on `play_player` (like `passing_tds`), and then applied a
   limit: the results would be completely wonky, because the pure
   Python join couldn't cope with that efficiently. A regular SQL join?
   Piece of cake. (See the first sketch after this list.)

4. I have added a materialized view, `agg_play`. That is a fancy name
   for "a table that automatically keeps itself up to date." In
   essence, whenever a new row is added to `play_player`, the aggregate
   statistics for that play are re-computed. This makes adding data
   slower (something that doesn't happen very often), but it makes
   querying data much faster and easier. For example, plays can be
   queried by `passing_yds` without ever joining with `play_player`
   (a join that is awkward because of the one-to-many relationship).

   To reflect this clearer separation of concerns, the `Query.play`
   method no longer accepts criteria that hit the `play_player` table.
   Instead, if you really want the `play_player` table, use the new
   `play_player` method (see the second sketch after this list). The
   only fields that `play` used to accept but no longer allows are
   `team` and `player_id`, because there is no sensible way to
   aggregate those values into a single play.

   To the best of my knowledge, that is the only possible breaking
   change here.
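
To make (1) and (3) concrete, here is a sketch of a query that used to
produce wrong answers and should now behave. It leans on nfldb's usual
query helpers (`sort`, `limit`, `as_plays`) and the `__ge` field
suffix as I recall them; the particular criteria are arbitrary.
Everything below is compiled into a single SQL statement with the
appropriate JOINs rather than being stitched together in Python.

    import nfldb

    db = nfldb.connect()

    # Sort on a play column (down) and a statistical category
    # (passing_tds) at the same time, with a limit. The old
    # Python-side join produced wonky results for this; a single
    # generated SQL query does not.
    q = nfldb.Query(db)
    q.game(season_year=2013, season_type='Regular')
    q.play(passing_tds__ge=1)
    q.sort([('down', 'asc'), ('passing_tds', 'desc')])
    q.limit(10)
    for play in q.as_plays():
        print(play)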
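
And a sketch of what (4) buys you in practice: play-level filtering on
aggregated statistics, with `play_player` reserved for per-player (or
per-team) detail. The statistical category names are real nfldb
fields; the values are made up and the result helpers are used as I
recall the query API.

    import nfldb

    db = nfldb.connect()

    # Plays can now be filtered on aggregate statistics directly;
    # under the hood this reads the new agg_play table rather than
    # joining with play_player.
    q = nfldb.Query(db)
    q.game(season_year=2013, week=1)
    q.play(passing_yds__ge=40)       # 40+ yard passing plays
    for play in q.as_plays():
        print(play)

    # When you need to know which player (or team) produced the stat,
    # use the new play_player method.
    q = nfldb.Query(db)
    q.game(season_year=2013, week=1)
    q.play_player(receiving_yds__ge=40)
    for pp in q.as_play_players():
        print(pp)
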
BurntSushi committed Jul 22, 2014
1 parent 895ec91 commit 124f659
Showing 8 changed files with 1,211 additions and 1,038 deletions.
2 changes: 1 addition & 1 deletion nfldb/__init__.py
@@ -66,7 +66,7 @@
from nfldb.query import Query, QueryOR
from nfldb.team import standard_team
from nfldb.types import __pdoc__ as __types_pdoc__
from nfldb.types import select_columns, stat_categories
from nfldb.types import stat_categories
from nfldb.types import Category, Clock, Enums, Drive, FieldPosition, Game
from nfldb.types import Play, Player, PlayPlayer, PossessionTime, Team
from nfldb.version import __pdoc__ as __version_pdoc__
124 changes: 120 additions & 4 deletions nfldb/db.py
@@ -15,12 +15,9 @@

import nfldb.team

_SHOW_QUERIES = False
"""When set, all queries will be printed to stderr."""

__pdoc__ = {}

api_version = 6
api_version = 7
__pdoc__['api_version'] = \
"""
The schema version that this library corresponds to. When the schema
@@ -29,6 +26,15 @@
anything else.
"""

_SHOW_QUERIES = False
"""When set, all queries will be printed to stderr."""

_NUM_QUERIES = 0
"""
The total number of queries executed. Only updated when _SHOW_QUERIES
is true.
"""

_config_home = os.getenv('XDG_CONFIG_HOME')
if not _config_home:
home = os.getenv('HOME')
@@ -294,6 +300,9 @@ def __enter__(self):
if _SHOW_QUERIES:
class _ (object):
def execute(self, *args, **kwargs):
global _NUM_QUERIES

_NUM_QUERIES += 1
c.execute(*args, **kwargs)
print(c.query, file=sys.stderr, end='\n\n')

@@ -786,3 +795,110 @@ def _migrate_6(c):
ALTER TABLE game ADD CONSTRAINT game_week_check
CHECK (week >= 0 AND week <= 25);
''')

def _migrate_7(c):
from nfldb.types import _player_categories

print('''
MIGRATING DATABASE... PLEASE WAIT
THIS WILL ONLY HAPPEN ONCE.
This is currently adding a play aggregation table (a materialized view) derived
from the `play` and `play_player` tables. Depending on your machine, this
should take less than two minutes (this includes aggregating the data and
adding indexes).
This aggregation table will automatically update itself when data is added or
changed.
''', file=sys.stderr)

c.execute('''
CREATE TABLE agg_play (
gsis_id gameid NOT NULL,
drive_id usmallint NOT NULL,
play_id usmallint NOT NULL,
%s,
PRIMARY KEY (gsis_id, drive_id, play_id),
FOREIGN KEY (gsis_id, drive_id, play_id)
REFERENCES play (gsis_id, drive_id, play_id)
ON DELETE CASCADE,
FOREIGN KEY (gsis_id, drive_id)
REFERENCES drive (gsis_id, drive_id)
ON DELETE CASCADE,
FOREIGN KEY (gsis_id)
REFERENCES game (gsis_id)
ON DELETE CASCADE
)
''' % ', '.join(cat._sql_field for cat in _player_categories.values()))
select = ['play.gsis_id', 'play.drive_id', 'play.play_id'] \
+ ['COALESCE(SUM(play_player.%s), 0)' % cat.category_id
for cat in _player_categories.values()]
c.execute('''
INSERT INTO agg_play
SELECT {select}
FROM play
LEFT JOIN play_player
ON (play.gsis_id, play.drive_id, play.play_id)
= (play_player.gsis_id, play_player.drive_id, play_player.play_id)
GROUP BY play.gsis_id, play.drive_id, play.play_id
'''.format(select=', '.join(select)))

print('Aggregation complete. Adding indexes...', file=sys.stderr)
c.execute('''
CREATE INDEX agg_play_in_gsis_id
ON agg_play (gsis_id ASC);
CREATE INDEX agg_play_in_gsis_drive_id
ON agg_play (gsis_id ASC, drive_id ASC);
''')
for cat in _player_categories.values():
c.execute('CREATE INDEX agg_play_in_%s ON agg_play (%s ASC)'
% (cat, cat))

print('Indexing complete. Adding triggers...', file=sys.stderr)
c.execute('''
CREATE FUNCTION agg_play_insert() RETURNS trigger AS $$
BEGIN
INSERT INTO
agg_play (gsis_id, drive_id, play_id)
VALUES (NEW.gsis_id, NEW.drive_id, NEW.play_id);
RETURN NULL;
END;
$$ LANGUAGE 'plpgsql';
''')
c.execute('''
CREATE TRIGGER agg_play_sync_insert
AFTER INSERT ON play
FOR EACH ROW EXECUTE PROCEDURE agg_play_insert();
''')

def make_sum(field):
return 'COALESCE(SUM(play_player.{f}), 0) AS {f}'.format(f=field)
select = [make_sum(f.category_id) for f in _player_categories.values()]
set_columns = ['{f} = s.{f}'.format(f=f.category_id)
for f in _player_categories.values()]
c.execute('''
CREATE FUNCTION agg_play_update() RETURNS trigger AS $$
BEGIN
UPDATE agg_play SET {set_columns}
FROM (
SELECT {select}
FROM play
LEFT JOIN play_player
ON (play.gsis_id, play.drive_id, play.play_id)
= (play_player.gsis_id, play_player.drive_id,
play_player.play_id)
WHERE (play.gsis_id, play.drive_id, play.play_id)
= (NEW.gsis_id, NEW.drive_id, NEW.play_id)
) s
WHERE (agg_play.gsis_id, agg_play.drive_id, agg_play.play_id)
= (NEW.gsis_id, NEW.drive_id, NEW.play_id);
RETURN NULL;
END;
$$ LANGUAGE 'plpgsql';
'''.format(set_columns=', '.join(set_columns), select=', '.join(select)))
c.execute('''
CREATE TRIGGER agg_play_sync_update
AFTER INSERT OR UPDATE ON play_player
FOR EACH ROW EXECUTE PROCEDURE agg_play_update();
''')
