Learning Club 00.b: Setup your development environment (Get started with python package nfldb)

So, just a few days ago I posted Learning Club 00: Set up your development environment (Getting started with R). There I made a mistake and decided to use R without thinking about the data set I would use. I am still happy I wrote the post because it can give all the R users in the learning club an introduction on how to setup their R environment and which resources to use. But still I have to update you about my environment 🙂

And since I don’t know what activities will come up in the learning club, at least now I have R and python available. So I am going to prep

NFL data

For the learning club, or at least the first tasks, I want to use NFL (National Football League) data. It is not so easy to get hold of sports data, since e.g. the TV channels charge lots of money for their APIs. So, after exploring programmableweb.com and mashape.com, I found that most APIs cost a lot of money. My boyfriend searched the web a little further and found the python library nfldb by BurntSushi which comes with a postgres DB that you need to install on your computer/server.

Database and server

The data is now stored in a postgres database on our debian server.

BurntSushi provides an ER diagram which can be useful for using the data.

Accessing the data

As already mentioned, the package nfldb provides several methods to access the data. The author also provides API documentation and a great wiki.

SQL

The data can be accessed via command line using psql:

psql -U nfldb

Then it will ask for your password:

Passwort für Benutzer nfldb:

Then you will get a command prompt:

nfldb=>

This is how you get help:

nfldb=>\?

This is how you get all tables:

nfldb=>\d

This is the output (sorry it’s in German):

            Liste der Relationen
 Schema |    Name     |   Typ   | Eigentümer
--------+-------------+---------+------------
 public | agg_play    | Tabelle | nfldb
 public | drive       | Tabelle | nfldb
 public | game        | Tabelle | nfldb
 public | meta        | Tabelle | nfldb
 public | play        | Tabelle | nfldb
 public | play_player | Tabelle | nfldb
 public | player      | Tabelle | nfldb
 public | team        | Tabelle | nfldb
(8 Zeilen)

And this is how you get information about table “play”:

nfldb=>\d play
 first_down         | smallint             | not null Vorgabewert 0
 fourth_down_att    | smallint             | not null Vorgabewert 0
 fourth_down_conv   | smallint             | not null Vorgabewert 0
 fourth_down_failed | smallint             | not null Vorgabewert 0
 passing_first_down | smallint             | not null Vorgabewert 0
 penalty            | smallint             | not null Vorgabewert 0
 penalty_first_down | smallint             | not null Vorgabewert 0
 penalty_yds        | smallint             | not null Vorgabewert 0
 rushing_first_down | smallint             | not null Vorgabewert 0
 third_down_att     | smallint             | not null Vorgabewert 0
 third_down_conv    | smallint             | not null Vorgabewert 0
 third_down_failed  | smallint             | not null Vorgabewert 0
 timeout            | smallint             | not null Vorgabewert 0
 xp_aborted         | smallint             | not null Vorgabewert 0
Indexe:
    "play_pkey" PRIMARY KEY, btree (gsis_id, drive_id, play_id)
    "play_in_down" btree (down)
    "play_in_first_down" btree (first_down)
    "play_in_fourth_down_att" btree (fourth_down_att)
    "play_in_fourth_down_conv" btree (fourth_down_conv)
    "play_in_fourth_down_failed" btree (fourth_down_failed)
    "play_in_gsis_drive_id" btree (gsis_id, drive_id)
    "play_in_gsis_id" btree (gsis_id)
    "play_in_passing_first_down" btree (passing_first_down)
    "play_in_penalty" btree (penalty)
    "play_in_penalty_first_down" btree (penalty_first_down)
    "play_in_penalty_yds" btree (penalty_yds)
    "play_in_pos_team" btree (pos_team)
    "play_in_rushing_first_down" btree (rushing_first_down)
    "play_in_third_down_att" btree (third_down_att)
    "play_in_third_down_conv" btree (third_down_conv)
    "play_in_third_down_failed" btree (third_down_failed)
    "play_in_time" btree ((("time").phase), (("time").elapsed))
    "play_in_timeout" btree (timeout)
    "play_in_xp_aborted" btree (xp_aborted)
    "play_in_yardline" btree (((yardline).pos))
    "play_in_yards_to_go" btree (yards_to_go DESC)
Check-Constraints:
    "play_down_check" CHECK (down >= 1 AND down <= 4)
    "play_yards_to_go_check" CHECK (yards_to_go >= 0 AND yards_to_go <= 100)
Fremdschlüssel-Constraints:
    "play_gsis_id_fkey" FOREIGN KEY (gsis_id, drive_id) REFERENCES drive(gsis_id, drive_id) ON DELETE CASCADE
    "play_gsis_id_fkey1" FOREIGN KEY (gsis_id) REFERENCES game(gsis_id) ON DELETE CASCADE
    "play_pos_team_fkey" FOREIGN KEY (pos_team) REFERENCES team(team_id) ON UPDATE CASCADE ON DELETE RESTRICT
Fremdschlüsselverweise von:
    TABLE "agg_play" CONSTRAINT "agg_play_gsis_id_fkey" FOREIGN KEY (gsis_id, drive_id, play_id) REFERENCES play(gsis_id, drive_id, play_id) ON DELETE CASCADE
    TABLE "play_player" CONSTRAINT "play_player_gsis_id_fkey" FOREIGN KEY (gsis_id, drive_id, play_id) REFERENCES play(gsis_id, drive_id, play_id) ON DELETE CASCADE
Trigger:
    agg_play_sync_insert AFTER INSERT ON play FOR EACH ROW EXECUTE PROCEDURE agg_play_insert()

python

Of course, you can also access your database using the python package nfldb. To test if everything works, I copied one of the first examples on the github page and changed it a little.

## import library
import nfldb

## connect to db
db = nfldb.connect()
q = nfldb.Query(db)

## analyse nfldb
q.game(season_year=2014, season_type="Regular")
for pp in q.sort("passing_yds").limit(7).as_aggregate():
	print pp.player, pp.passing_yds

I saved this file as 00data.py and called:

python 00data.py

The output was:

Drew Brees (NO, QB) 4952
Andrew Luck (IND, QB) 4761
Peyton Manning (DEN, QB) 4727
Matt Ryan (ATL, QB) 4694
Ben Roethlisberger (PIT, QB) 4659
Eli Manning (NYG, QB) 4410
Aaron Rodgers (GB, QB) 4381

Final words

I think the dataset is really cool because it has some relations and it is not just a data frame/matrix with samples and features like many training data sets. I think it will be interesting to analyze it, make visualizations and try to find some structure.

This was just a very introductory post because I just started using the db and the package myself. You will definitely get more insight when you follow my posts regarding the nfldb and the learning club.

Leave a Reply

Your email address will not be published.