The first activity of the data science learning club I am participating in is to find and explore a dataset. I already described the data I found and will use in the last post. You can follow all my learning club related activities here.
The tasks of this activity are (quoted from the thread above):
Learn how to import and explore at least one of your datasets using the development environment you set up in Activity 0. At a minimum, you should determine:
- how many rows of data are in your dataset
- how to preview a sample of rows in your dataset
- how to refer to a particular row and column to find a value
- for each column, how many rows have data values, and what datatype are they
- what constitutes a missing value (null, blank space, NaN, placeholder value, etc.)
- descriptive summary statistics (min, max, mean, median, variance, number of distinct values, etc.) on each key column in your dataset
Don’t worry about generating visualizations or anything yet, that is a future activity.
Since my data is stored in a postgres database, I can use SQL for the first basic tasks.
- 1 How many rows of data are in the dataset
- 2 Preview sample of rows
- 3 Refer to a particular row and column to find a value
- 4 For each column, how many rows have data values, and what datatype are they
- 5 What constitutes a missing value
- 6 Descriptive summary statistics
- 7 Final words
- 8 Download source code
- 9 Share this:
- 10 Like this:
How many rows of data are in the dataset
Like I have shown last time, I have 8 tables in my nfldb. To determine the number of rows in SQL you use:
select count(*) from table_name;
I get the following numbers:
|Table name||# Rows|
Preview sample of rows
This is a task I often do at work in SQL to see whether the data in a table is valid or to remember what the data looked like. In R you would use:
And in SQL it depends on the database system you are using. Since I use postgres, you have to write this:
select * from game limit 10;
With this you get the first 10 rows. Since there are many columns I decided to only select a few. With the first command I get information about the table ‘game’ to see which columns are there and with the second one I get the columns I am interested in.
\d game select week, home_team, home_score, away_team, away_score from game limit 10;
The output is:
week | home_team | home_score | away_team | away_score ------+-----------+------------+-----------+------------ 16 | PIT | 20 | KC | 12 7 | JAC | 34 | BUF | 31 16 | TB | 3 | GB | 20 16 | STL | 27 | NYG | 37 16 | DAL | 42 | IND | 7 16 | OAK | 26 | BUF | 24 1 | NO | 17 | CIN | 7 1 | CAR | 27 | ARI | 16 1 | TB | 14 | TEN | 42 1 | PIT | 17 | BAL | 30 (10 Zeilen)
In combination with ‘order by’ this can also be used to select top N rows, e.g. top N games by away_score, but I don’t want to go into much detail here.
Refer to a particular row and column to find a value
Let’s say I want to find out the weight for a particular player.
select weight from player where full_name='Brock Osweiler';
weight -------- 240 (1 Zeile)
If there were no players with that name, I would get no results. If I would get several rows that would mean that more than one player has that name.
For each column, how many rows have data values, and what datatype are they
My dataset is rather comprehensive, so I will pick a few interesting columns. First I found out for each table how many columns it has. This can be done by querying the schema of the database:
select count(*) from information_schema.columns where table_name='agg_play'; -- insert correct column name
|Table name||# Columns|
To get information about the tables, I used this:
I did that for all tables and saw that most columns are set to ‘NOT NULL’ which means they can’t have null values.
The most common data types used are:
- gameid (seems to be custom id type)
- usmallint, smallint, real (numeric values)
- character varying(n), text (string/character values)
- utctime (date)
- field_pos, game_time, game_day, season_phase, player_pos, player_status (seem to be custom types)
The most occuring datatype is smallint.
To check out columns that allow NULL values, I chose the table ‘game’. Besides some information about the game itself (mostly date related and which teams play) it contains scores for each quarter and the final one. Only the final scores are marked as NOT NULL, which means they need to contain a value otherwise a row can not be inserted. The scores per quarter can be NULL and therefore I want to check how often those values were not inserted.
home_score | usmallint | not null home_score_q1 | usmallint | home_score_q2 | usmallint | home_score_q3 | usmallint | home_score_q4 | usmallint | away_score | usmallint | not null away_score_q1 | usmallint | away_score_q2 | usmallint | away_score_q3 | usmallint | away_score_q4 | usmallint |
As we already checked before, the table contains 2312 entries. To check how many rows have NULL entries at a certain column, do this:
select count(*) from game where home_score_q1 IS NULL;
To check all 8 columns I replaced ‘home_score_q1’ with the other column names. None of the 8 columns contains any NULL values. Which is always nice to see 🙂
What constitutes a missing value
Since I might work with SQL, Python and/or R, I did this exercise for all three.
In SQL NULL is not case-sensitive, you can also write null. In R you actually have to write NULL or NA and not null/na/… In R there is also NaN but I don’t see it as a missing value, it rather tells you that something went wrong during your computations because you used something that is not a number. For Python I am not entirely sure whether None is the only value that’s used for missing data, but it’s also case-sensitive.
Descriptive summary statistics
(min, max, mean, median, variance, number of distinct values, etc.)
For this exercise I chose column home_score of table game.
min, max, avg (mean), variance and counting distinct values is available in postgreSQL. The median function is not available, therefore I copied it from here.
The methods can be used like:
select min(home_score_q1) from game; select max(home_score_q1) from game; select avg(home_score) from game; select median(home_score) from game; select var(home_score) from game; select count(distinct home_score) from game;
The minimum value is 0, which is not surprising.
The maximum value is 62, which is pretty high but still reasonable.
The mean value is 21.65 and the median 22.0.
The variance is 133.55.
There are 56 different values (which is probably not very informative).
Because for this data I don’t think that the amount of different values is interesting, I looked at how often each score appears. What we see is an interesting pattern. 168 games ended with 0 points for the home team, 0 with 1 point, 1 with 2 points. There are some games with 3 or more points, with exceptions for games with 8, 11, … points. There are many games with scores of 13, 17, 20, 23, … This comes from the way how scores can be achieved. For a touch down a team gets 6 points and can get an additional point by a field goal or try a 2-point convertion (+2 points as the name suggests, but happens rarely because a field goal is a lot safer) and if you don’t get a touch down but a field goal you get 3 points (don’t hit me if I am wrong, just started watching the NFL and the rules are really complicated – just leave me a nice comment correcting me :)). There are possibilities to get 1, 2 or 3 points, and the next is 6 points. So it makes sense that there are many with 0 points, none with 1 point (would be possible but apparently it never happened), only 1 game with 2 points for the home team (actually I don’t know how this is possible – maybe the rules changed same time) and then there is a first peek with 3, 6 and 7 points. The game with 3 points was obviously a field goal, the 6 points could be a touch down or 2 field goals and the 7 points are probably a touch down followed by a field goal … If you can’t follow this, don’t worry, I just noticed how important it is to have domain knowledge, because otherwise I could not make any sense of the pattern behind these scores 🙂
home_score | count ------------+------- 0 | 168 2 | 1 3 | 43 6 | 37 7 | 57 8 | 3 9 | 22 10 | 86 11 | 9 12 | 18 13 | 104 14 | 77 15 | 15 16 | 60 17 | 152 18 | 18 19 | 58 20 | 148 21 | 55 22 | 34 23 | 117 24 | 159 25 | 22 26 | 51 27 | 136 28 | 59 29 | 15 30 | 71 31 | 115 32 | 15 33 | 24 34 | 84 35 | 39 36 | 11 37 | 31 38 | 53 39 | 6 40 | 10 41 | 25 42 | 14 43 | 12 44 | 8 45 | 24 46 | 1 47 | 4 48 | 12 49 | 9 51 | 5 52 | 3 53 | 1 54 | 1 55 | 4 56 | 1 58 | 1 59 | 3 62 | 1
For me this activity was not that difficult, I just refreshed my postgreSQL knowledge a little bit because I mainly use MSSQL. The most important thing I learned/noticed is that it is very helpful if you have background knowledge about the data (e.g. how the rules are for the game) otherwise it will be hard to find and understand patterns you notice or to see if the data makes sense.
Download source code
-- task 1 select count(*) from agg_play; select count(*) from drive; select count(*) from game; select count(*) from meta; select count(*) from play; select count(*) from play_player; select count(*) from player; select count(*) from team; -- task 2 select * from game limit 10; \d game select week, home_team, home_score, away_team, away_score from game limit 10; -- task 3 select weight from player where full_name='Brock Osweiler'; -- task 4 select count(*) from information_schema.columns where table_name='agg_play'; -- 98 select count(*) from information_schema.columns where table_name='drive'; -- 15 select count(*) from information_schema.columns where table_name='game'; -- 26 select count(*) from information_schema.columns where table_name='meta'; -- 5 select count(*) from information_schema.columns where table_name='play'; -- 26 select count(*) from information_schema.columns where table_name='play_player'; -- 100 select count(*) from information_schema.columns where table_name='player'; -- 16 select count(*) from information_schema.columns where table_name='team'; -- 3 \d agg_play \d drive \d game \d meta \d play \d play_player \d player \d team select count(*) from game where home_score_q1 IS NULL; select count(*) from game where home_score_q2 IS NULL; select count(*) from game where home_score_q3 IS NULL; select count(*) from game where home_score_q4 IS NULL; select count(*) from game where away_score_q1 IS NULL; select count(*) from game where away_score_q2 IS NULL; select count(*) from game where away_score_q3 IS NULL; select count(*) from game where away_score_q4 IS NULL; -- task 5 --median https://wiki.postgresql.org/wiki/Aggregate_Median#median.28numeric.29 select min(home_score) from game; select max(home_score) from game; select avg(home_score) from game; select median(home_score) from game; select variance(home_score) from game; select count(distinct home_score) from game; select home_score, count(*) from game group by home_score order by home_score;