Learning Club 01: Find and explore a dataset

The first activity of the data science learning club I am participating in is to find and explore a dataset. I already described the data I found, and will be using, in my last post. You can follow all my learning club related activities here.

The tasks of this activity are (quoted from the thread above):

Learn how to import and explore at least one of your datasets using the development environment you set up in Activity 0. At a minimum, you should determine:

  • how many rows of data are in your dataset
  • how to preview a sample of rows in your dataset
  • how to refer to a particular row and column to find a value
  • for each column, how many rows have data values, and what datatype are they
  • what constitutes a missing value (null, blank space, NaN, placeholder value, etc.)
  • descriptive summary statistics (min, max, mean, median, variance, number of distinct values, etc.) on each key column in your dataset

Don’t worry about generating visualizations or anything yet, that is a future activity.

Since my data is stored in a PostgreSQL database, I can use SQL for the first basic tasks.

How many rows of data are in the dataset

As I showed last time, there are 8 tables in my nfldb database. To determine the number of rows of a table in SQL, you use:

select count(*) from table_name;

I get the following numbers:

 Table name  | # Rows
-------------+--------
 agg_play    | 379779
 drive       |  50936
 game        |   2312
 meta        |      1
 play        | 379779
 play_player | 841171
 player      |   6736
 team        |     33
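
Running the count query once per table gets tedious. As a small sketch, all eight counts can be collected into a single result set with union all (the table names are simply the ones from my nfldb schema):

-- one row per table, all counts in one result set
select 'agg_play' as table_name, count(*) as n_rows from agg_play
union all select 'drive',       count(*) from drive
union all select 'game',        count(*) from game
union all select 'meta',        count(*) from meta
union all select 'play',        count(*) from play
union all select 'play_player', count(*) from play_player
union all select 'player',      count(*) from player
union all select 'team',        count(*) from team
order by table_name;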

Preview sample of rows

This is a task I often do at work in SQL, either to check whether the data in a table is valid or to remind myself what the data looks like. In R you would use:

head(myDataFrame)

In SQL it depends on the database system you are using. Since I use PostgreSQL, the query looks like this:

select * from game limit 10;

This returns 10 rows (without an order by, PostgreSQL gives no guarantee which ones). Since there are many columns, I decided to select only a few. With the first command below I get information about the table ‘game’ to see which columns exist, and with the second one I select just the columns I am interested in.

\d game
select week, home_team, home_score, away_team, away_score from game limit 10;

The output is:

 week | home_team | home_score | away_team | away_score
------+-----------+------------+-----------+------------
   16 | PIT       |         20 | KC        |         12
    7 | JAC       |         34 | BUF       |         31
   16 | TB        |          3 | GB        |         20
   16 | STL       |         27 | NYG       |         37
   16 | DAL       |         42 | IND       |          7
   16 | OAK       |         26 | BUF       |         24
    1 | NO        |         17 | CIN       |          7
    1 | CAR       |         27 | ARI       |         16
    1 | TB        |         14 | TEN       |         42
    1 | PIT       |         17 | BAL       |         30
(10 rows)

In combination with ‘order by’, this can also be used to select the top N rows, e.g. the top N games by away_score. I don’t want to go into much detail here, but a small sketch follows below.
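
For illustration, a minimal top-N sketch (the five games with the highest away_score; the column selection is my own choice):

select week, away_team, away_score, home_team, home_score
from game
order by away_score desc
limit 5;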

Refer to a particular row and column to find a value

Let’s say I want to find out the weight for a particular player.

select weight from player where full_name='Brock Osweiler';
 weight
--------
    240
(1 row)

If there were no players with that name, I would get no results. If I got several rows, that would mean that more than one player has that name.
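
If I wanted to check up front which names are ambiguous, a grouped count would do it (a sketch against the player table):

-- full names that appear more than once
select full_name, count(*) as n
from player
group by full_name
having count(*) > 1
order by n desc;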

For each column, how many rows have data values, and what datatype are they

My dataset is rather comprehensive, so I will pick a few interesting columns. First, I determined how many columns each table has. This can be done by querying the schema of the database:

select count(*) from information_schema.columns where table_name='agg_play'; -- insert correct table name
 Table name  | # Columns
-------------+-----------
 agg_play    |        98
 drive       |        15
 game        |        26
 meta        |         5
 play        |        26
 play_player |       100
 player      |        16
 team        |         3
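
Instead of running that query once per table, the counts can also be grouped in a single statement (a sketch; the list of names matches my nfldb tables):

select table_name, count(*) as n_columns
from information_schema.columns
where table_name in ('agg_play', 'drive', 'game', 'meta',
                     'play', 'play_player', 'player', 'team')
group by table_name
order by table_name;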

To get information about the tables, I used this:

\d agg_play

I did that for all tables and saw that most columns are declared ‘NOT NULL’, which means they can’t contain NULL values.
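
The same nullability information is available without psql’s \d command, by querying the schema directly (a sketch for the ‘game’ table):

-- list the columns of 'game' that allow NULL values
select column_name, data_type
from information_schema.columns
where table_name = 'game' and is_nullable = 'YES'
order by ordinal_position;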

The most common data types used are:

  • gameid (seems to be a custom id type)
  • usmallint, smallint, real (numeric values)
  • character varying(n), text (string/character values)
  • utctime (date)
  • field_pos, game_time, game_day, season_phase, player_pos, player_status (seem to be custom types)

The datatype that occurs most often is smallint.

To check out columns that allow NULL values, I chose the table ‘game’. Besides some information about the game itself (mostly date-related, plus which teams played), it contains the scores for each quarter and the final scores. Only the final scores are marked as NOT NULL, which means they must contain a value, otherwise the row cannot be inserted. The scores per quarter can be NULL, so I want to check how often those values were left out.

 home_score     | usmallint            | not null
 home_score_q1  | usmallint            |
 home_score_q2  | usmallint            |
 home_score_q3  | usmallint            |
 home_score_q4  | usmallint            |
 away_score     | usmallint            | not null
 away_score_q1  | usmallint            |
 away_score_q2  | usmallint            |
 away_score_q3  | usmallint            |
 away_score_q4  | usmallint            |

As we already determined, the table contains 2312 entries. To check how many rows have a NULL entry in a certain column, do this:

select count(*) from game where home_score_q1 IS NULL;

To check all 8 columns, I replaced ‘home_score_q1’ with the other column names. None of the 8 columns contains any NULL values, which is always nice to see 🙂
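
The eight checks can also be folded into one query, because count(column) skips NULL values while count(*) does not (a sketch):

-- NULL count per quarter column
select count(*) - count(home_score_q1) as home_q1_nulls,
       count(*) - count(home_score_q2) as home_q2_nulls,
       count(*) - count(home_score_q3) as home_q3_nulls,
       count(*) - count(home_score_q4) as home_q4_nulls,
       count(*) - count(away_score_q1) as away_q1_nulls,
       count(*) - count(away_score_q2) as away_q2_nulls,
       count(*) - count(away_score_q3) as away_q3_nulls,
       count(*) - count(away_score_q4) as away_q4_nulls
from game;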

What constitutes a missing value

Since I might work with SQL, Python and/or R, I did this exercise for all three.

 Language | Missing values
----------+----------------
 SQL      | NULL
 R        | NULL, NA
 Python   | None

In SQL, NULL is not case-sensitive; you can also write null. In R you actually have to write NULL or NA, not null/na/… R also has NaN, but I don’t see it as a missing value; it rather tells you that something went wrong during your computations because you used something that is not a number. In Python, None is the built-in missing value, and it is case-sensitive; libraries such as pandas additionally use NaN to mark missing entries.
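
One SQL pitfall worth noting: NULL never compares equal to anything, not even to itself, which is why the IS NULL predicate from above is needed (a small sketch):

-- '=' against NULL yields unknown, so this always counts 0 rows
select count(*) from game where home_score_q1 = NULL;
-- the correct test uses IS NULL
select count(*) from game where home_score_q1 IS NULL;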

Descriptive summary statistics

(min, max, mean, median, variance, number of distinct values, etc.)

For this exercise I chose column home_score of table game.

min, max, avg (mean), variance, and counting distinct values are built into PostgreSQL. A median function is not, therefore I copied one from the PostgreSQL wiki (linked in the source code below).

The functions can be used like this:

select min(home_score) from game;
select max(home_score) from game;
select avg(home_score) from game;
select median(home_score) from game;
select variance(home_score) from game;
select count(distinct home_score) from game;
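
As an alternative to the custom aggregate: if your PostgreSQL version is 9.4 or newer, the built-in ordered-set aggregate percentile_cont computes the median directly:

-- built-in median (PostgreSQL >= 9.4)
select percentile_cont(0.5) within group (order by home_score) as median
from game;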

The minimum value is 0, which is not surprising.
The maximum value is 62, which is pretty high but still reasonable.
The mean value is 21.65 and the median 22.0.
The variance is 133.55.
There are 56 different values (which is probably not very informative).

Because for this data I don’t think the number of distinct values is interesting, I instead looked at how often each score appears, and the result shows an interesting pattern. 168 games ended with 0 points for the home team, 0 games with 1 point, and 1 game with 2 points. From 3 points upwards most totals appear, although some, like 8 or 11 points, are rare. Scores like 13, 17, 20, 23, … are especially common.

This comes from the way points can be scored. A touchdown is worth 6 points and is followed by either an extra-point kick (+1) or a two-point conversion attempt (+2 points, as the name suggests, but tried rarely because the kick is a lot safer). A field goal is worth 3 points, and a safety 2 points (don’t hit me if I am wrong, I just started watching the NFL and the rules are really complicated – just leave me a nice comment correcting me :)). So it makes sense that there are many games with 0 points, none with 1 point (a lone extra point can’t happen, since it requires a touchdown first), and only 1 game with 2 points for the home team (a single safety). Then comes a first peak at 3, 6, and 7 points: the games with 3 points ended with a single field goal, 6 points could be a touchdown without the extra point or two field goals, and 7 points are most likely a touchdown followed by the extra point.

If you can’t follow this, don’t worry. I just noticed how important it is to have domain knowledge, because otherwise I could not make any sense of the pattern behind these scores 🙂

 home_score | count
------------+-------
          0 |   168
          2 |     1
          3 |    43
          6 |    37
          7 |    57
          8 |     3
          9 |    22
         10 |    86
         11 |     9
         12 |    18
         13 |   104
         14 |    77
         15 |    15
         16 |    60
         17 |   152
         18 |    18
         19 |    58
         20 |   148
         21 |    55
         22 |    34
         23 |   117
         24 |   159
         25 |    22
         26 |    51
         27 |   136
         28 |    59
         29 |    15
         30 |    71
         31 |   115
         32 |    15
         33 |    24
         34 |    84
         35 |    39
         36 |    11
         37 |    31
         38 |    53
         39 |     6
         40 |    10
         41 |    25
         42 |    14
         43 |    12
         44 |     8
         45 |    24
         46 |     1
         47 |     4
         48 |    12
         49 |     9
         51 |     5
         52 |     3
         53 |     1
         54 |     1
         55 |     4
         56 |     1
         58 |     1
         59 |     3
         62 |     1

Final words

For me this activity was not that difficult; I just refreshed my PostgreSQL knowledge a little bit, because I mainly use MSSQL. The most important thing I learned/noticed is that background knowledge about the data (e.g. the rules of the game) is very helpful, because otherwise it is hard to find and understand patterns or to see whether the data makes sense.

Download source code

-- task 1

select count(*) from agg_play;
select count(*) from drive;
select count(*) from game;
select count(*) from meta;
select count(*) from play;
select count(*) from play_player;
select count(*) from player;
select count(*) from team;


-- task 2

select * from game limit 10;
\d game
select week, home_team, home_score, away_team, away_score from game limit 10;


-- task 3

select weight from player where full_name='Brock Osweiler';

-- task 4

select count(*) from information_schema.columns where table_name='agg_play'; -- 98
select count(*) from information_schema.columns where table_name='drive'; -- 15
select count(*) from information_schema.columns where table_name='game'; -- 26
select count(*) from information_schema.columns where table_name='meta'; -- 5
select count(*) from information_schema.columns where table_name='play'; -- 26
select count(*) from information_schema.columns where table_name='play_player'; -- 100
select count(*) from information_schema.columns where table_name='player'; -- 16
select count(*) from information_schema.columns where table_name='team'; -- 3

\d agg_play
\d drive
\d game
\d meta
\d play
\d play_player
\d player
\d team

select count(*) from game where home_score_q1 IS NULL;
select count(*) from game where home_score_q2 IS NULL;
select count(*) from game where home_score_q3 IS NULL;
select count(*) from game where home_score_q4 IS NULL;
select count(*) from game where away_score_q1 IS NULL;
select count(*) from game where away_score_q2 IS NULL;
select count(*) from game where away_score_q3 IS NULL;
select count(*) from game where away_score_q4 IS NULL;

-- task 5

--median https://wiki.postgresql.org/wiki/Aggregate_Median#median.28numeric.29

select min(home_score) from game;
select max(home_score) from game;
select avg(home_score) from game;
select median(home_score) from game;
select variance(home_score) from game;
select count(distinct home_score) from game;
select home_score, count(*) from game group by home_score order by home_score;
