Finding data sets Part 2: TV, music, book ratings and sports data

The first part gave a general overview of where to get data. This part lists specific data sources for particular interests such as movies, music, books and sports.

Over the next couple of weeks you’ll find these posts on my blog:

  • General data sources
  • TV, music, book ratings and sports data (this post)
  • Weather, geographical and government data
  • Special data sources (sources I found interesting that did not fit any of the above categories)

Movies and TV series

Previously I was looking for data that could be used to train and test a recommender system. I had heard about the Netflix Prize, for which Netflix provided a suitable dataset. Unfortunately Netflix no longer shares this dataset. A kind user sent it to me after I requested it on /r/datasets, but it is too big to upload here on the blog and I think it is not even allowed to share it anymore. But I have it 😉

As an alternative I found the movielens dataset, which has similar structure (ratings per user and movie). The page provides 6 different versions.
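To make the "ratings per user and movie" structure concrete, here is a minimal Python sketch. It assumes a tab-separated file with the columns user id, movie id, rating and timestamp; that layout is an assumption, so check the README of the version you download. The sample rows are made up.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows. Assumed layout (not confirmed by the dataset page):
# user_id <TAB> movie_id <TAB> rating <TAB> timestamp
SAMPLE = "1\t10\t4\t881250949\n1\t20\t5\t881250950\n2\t10\t2\t881250951\n"


def load_ratings(handle):
    """Return {user_id: {movie_id: rating}} from a tab-separated ratings file."""
    ratings = defaultdict(dict)
    for user_id, movie_id, rating, _timestamp in csv.reader(handle, delimiter="\t"):
        ratings[int(user_id)][int(movie_id)] = float(rating)
    return dict(ratings)


def mean_rating_per_movie(ratings):
    """Average all user ratings for each movie."""
    per_movie = defaultdict(list)
    for user_ratings in ratings.values():
        for movie_id, rating in user_ratings.items():
            per_movie[movie_id].append(rating)
    return {movie_id: sum(vals) / len(vals) for movie_id, vals in per_movie.items()}


ratings = load_ratings(io.StringIO(SAMPLE))
print(mean_rating_per_movie(ratings))
```

For a real file you would pass `open(path, newline="")` instead of the in-memory sample. A nested dictionary like this is only practical for the smaller versions; the larger ones are better handled with a sparse matrix.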

Additionally I found out that grouplens, a research group at the University of Minnesota, provides 6 other datasets useful for recommender systems, e.g. from last.fm and http://www.bookcrossing.com/.

Furthermore I found imdb to be a useful resource for movie and TV series ratings. You can:

At this point I’d like to point out the shiny app IMDb Explorer, which lets you have a first glimpse at figures like

There is also an API by rottentomatoes.com.

Furthermore, there is the Open Movie Database (OMDb) which has an official API.

This is how you would query the first season of Game of Thrones:

http://www.omdbapi.com/?i=tt0944947&Season=1

And the result is:

{"Title":"Game of Thrones","Year":"2011–","Rated":"TV-MA","Released":"17 Apr 2011","Runtime":"56 min","Genre":"Adventure, Drama, Fantasy","Director":"N/A","Writer":"David Benioff, D.B. Weiss","Actors":"Peter Dinklage, Lena Headey, Emilia Clarke, Kit Harington","Plot":"Several noble families fight for control of the mythical land of Westeros.","Language":"English","Country":"USA","Awards":"Won 1 Golden Globe. Another 133 wins & 250 nominations.","Poster":"http://ia.media-imdb.com/images/M/MV5BMTYwOTEzMDMzMl5BMl5BanBnXkFtZTgwNzExODIzNzE@._V1_SX300.jpg","Metascore":"N/A","imdbRating":"9.5","imdbVotes":"877,359","imdbID":"tt0944947","Type":"series","Response":"True"}
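If you want to work with such a response in code, the following Python sketch builds the query URL shown above and pulls a few fields out of the JSON. The helper names `build_omdb_url` and `summarize` are made up for this example, and the optional `apikey` parameter is an assumption: the service may require a key, so check its documentation. The sketch parses a shortened copy of the response above instead of calling the API.

```python
import json
from urllib.parse import urlencode

OMDB_BASE = "http://www.omdbapi.com/"


def build_omdb_url(imdb_id, season=None, api_key=None):
    """Build a query URL like the one shown in the post (hypothetical helper)."""
    params = {"i": imdb_id}
    if season is not None:
        params["Season"] = season
    if api_key is not None:
        # Assumption: the key is passed as 'apikey'; verify against the OMDb docs.
        params["apikey"] = api_key
    return OMDB_BASE + "?" + urlencode(params)


def summarize(response_text):
    """Extract title, rating and vote count from a JSON response string."""
    data = json.loads(response_text)
    rating = data.get("imdbRating", "N/A")
    votes = data.get("imdbVotes", "N/A")
    return {
        "title": data["Title"],
        "rating": float(rating) if rating != "N/A" else None,
        # Vote counts arrive as strings with thousands separators, e.g. "877,359".
        "votes": int(votes.replace(",", "")) if votes != "N/A" else None,
    }


# Shortened copy of the response shown above.
sample = '{"Title":"Game of Thrones","imdbRating":"9.5","imdbVotes":"877,359","Response":"True"}'

print(build_omdb_url("tt0944947", season=1))
print(summarize(sample))
```

Note that all values come back as strings, including numbers and "N/A" placeholders, so some conversion like the one in `summarize` is needed before you can compute anything.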

Music

As already mentioned above, grouplens hosts many useful datasets for recommendation systems. One of their subcategories is HetRec 2011, containing datasets provided by the 2nd International Workshop on Information Heterogeneity and Fusion in Recommender Systems. One of them is a Last.FM dataset containing 92,800 artist listening records from 1,892 users.

Last.FM also has an API, which provides artist, user, song, … information, and also lets you get events by geolocation, which is cool.

For music, metadata can also be interesting. For example, http://developers.music-story.com provides methods to get information about artists, even including their Twitter account and other interesting data. If you have such information, and maybe also age and gender information about the users, it might be a lot easier to build a recommender system.

There was even a contest in 2012 asking “CAN YOU PREDICT WHO WILL LOVE A NEW SONG?”. They published their One Million Interview Dataset but unfortunately the download link (at the bottom of the page) does not seem to work anymore. And everyone else who claims to have it just links to the same broken link. Maybe someone can find it and tell me 🙂

echonest.com also provides an API, which even has parameters like minimum danceability. As a totally unmusical person I don’t have any idea what to do with that, but others might find it helpful 🙂

programmableweb.com has a huge list of APIs for music metadata; I guess it’s worth checking out. I did not find anything useful on mashape.com though.

Books

goodreads provides an API to 10 million reviews across 700,000 book titles.

As already mentioned above, grouplens hosts many useful datasets for recommendation systems. One of their subcategories is Book-Crossing. The dataset contains “278,858 users (anonymized but with demographic information) providing 1,149,780 ratings (explicit / implicit) about 271,379 books” collected in August/September 2004.
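The "explicit / implicit" distinction matters when you load the ratings, so here is a small Python sketch of how one might separate the two. Everything about the file layout is an assumption to verify against the dataset's own description: a semicolon-separated file with quoted fields, the column names `User-ID`, `ISBN` and `Book-Rating`, and a rating of 0 marking an implicit interaction. The sample rows are made up.

```python
import csv
import io

# Made-up sample rows in the assumed layout (semicolon-separated, quoted fields).
SAMPLE = (
    '"User-ID";"ISBN";"Book-Rating"\n'
    '"1";"0000000001";"0"\n'
    '"2";"0000000002";"5"\n'
    '"3";"0000000003";"8"\n'
)


def split_ratings(handle):
    """Split rows into explicit ratings and implicit interactions.

    Assumption: a rating of 0 means the user interacted with the book
    without scoring it (implicit); anything else is an explicit score.
    """
    explicit, implicit = [], []
    for row in csv.DictReader(handle, delimiter=";", quotechar='"'):
        rating = int(row["Book-Rating"])
        record = (row["User-ID"], row["ISBN"], rating)
        if rating == 0:
            implicit.append(record)
        else:
            explicit.append(record)
    return explicit, implicit


explicit, implicit = split_ratings(io.StringIO(SAMPLE))
print(len(explicit), "explicit,", len(implicit), "implicit")
```

Keeping the two groups apart avoids treating "no score given" as the worst possible rating when you compute averages or train a model.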

Sports data

Getting hold of sports data turned out to be the most difficult. A few weeks ago I tried to find NFL (National Football League) data and searched the above-mentioned API providers, but the result was disappointing. The data was either old, not relevant or really, really expensive. From their point of view it makes sense to sell the data, because people who play Fantasy Football (must see!) are willing to pay money for it to predict game outcomes.

My super intelligent boyfriend Patrick finally found the perfect project on github: nfldb. The user BurntSushi provides a Python library to set up and regularly update an NFL database. It took Patrick one evening to install the database on our server and now we have all the data that’s relevant. BurntSushi also provides documentation.

I found another user on github who provides data for many different sports: repo of octonion. I have not looked into it very deeply yet but it will definitely be worth checking out.

Final words

This was the second post in my series about data sources. While researching and writing this post I learned that for movies, music and books the most obvious data is user ratings, but there is a lot more to it. Metadata can be very useful, and if you have it you have more options for what to do with the data (and it allows for nicer visualisations 😉 ). Such data is mostly free, either because it was collected for a competition and published afterwards or because the websites provide it as a service for their users. This is often even an advantage for them because it encourages users to write apps using their APIs. I also learned that sports data (no matter which sport) is harder to get, probably because most TV channels charge money for it. People seem to be willing to pay for it because using the data right can increase their chances in betting and fantasy sports.
