Finding data sets Part 1: General data sources

I often encounter interesting algorithms or R packages that I want to test. The nice ones provide data for testing, but often it is only dummy data. To get a good understanding of a method and its limitations, real data might be required. Sometimes I would also like to explore data I have not used before, or just create a cool visualisation. Most of the time I need to find useful data first. Some time ago I started to collect (and bookmark) any interesting data sources I come across, and now I want to share that collection with you.

Of course, while researching for this post I found that someone has already done a similar list: 30 Places to Find Open Data on the Web | Romy Misra. Some of the sources I found are already in there, but my list also covers data sources that are not included in any other collection I know of.

Over the next couple of weeks you’ll find these posts on my blog:

  • General data sources (this post)
  • TV, music, book ratings and sports data
  • Weather, geographical and government data
  • Special data sources (that will be different data sources I found interesting but did not fit any of the above categories)

General Data Sources

APIs

An application programming interface (API) is usually a set of routines provided by someone, e.g. a company. It can provide services such as computations when the user or a client application sends the right request, and it can also be used as a data source. Usually you need an API key, which identifies you or your application. This API key can also be used to bill you (in case the API is not free) or to limit access (it is not uncommon to have a cap on the number of requests).
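To make this a bit more concrete, here is a minimal sketch of such a request in R using the httr package; the endpoint URL and the api_key parameter are made-up placeholders, not a real API.

    # minimal sketch of an authenticated API request with the httr package
    # (the URL and the api_key parameter are hypothetical placeholders)
    library(httr)

    resp <- GET(
      "https://api.example.com/v1/games",              # hypothetical endpoint
      query = list(season = 2015, api_key = "YOUR_KEY")
    )
    stop_for_status(resp)                              # fail loudly on HTTP errors
    games <- content(resp, as = "parsed")              # parse the (JSON) response
    str(games, max.level = 1)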

I know two websites that provide a searchable list of APIs: programmableweb.com and mashape.com.

I’ve only used programmableweb.com to look for NFL APIs, but I was not very successful and did not use it any further, so I cannot comment on how well the APIs are documented.

For mashape.com I can say that it has nice documentation and many samples in different programming languages (not R 🙁 ) for the APIs, at least for the ones I tried to use.

So far I am not a very big fan of APIs, mostly because authentication and the like have been cumbersome.

Web Scraping

Sometimes the data you need is on a website, but no API is provided. One option then is web scraping, which converts rather unstructured HTML into data that you can use.

A tool to help you with this is import.io. Look at some nice showcases to convince yourself.

For web scraping there is also the really cool R package rvest, which I used to scrape NFL weather data.
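To give you an idea of how little code that takes, here is a small sketch; the URL and the CSS selector are placeholders and would need to be adapted to the page you are after.

    # small web scraping sketch with rvest
    # (the URL is a placeholder, not a real page)
    library(rvest)

    page <- read_html("https://www.example.com/weather/2015/week1")
    weather <- page %>%
      html_node("table") %>%    # grab the first HTML table on the page
      html_table()              # and turn it into a data frame
    head(weather)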

Web scraping has its limitations, though, which can make it very hard or even impossible to get the data you want, even though it is right in front of you (in your web browser). Such difficulties include:

  • Websites that require a login: The tool you use needs to be able to work with HTTP sessions and “remember” that you logged in (a rough sketch of this follows after the list).
  • AJAX-driven data generation: Some pages use a lot of JavaScript to display the data you see, e.g. to load the next chunk when you click on “Next”. This can be tricky too.
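For the login case, rvest also offers session handling. Here is a rough sketch; the URL and the form field names are placeholders, and the function names vary between rvest versions (newer versions call them session(), html_form_set() and session_submit()).

    # rough sketch of logging in before scraping with rvest sessions
    # (URL and form field names are placeholders)
    library(rvest)

    s <- html_session("https://www.example.com/login")
    login_form <- html_form(s)[[1]]                 # first form on the page
    filled <- set_values(login_form,
                         username = "me",
                         password = "secret")
    s <- submit_form(s, filled)                     # log in

    # the session now carries the login cookies, so member-only pages work
    members <- jump_to(s, "https://www.example.com/members/data")
    # from here you can extract the data with the usual rvest functions,
    # e.g. html_table(members)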

Competitions

A good resource for datasets is data science competitions. The cool aspect here is that you get some inspiration on what the goal with a specific dataset is, but you can still do with it whatever you want. It is also nice that other users are probably working on the same dataset, so you can discuss things. Of course, the host of the competition can prohibit using the data for anything other than the contest, you are probably not allowed to share it, and most of the time the data is removed from the website after the competition has finished. Nevertheless, it’s worth checking out.

The best known website for data science competitions is probably kaggle.com. I have an account and have also looked into some problems/datasets, but to participate it’s probably best to be in a team and have more time. kaggle also provides some “Getting started” datasets, which is pretty cool. In one of the machine learning classes I took, we used a set of handwritten digits to train support vector machines and neural networks.

Two competitions I find especially interesting are

Other websites that host competitions:

Reddit and other online portals

On Reddit (and probably other online communities) you can also ask for data. If you find the right corner for data science, there are probably lots of people who can point you to data.

Reddit (/ˈrɛdɪt/) is an entertainment, social networking, and news website where registered community members can submit content, such as text posts or direct links, making it essentially an online bulletin board system. (wikipedia)

There is basically a subreddit for everything.

/r/datasets is the first place to go for finding datasets. Either you find something by scrolling through the endless amount of links, or you [REQUEST] certain data. Your request can be specific (e.g. “all ZIP codes of Austria”) or unspecific (“data from any horse races”) if you just need a dataset to prove a concept.

One of my favourite subreddits is /r/dataisbeautiful, which is not a data source but a nice collection of data visualisations. It can be a great source of inspiration.

There is a list of further subreddits for you data lovers:

Open Data Stackexchange

Stackexchange.com is a group of Q&A communities; you can find a community for many different topics. OpenData StackExchange is currently in beta. There you can ask anything related to open data and also request specific datasets.

Newsletters

Recently I came across the newsletter Data is Plural, which is “A weekly newsletter of useful/curious datasets” (it’s pretty new). The nice thing is that there is also an archive where you can find previously featured data. If you just want to dive in and see what kind of data you could use, it’s the right thing for you!

Example data in R

The R package {datasets} contains lots of datasets.
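For example, you can list everything the package ships with and then load a dataset straight away:

    # list all datasets that come with the {datasets} package
    data(package = "datasets")

    # load one of them and take a quick look
    data(mtcars)
    head(mtcars)
    summary(mtcars)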

Many packages also show in their help section how to randomly generate test data or how to load provided test data.
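And if no example data is provided at all, simulating a small test dataset only takes a few lines, for example:

    # quick way to simulate a small test dataset
    set.seed(42)                          # make the random numbers reproducible
    test_data <- data.frame(
      x     = rnorm(100),                 # 100 normally distributed values
      group = sample(c("A", "B"), 100, replace = TRUE)
    )
    str(test_data)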

Many people also create data packages, e.g. here. Hadley Wickham has several such data packages that can be installed and used directly in R.
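One example is the nycflights13 package, which bundles data on all flights departing New York City in 2013; once installed, the data frames are available right away:

    # install and use a data package, e.g. nycflights13
    install.packages("nycflights13")
    library(nycflights13)

    dim(flights)       # all flights that departed NYC in 2013
    head(airlines)     # lookup table translating carrier codes into names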
