Renewed interest in big data and data visualisation has more people searching for public datasets.
Inspired by a post on Quora asking for some ideas on public datasets, I’ve gone through and developed a list of the top ten sources of free public data.
You often hear that civilisation is producing more data than at any point in history. What most people don’t know is that much of this data is available to the public in one way or another.
Whether you’re a beginner or an advanced analyst, publicly available datasets are a great way to refine your data mining and visualisation skills.
data.gov.au provides an easy way to find, access and reuse public datasets from the Australian Government. They encourage users to use government data to analyse, mashup and develop tools and applications which benefit all Australians.
While the data is available for download, the site also features interactive visualisations for each dataset. Below is an example of the Native Title determinations public dataset.
Academic Torrents is a distributed system built by researchers, for researchers, designed for sharing enormous datasets. Its distributed peer-to-peer library system automatically replicates datasets on many servers, so you don’t have to worry about managing your own servers or file availability.
Crunchbase is the free database of technology companies, people and investors. The data can be updated by the public for accuracy and can be used to develop an understanding of the startup world.
The IOGDS is a linked data application based on metadata “scraped” from hundreds of international dataset catalog websites publishing a rich variety of government data.
Datafiniti (subscription based)
Datafiniti prides itself on fuelling business intelligence with high-powered web data. They collect, clean and merge over 1 million records from dozens of websites each day to provide fresh data on businesses and products.
Quandl gives users the ability to find, use and share numerical data with a database of over 8,000,000 financial, economic and social public datasets.
From the GDELT About page:
“The Global Database of Events, Language, and Tone (GDELT) is an initiative to construct a catalog of human societal-scale behavior and beliefs across all countries of the world, connecting every person, organization, location, count, theme, news source, and event across the planet into a single massive network that captures what’s happening around the world, what its context is and who’s involved, and how the world is feeling about it, every single day.”
In the visualisation shown below, GDELT have produced an interactive map of violent events in Syria.
Founded in 2007, Socrata’s vision was based on the idea that public data made available in the cloud would help researchers, public policy experts, entrepreneurs and busy citizens better understand their government services.
Datamob is an excellent hub for a variety of data resources. Featured data sources include Twitter, Flickr, Digg, Amazon, Prosper, Netflix and Wikipedia, as well as links to mining and cleaning tools such as Data Wrangler from the Stanford Visualization Group.
CKAN is a powerful data management system that makes data accessible – by providing tools to streamline publishing, sharing, finding and using data. CKAN is aimed at data publishers (national and regional governments, companies and organisations) wanting to make their data open and available.
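If you’d like to query a CKAN-powered portal from code rather than through its website, the rough sketch below shows one way to search for datasets using CKAN’s Action API (the `package_search` action). The base URL in the usage note is a hypothetical placeholder, and the helper names are ours for illustration only. Check the portal you are using for its real API address and any usage limits.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_search_url(base_url, query, rows=5):
    """Build a CKAN Action API package_search URL for a free-text query."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"


def extract_titles(payload):
    """Return the dataset titles from a decoded package_search response."""
    if not payload.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    return [dataset.get("title", "") for dataset in payload["result"]["results"]]


def search_datasets(base_url, query, rows=5):
    """Fetch matching datasets from a CKAN portal (needs network access)."""
    with urlopen(build_search_url(base_url, query, rows), timeout=30) as response:
        return extract_titles(json.load(response))
```

Calling `search_datasets("https://ckan.example.org", "native title")` against a real portal address should return a short list of matching dataset titles, which you can then look up and download.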
Freebase contains tens of millions of topics, thousands of types, and tens of thousands of properties. Each of the topics in Freebase is linked to other related topics and annotated with important properties like movie genres and people’s dates of birth. There are over a billion such facts or relations that make up the graph and they’re all available for free through the APIs or to download from their weekly data dumps.
In the past we’ve worked on our own data visualisation projects using public data – our most popular piece being our visualisation of Australian suburb boundaries using Australian Bureau of Statistics data.
The sites listed above provide a variety of public data and it’s easy to see the social value offered by making it available.