B-data

B-data is a curated collection of resources for working with online behavioural data. In recent years, there has been an enormous growth in freely available datasets. In parallel, computational resources and tools to process and analyse them have become more broadly accessible. Lastly, working on pre-existing data can be an important resource in situations where collecting new data may be problematic, for example because of COVID-19-related restrictions.

The collection aims to provide not only links to datasets of psychological interest, but a more comprehensive perspective on working with online data, including software, examples of previous research, and pointers to tools for extracting your own data from web pages and social media. It will hopefully be useful for research, student projects, and teaching.

Finally, the number of datasets and software tools out there is enormous. The collection tries to strike a balance between standard choices and resources linked to more specific topics, and cannot be in any way exhaustive. If you have a niche interest you will probably not find a dedicated dataset here, but the collection should, at least, be a good place to start exploring.

If you have any comments, feedback, or suggestions for a resource to include, please email alberto.acerbi@brunel.ac.uk

Software

This section collects resources for processing and analysing data. While it is possible to work on some of the datasets linked below with software like Excel or SPSS (or, better, Jamovi, which is free and based on R), it is strongly advised to use a more versatile programming language. The language of choice here is R, a very popular language in academia, which is free and has a huge community of developers that makes it easy to find further dedicated resources online.

Software, libraries, tools

Jamovi: Jamovi is a free and open statistical software based on R, but similar to Excel or SPSS.

R software: R is a free software environment for statistical computing and graphics.

RStudio: The integrated development environment (IDE) for R. Free and open source.

Tidyverse: A collection of R packages designed for data science, which are becoming the standard in R usage.

R Markdown: Allows you to create fully reproducible documents with R that contain code, text, and plots. They can be exported in various formats, including HTML, PDF, and MS Word.

GitHub: GitHub is a website and cloud-based service where you can store your projects, especially those containing code and data. It is based on Git, a version control system that keeps track of all the modifications to a project, facilitating collaboration and the retrieval of previous versions of your files.

Tutorials and manuals

Getting started with R: Collection of resources from RStudio Education to start with R.

R for Data Science: Online book, free to use, providing a solid foundation of data science with R, using tidyverse.

GitHub Hello World: Get started with GitHub. Part of the GitHub Guides series, which covers various topics.

Curating Research Assets: A Tutorial on the Git Version Control System: A straightforward guide to using GitHub from RStudio, so as to have the advantages of both.

Text Mining with R: Online book, free to use, exploring, from scratch, text analysis with R (and the tidyverse). Covers topics such as word-frequency analysis, sentiment analysis, and topic modelling.

Papers

This section links to a few papers, journals, or special issues that are particularly relevant for the usage of online behavioural data.

Journal of Open Psychology Data: JOPD "features peer reviewed data papers describing psychology datasets with high reuse potential".

Special Issue on Big Data in Behavior Research Methods: 29 articles (from 2019) with examples of application of "big data" in psychology.

Quantitative Analysis of Culture Using Millions of Digitized Books: The foundational text of "culturomics", presenting the Google books ngram corpus.

Coding culture: challenges and recommendations for comparative cultural databases: Methods paper about good practices for the construction - but relevant for the usage - of cross-cultural linguistic and ethnographic datasets.

Datasets

This is the main section of the collection, with links to the datasets. It is divided into two subsections:

Naturally occurring data are, roughly, data that are produced without the direct intervention of the researcher. They are the result of activities not aimed at research. Examples of naturally occurring data relevant for psychology are texts, videos, audio recordings, social media posts, etc. In general, they need to be pre-processed before being analysed.
Secondary data, surveys and statistics are data collected by researchers, resulting from experiments or surveys, or data resources collected by governments or organisations, including ethnographic data.

Naturally occurring data

Texts

Project Gutenberg: A metadata-rich library of over 60,000 free eBooks.

Google Books corpus: The dataset provides yearly frequencies of n-grams (technically combinations of adjacent words or letters of length n, informally words) for millions of books in different languages. The Viewer allows you to visualise the data online for simple searches.

CMU Book Summary Dataset: This dataset contains plot summaries for 16,559 books extracted from Wikipedia, along with aligned metadata from Freebase, including book author, title, and genre.

CMU Movie Summary Corpus: 42,306 movie plot summaries extracted from Wikipedia + aligned metadata extracted from Freebase, including Movie box office revenue, genre, release date, runtime, language, character names and aligned information about the actors who portray them, including gender and estimated age at the time of the movie's release.

HathiTrust: HathiTrust is a digital library that offers reading access (to the fullest extent allowable by U.S. copyright law) and computational access for scholarly research to the entire corpus of more than 17 million digitised volumes. It requires institutional membership, but a guest login (with some limitations) is available.

The New York Times Annotated Corpus: The New York Times Annotated Corpus contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with article metadata provided by the New York Times Newsroom, the New York Times Indexing Service and the online production staff.

NOW Corpus: The NOW (News on the Web) Corpus contains 10.8 billion words (and counting) from web-based newspapers and magazines from 2010 to the present time. Part of corpusdata.org, a collection of ten downloadable full-text corpora (English, Spanish, Portuguese).

ParlSpeech: Full-text corpora of 6.3 million parliamentary speeches in the key legislative chambers of Austria, the Czech Republic, Germany, Denmark, the Netherlands, New Zealand, Spain, Sweden, and the United Kingdom, covering periods between 21 and 32 years.

Quotebank: Quotebank is a dataset of 178 million unique, speaker-attributed quotations that were extracted from 196 million English news articles crawled from over 377 thousand web domains between August 2008 and April 2020.
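
Several of the text resources above, such as the Google Books corpus, are distributed as n-gram frequencies rather than as full texts. As a rough illustration of what an n-gram count is, the following minimal sketch counts word n-grams in a toy sentence. It uses Python's standard library only; the `ngrams` helper and the sample sentence are made up for this example and do not reflect how any of the listed corpora were actually built. In R, the tidytext package covered in Text Mining with R provides the same kind of tokenisation.

```python
from collections import Counter


def ngrams(text, n):
    """Return the list of word n-grams (as tuples) found in `text`."""
    words = text.lower().split()
    # Slide a window of length n over the list of words.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


# Toy sentence standing in for a real corpus (illustrative only).
sample = "the cat sat on the mat and the cat slept"

unigram_counts = Counter(ngrams(sample, 1))
bigram_counts = Counter(ngrams(sample, 2))

print(unigram_counts[("the",)])       # occurrences of "the"
print(bigram_counts[("the", "cat")])  # occurrences of "the cat"
```

Real corpora need more careful tokenisation (punctuation, casing, language-specific rules) than the simple whitespace split used here.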

Visuals

Faces extracted from Time Magazine 1923-2014: A dataset consisting of 327,322 faces from 3,389 issues of Time magazine.

YouTube 8M: YouTube-8M is a large-scale video dataset that consists of more than 6 million YouTube video IDs, with machine-generated annotations from a diverse vocabulary of 3,800+ visual entities.

The Quick, Draw! Dataset: The Quick Draw Dataset is a collection of 50 million drawings across 345 categories, contributed by players of the game Quick, Draw!. Drawings are timestamped and tagged with metadata including what the player was asked to draw and in which country the player was located.

CelebA: The CelebA dataset contains more than 200,000 images of 10,000+ celebrities, each annotated with 40 yes/no variables.

Stanford Cable TV News Analyzer: The dataset includes near 24-7 recordings of CNN, Fox News, and MSNBC between January 1, 2010 and August 20, 2020, provided by the Internet Archive's TV News Archive. Videos are labelled with metadata generated with face detection and can be searched by name. The output provides a timeline with minutes of screen time per unit of time.

Audio

Million Songs Dataset: A freely-available collection of audio features and metadata for a million contemporary popular music tracks.

LibriSpeech ASR corpus: LibriSpeech is a corpus of approximately 1000 hours of read English speech, derived from read audiobooks from the LibriVox project. Hosted on OpenSLR, which collects other speech and language datasets.

FSD: The Freesound datasets include a variety of everyday sounds (297,144 audio samples), from human and animal sounds to music and sounds made by things, all under Creative Commons licenses.

The Global Jukebox: A dataset of songs and dances from hundreds of regions of the world. Based on the collection of ethnomusicologist Alan Lomax. The songs can be played directly on the website, and all data are available in a dedicated GitHub repository.

Social media

COVID-19-TweetIDs: The repository contains an ongoing collection of tweet IDs (123 million and counting) associated with the novel coronavirus COVID-19 (SARS-CoV-2), which commenced on January 28, 2020.

2020 US Presidential Election Tweet IDs: The repository contains an ongoing collection of tweet IDs associated with the 2020 United States presidential elections, with the data collection starting on May 20, 2019. It contained 240 million tweets as of October 2020.

Twitter Information Operation/Transparency Reports: Publicly available archives of tweets and media believed (by Twitter) to result from potentially state-backed information operations on the platform. Access requires providing an email address.

Reddit Comment and Thread Data: Around 260,000 texts of threads / comments scraped from Reddit, with metadata including upvotes and downvotes and the subreddit (topic).

Foursquare Dataset: Various datasets with data extracted from Foursquare, a location-based social network.

storywrangler: Data on the daily frequency of n-grams in ~10% of public messages on Twitter since 2008. Described in this paper.

Miscellaneous

Tesco Grocery 1.0: The Tesco Grocery 1.0 dataset is a record of 420M food items purchased by 1.6M fidelity card owners who shopped at the 411 Tesco stores in Greater London over the course of the entire year of 2015, aggregated at the level of census areas to preserve anonymity. For each area, we report the number of transactions and nutritional properties of the typical food item bought including the average caloric intake and the composition of nutrients. The set of global trade international numbers (barcodes) for each food type is also included. Described in this paper.

GeoLife GPS Trajectories: A GPS trajectory dataset on the movements of 182 users in a period of over three years (from April 2007 to August 2012). This dataset contains 17,621 trajectories with a total distance of about 1.2 million kilometers and a total duration of 48,000+ hours (91 percent of the trajectories are logged in a dense representation, e.g. every 1~5 seconds or every 5~10 meters per point).

Amazon Review Data: 233.1 million Amazon reviews, from May 1996 to Oct 2018, divided by category, with metadata including product and reviewer IDs.

The Multilingual Amazon Reviews Corpus: A collection of Amazon reviews specifically designed for multilingual research. The dataset contains reviews in English, Japanese, German, French, Chinese and Spanish (~200,000 for each), balanced across the 5 possible star ratings.

Google Trends: Google Trends provides the popularity of top search queries in Google. It is possible to focus the search across various regions and languages.

Secondary data, surveys and statistics

Secondary data

Databrary: An NYU data library for developmental scientists to securely store, manage, share, discover, and reuse research data, including videos, audio files, procedures and stimuli, and related metadata.

Face databases: A collection of several datasets containing images and videos of human faces.

American Multiracial Faces Database: 110 faces (smiling and neutral expression poses) with mixed-race heritage and ratings of various attributes freely available to academic researchers.

Cam-CAN: The Cambridge Centre for Ageing and Neuroscience dataset inventory. Nearly 3000 adults aged 18-90 completed a home interview, and a subset of nearly 700 (100 per decade) were scanned using structural Magnetic Resonance Imaging (MRI), functional MRI (both resting and task-based), magnetoencephalography (MEG), and completed multiple cognitive experiments.

Human Development and Quantitative Methods lab: A collection of secondary datasets, mostly relevant for developmental psychology, from The Human Development and Quantitative Methods lab at the University of Michigan.

closer: closer provides access to high quality data from nine UK longitudinal studies. Quick access to all information on their website.

GAAIN: Collection of various datasets (neuropsychological, functional, and psychiatric variables) from the Global Alzheimer’s Association Interactive Network.

ABCD: Adolescent Brain Cognitive Development study. Data for 4,500 US adolescents, exploring how childhood experiences (such as sports, videogames, social media, unhealthy sleep patterns, and smoking) interact with each other and with a child’s changing biology to affect brain development and social, behavioral, academic, health, and other outcomes.

RAVDESS: The Ryerson Audio-Visual Database of Emotional Speech and Song contains 7356 files. The database contains 24 professional actors (12 female, 12 male), vocalising two lexically-matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions, and song contains calm, happy, sad, angry, and fearful emotions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression.

MIDSS: The Measurement Instrument Database for the Social Sciences is a repository of instruments used to collect data from across the social and psychological sciences. Each instrument is associated with the paper(s) that used it, and most of these have data available.

AOMIC: The Amsterdam Open MRI Collection, a set of multimodal MRI datasets for individual difference analyses. Three datasets (links in the paper) with multimodal (3T) MRI data including structural (T1-weighted), diffusion-weighted, and (resting-state and task-based) functional BOLD MRI data, as well as detailed demographics and psychometric variables from a large set of healthy participants (N = 928, N = 226, and N = 216).

Small World of Words: over 3 million associative responses to 12,292 cues from 90,000 people (in English and Dutch).

Millennium Cohort Study: The Millennium Cohort Study is following the lives of around 19,000 young people born across England, Scotland, Wales and Northern Ireland in 2000-02.

Open-Source Psychometrics Project: Data from various common psychological questionnaires.

OpenNEURO: A free and open platform for sharing MRI, MEG, EEG, iEEG, and ECoG data - 400 datasets and counting.

CONNECTOME: A collection of various datasets (neuroimaging, personality, health) from the Human Connectome Project.

The MatchNMingle dataset: A multi-sensor resource for the analysis of social interactions and group dynamics in-the-wild during free-standing conversations and speed dates (it includes data from wearable acceleration, binary proximity, video, audio, personality surveys, frontal pictures and speed-date responses).

childes-db: An open database storing child language datasets from CHILDES, containing transcripts and recordings relevant to the study of child language acquisition.

Cross-cultural linguistic/ethnographic datasets

eHRAF: eHRAF World Cultures is a collection of ethnographic documents for 320 cultures. The documents are hand-coded with keywords at paragraph level to facilitate search. It is not freely available and requires institutional membership. A 30-day trial access is offered to individual researchers.

PULOTU: 116 Austronesian cultures coded for 62 variables on religion, history, society, and the natural environment.

C&E Cultural Variable Dataset: an open-access database of cultural similarities and differences that have been investigated around the world (defined here as at least 20 societies).

DRH: The Database of Religious History consists of 397 entries on religious groups/places, with coded responses to poll questions.

Seshat: Various datasets linked to the "Seshat: Global History Databank" project. Historical, political, economic and religious variables for various areas around the world.

D-PLACE: The Database of Places, Language, Culture and Environment codes more than 2,000 variables for almost 2,000 societies.

WALS: The World Atlas of Language Structures (WALS) is a large database of structural (phonological, grammatical, lexical) properties of 2,662 languages.

Surveys

World Value Survey: A global research project that explores people's values and beliefs, how they change over time, and what social and political impact they have.

Reasons for Facebook usage: data from 46 countries: Reasons for Facebook usage and demographic data for >16,000 individuals from 46 countries. Described in detail in this paper.

Afrobarometer: A series of public attitude surveys on democracy, governance, the economy and society in more than 30 African countries repeated on a regular cycle.

Asian Barometer: The Asian Barometer Survey is a cross-national survey project on democracy, governance and development. Five waves of survey data are available, covering eight to 14 countries.

Eurobarometer: A series of public opinion surveys conducted regularly on behalf of the European Commission since 1973. These surveys address a wide variety of topical issues relating to the European Union.

Latinobarometro: Data on an annual public opinion survey (from 1995) that involves some 20,000 interviews in 18 Latin American countries, representing more than 600 million people.

Pew Research Center: Public opinion polling, demographic research, content analysis and other data-driven research on various topics. Data download requires a free account.

General statistics

UK Data Service: The UK's largest collection of social, economic and population data resources.

London Datastore: A free and open data-sharing portal where anyone can access data relating to London. The site provides over 700 datasets.

IPUMS: IPUMS provides census and survey data from around the world. It includes almost a billion records from U.S. censuses from 1790 to the present and over a billion records from the international censuses of over 100 countries. In addition, it includes survey data describing 1.4 billion individuals drawn from over 750 censuses and surveys.

Human Development Data (1990-2018): Various demographic, social, and economic data collected by UNDP on key aspects of human development.

UNdata: Official statistics produced by countries and compiled by the United Nations data system, as well as estimates and projections. The domains covered are agriculture, crime, education, energy, industry, labour, national accounts, population and tourism. You can also find indicators such as the Millennium Development Goals.

Womanstats project: Covers 170,000 data points: over 350 variables for 175 nations with populations greater than 200,000 persons. Variables include those relating to nine aspects of women’s situation and security: Physical Security, Economic Security, Legal Security, Security in the Community, Security in the Family, Security for Maternity, Security Through Voice, Security Through Societal Investment in Women, Security in the State.

ARDA: The Association of Religion Data Archives includes a curated collection of 1,000 files with various data on religion.

Archives

The Opie Archive: Questionnaires, letters and short essays by schoolchildren, c.1947-1989, describing rhymes, games, school and playground lore and activities.

DIY

With some more work, it is possible to harvest the online data we are interested in directly from social media or web pages. This section covers some resources for doing that. They require programming experience.

Social media data collection

rtweet: R client for accessing Twitter's REST and stream APIs.

Exploring tweets in R: A basic tutorial showing how to use the rtweet package to retrieve tweets and explore the basic results.

API reference index: Official documentation of the Twitter APIs.

Getting started with the Twitter API v2 for academic research: An introductory course to getting started with the Twitter API for research, assuming very basic knowledge of R or Python.

tuber: Access YouTube from R: Get comments posted on YouTube videos, information on how many times a video has been liked, search for videos with particular content, all through R.

YouTube Data API Overview: Official guide to YouTube APIs.

RedditExtractoR: an R package to retrieve data from Reddit.

Reddit APIs: Documentation for the Reddit APIs.

Querying APIs in R: A more general introduction to APIs and how to use R for interacting with (some of) them.
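
Most of the APIs listed above return their results as JSON, which then has to be turned into a table-like structure before analysis. The following is a minimal sketch of that step, assuming a made-up response: the payload and its field names ("data", "id", "text", "likes") are hypothetical and do not correspond to any specific API. It is written in Python with the standard library only; in R, the jsonlite package plays a similar role.

```python
import json

# Hypothetical response body, shaped like a typical JSON API reply.
# The field names are assumptions made for this illustration.
response_body = """
{
  "data": [
    {"id": "1", "text": "First post", "likes": 3},
    {"id": "2", "text": "Second post", "likes": 10}
  ]
}
"""


def to_rows(body):
    """Parse a JSON response body and return one flat dict per item."""
    payload = json.loads(body)
    return [
        {"id": item["id"], "text": item["text"], "likes": item["likes"]}
        for item in payload["data"]
    ]


rows = to_rows(response_body)
total_likes = sum(row["likes"] for row in rows)

print(rows)
print(total_likes)
```

In a real project the response body would come from an authenticated HTTP request to the service, following the official documentation linked above.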

Web scraping

rvest: R library to scrape information from web pages.

Tidy web scraping in R — Tutorial and resources: A useful blog post introducing web scraping in R, with practical examples.
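
At its core, web scraping means parsing the HTML of a page and pulling out the elements of interest. The sketch below illustrates the idea on a made-up HTML fragment, collecting the target and text of every link. It uses Python's standard `html.parser` module; the `LinkCollector` class, the page fragment, and the example.org URLs are placeholders invented for this example. In R, rvest offers the same functionality through functions such as `read_html()`, `html_elements()`, and `html_text()`.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []    # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Made-up page fragment; the URLs are placeholders.
page = """
<ul>
  <li><a href="https://example.org/a">Dataset A</a></li>
  <li><a href="https://example.org/b">Dataset <b>B</b></a></li>
</ul>
"""

collector = LinkCollector()
collector.feed(page)
print(collector.links)
```

The same pattern extends to other elements (headings, table cells, paragraphs) by checking for different tag names.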

Links to other collections of resources

Here are other collections of resources similar to this one, some specifically dedicated to psychology, others more general.

Harvard Dataverse: Collection of datasets from Harvard University.

APA datasets: Collection of links to data sets and repositories from the American Psychological Association.

Open psychological datasets: List of public data curated by the Society for the Improvement of Psychological Science.

Secondary data sources: List of secondary data sources compiled by Thomas Pollet.

Secondary data resources for the evolutionary human sciences: List of resources for researchers interested in working with secondary data in psychology and evolutionary human sciences, curated by Rebecca Sear.

Our World in Data: A University of Oxford project collecting more than 3,000 datasets and visualisations "dedicated to a large range of global problems in health, education, violence, political power, human rights, war, poverty, inequality, energy, hunger, and humanity’s impact on the environment".

ICPSR: The Inter-university Consortium for Political and Social Research maintains a data archive of more than 250,000 files of research in the social and behavioral sciences. It hosts 21 specialized collections of data in education, aging, criminal justice, substance abuse, terrorism, and other fields.

The Pudding: Not exactly a collection, but a digital magazine that explores current issues with state-of-the-art data analysis/visualisations. Find the datasets used on GitHub. Many pop culture-related datasets, possibly useful for teaching.

Google Dataset search: A search engine from Google to locate online data freely available for use.

FiveThirtyEight - data: the data and code behind many FiveThirtyEight articles and graphics.

Data Is Plural: An unstructured collection of more than 1,000 (and counting) datasets from the "Data Is Plural" newsletter. Also contains many "curious" datasets, possibly useful for teaching.

data.gov.uk: Open data from UK government, published by central government, local authorities and public bodies.