Open Twitter datasets for COVID-19 research

27 Aug 2021

By Dong Danping, Librarian, Research Services

Academic researchers use tweets to study public conversation and sentiment around important issues. As Twitter makes it easier to extract tweets (read our article on this), they are becoming a more attractive data source.

One of the biggest events in recent times is the COVID-19 pandemic, but how do you decide which tweets to collect? While you can identify relevant tweets by querying hashtags or keywords, you may not need to reinvent the wheel.

Many researchers around the world have collected, and in some cases processed and labelled, collections of tweets covering different aspects of COVID-19 that you can consider reusing.

Note: most open Twitter datasets provide only Tweet IDs because of restrictions in Twitter’s terms and conditions. The IDs need to be ‘re-hydrated’ (refer to the Beginner’s guide to Twitter data) to retrieve the actual content of the tweets. Please remember to cite the datasets if you plan to use them.
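As a rough illustration of the preparation that re-hydration involves, the sketch below cleans a list of Tweet IDs and groups them into batches for lookup. It makes no network calls. The batch size of 100 reflects the per-request limit of the Twitter API v2 tweet lookup endpoint at the time of writing. The function names, the sample IDs and the choice of fields are hypothetical and are not taken from any of the datasets listed here.

```python
from typing import Dict, Iterable, Iterator, List

# Assumption: the tweet lookup endpoint accepts up to 100 IDs per request.
BATCH_SIZE = 100


def clean_ids(lines: Iterable[str]) -> List[str]:
    """Return numeric Tweet IDs from raw lines, in order, without duplicates.

    Each line may be a bare ID or a CSV row whose first column is the ID
    (for example "id,sentiment_score"). Header rows and blank lines are skipped.
    """
    seen = set()
    ids: List[str] = []
    for line in lines:
        token = line.strip().split(",")[0].strip()
        if token.isdigit() and token not in seen:
            seen.add(token)
            ids.append(token)
    return ids


def batches(ids: List[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def lookup_params(batch: List[str]) -> Dict[str, str]:
    """Build query parameters for one lookup request (hypothetical field choice)."""
    return {"ids": ",".join(batch), "tweet.fields": "created_at,lang"}


if __name__ == "__main__":
    # Made-up sample rows standing in for a downloaded ID file.
    sample = ["tweet_id,sentiment", "1234567890123456789,0.5", "1234567890123456790,-0.2"]
    for batch in batches(clean_ids(sample)):
        print(lookup_params(batch))
```

Each dictionary returned by lookup_params could then be passed to an HTTP client or to a dedicated hydration tool. Tweets that have since been deleted or made private will not be returned, so a re-hydrated dataset is usually smaller than the published ID list.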

Related research and datasets collected by SMU researchers

Did you know that our own SMU researchers are also collecting and studying COVID-19 Twitter data? Dr. Richard Crowley, Assistant Professor of Accounting, has based some of his research papers on Twitter data (e.g. on firm CSR reputation and discretionary dissemination). One of his ongoing social media research projects studies the sentiments and emotions surrounding COVID-19 in the US. His team has collected over 1 billion tweets from February 2020 to the present, using relevant COVID hashtags and a secondary country-specific classification, covering over 200 countries/regions and more than 20 languages.

A large-scale COVID-19 Twitter chatter dataset for open scientific research - an international collaboration

This is probably one of the most comprehensive open COVID-19 datasets available. Data coverage starts from January 2020 and the dataset is still being updated bi-weekly. It contains more than 1 billion unique tweets.

Tracking Social Media Discourse About the COVID-19 Pandemic: Development of a Public Coronavirus Twitter Data Set

A multilingual COVID-19 Twitter dataset containing an ongoing collection of tweet IDs, starting from January 2020 and updated weekly, made available via the authors' GitHub repository.

A COVID-19 Rumor Dataset

Read this data article to find out more about the COVID-19 Rumor Dataset, which contains 6,834 rumors (4,129 from news and 2,705 from tweets) with sentiment and stance labels, as well as reposts and retweets of the rumors. The dataset is openly available on GitHub.

Coronavirus (COVID-19) Tweets Dataset

This is a large-scale Twitter dataset that contains IDs and sentiment scores of the tweets related to COVID-19. Read the related paper that explains this dataset in greater detail: Design and analysis of a large-scale COVID-19 tweets dataset.

Global Reactions to COVID-19 on Twitter: A Labelled Dataset with Latent Topic, Sentiment and Emotion Attributes

As strong concerns and emotions might be present in publicly available tweets discussing COVID-19, all tweets from this dataset are annotated with latent semantic attributes such as topic, emotions, and their degree of intensity. There is a Singapore subset of the data if you wish to focus your research locally.