One of our group's research areas is the analysis of networks built from large-scale Twitter data. Students and postdocs hoping to join the group and do this type of research should familiarize themselves with the methods outlined below:

  1. Look at the lesson on Twitter analysis by postdoc Alexandre Bovet.
  2. Create a Twitter account and a Twitter app.
  3. Download the necessary code and follow the instructions in the README to begin collecting data using the Twitter credentials you created.
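
As a minimal illustration of the authentication step, assuming the Tweepy library used in the lesson (the key and token names below are placeholders to be replaced with the credentials generated for your app; this is a sketch, not the code referenced in step 3):

    import tweepy  # pip install tweepy

    # Placeholders: substitute the keys and tokens generated when you
    # created your Twitter app in step 2.
    auth = tweepy.OAuthHandler("YOUR_CONSUMER_KEY", "YOUR_CONSUMER_SECRET")
    auth.set_access_token("YOUR_ACCESS_TOKEN", "YOUR_ACCESS_TOKEN_SECRET")
    # wait_on_rate_limit makes Tweepy sleep through rate-limit windows.
    api = tweepy.API(auth, wait_on_rate_limit=True)

    # Quick sanity check that the credentials work.
    print(api.verify_credentials().screen_name)
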
2016 US Presidential Election Twitter Dataset:
We release here the two datasets used in our study of the 2016 US Presidential Election.
The first one contains the tweet IDs of 170 million tweets from 11 million users posting about the election between June 1st, 2016 and November 9th, 2016.
It is used in our paper:

See README file and dataset.

The second one contains the retweet networks and the IDs of tweets containing a URL directing to a news outlet website in the corresponding media category.
It is used in our paper:
Bovet, A. & Makse, H. A. Influence of fake news in Twitter during the 2016 US presidential election. Nat. Commun. 10, 7 (2019).
See README file and dataset.
The following tools can be used to “rehydrate” the tweet IDs, i.e., download the full tweet objects from the IDs: https://github.com/DocNow/hydrator, https://github.com/DocNow/twarc, and http://www.tweepy.org/.
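
As a rough sketch of rehydration, assuming Tweepy 3.x (the statuses_lookup endpoint caps each request at 100 IDs; the function name rehydrate is ours):

    def rehydrate(api, tweet_ids):
        """Download full tweet objects for a list of tweet IDs.

        api is an authenticated tweepy.API object (see the sketch above).
        statuses_lookup accepts at most 100 IDs per call, so the IDs are
        sent in batches; deleted or protected tweets are silently
        omitted from the result.
        """
        tweets = []
        for i in range(0, len(tweet_ids), 100):
            tweets.extend(api.statuses_lookup(tweet_ids[i:i + 100]))
        return tweets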

There is some additional information that may be helpful:

  • This code works using the REST API. The rate limit is 450 requests per 15-minute window, each returning at most 100 tweets, i.e., up to 45,000 tweets per 15 minutes. See the following for further information:
    https://developer.twitter.com/en/docs/basics/rate-limits
    https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets
  • The Streaming API, which is not rate-limited, can also be used, but it yields at most around 1% of the total tweet volume. In this case it may be simpler to use an existing Python library rather than the code provided above, as in the lesson by postdoc Alexandre Bovet.
  • The library used in the lesson is Tweepy; prospective users should read its documentation:
    http://www.tweepy.org/
    http://tweepy.readthedocs.io/en/v3.5.0/streaming_how_to.html
  • It is recommended to collect tweets with both methods, using the same keywords: the REST API with the code above and the Streaming API with Tweepy. This allows a comparison of the results and biases of each method (a combined sketch follows this list).
  • To collect tweets from a certain user, either API can be used. The REST API must be run periodically to collect the latest tweets, while the Streaming API collects continuously (but may miss tweets when there is a lot of activity across many tracked users).
  • The REST API is rate-limited per app, so to track several topics at full capacity you will need to create several apps.
  • The Streaming API is limited per user, so anyone using it will need to create several Twitter accounts (with several different IP addresses so as not to be banned by Twitter).
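
As a combined sketch of both collection methods, again assuming Tweepy 3.x as in the lesson (the Tweepy 4.x interface differs; KeywordListener and the keyword "election" are illustrative):

    import tweepy

    # Authenticate as in the sketch above (placeholder credentials).
    auth = tweepy.OAuthHandler("YOUR_CONSUMER_KEY", "YOUR_CONSUMER_SECRET")
    auth.set_access_token("YOUR_ACCESS_TOKEN", "YOUR_ACCESS_TOKEN_SECRET")
    # wait_on_rate_limit makes Tweepy sleep through 15-minute windows.
    api = tweepy.API(auth, wait_on_rate_limit=True)

    # REST (search) API: paginate over recent tweets for a keyword.
    # Cursor handles pagination; count=100 is the per-request maximum.
    tweets = [status for status in
              tweepy.Cursor(api.search, q="election", count=100).items(1000)]

    # Streaming API: collect matching tweets in real time.
    class KeywordListener(tweepy.StreamListener):
        def on_status(self, status):
            print(status.id_str, status.text)

        def on_error(self, status_code):
            if status_code == 420:  # too many connection attempts: back off
                return False

    stream = tweepy.Stream(auth=auth, listener=KeywordListener())
    stream.filter(track=["election"])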

The key here is to carefully read all of the documentation; many of your questions will be answered there.

One breakthrough of our lab is a method that maps the search for the most influential nodes in a random network onto an optimal percolation problem, which is then solved by our Collective Influence (CI) algorithm. While the majority of methods for determining the most influential nodes are heuristic and trial-and-error-based, ours is grounded in complex network theory.
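
As a minimal sketch of the CI score from the paper, CI_l(i) = (k_i - 1) * sum over j in dBall(i, l) of (k_j - 1), where dBall(i, l) is the set of nodes at distance exactly l from i (the greedy loop below is a naive illustration using networkx, not the scalable implementation described in the paper):

    import networkx as nx

    def collective_influence(G, node, ell=2):
        # CI_l(i) = (k_i - 1) * sum of (k_j - 1) over nodes j at
        # distance exactly ell from node.
        dist = nx.single_source_shortest_path_length(G, node, cutoff=ell)
        frontier = [j for j, d in dist.items() if d == ell]
        return (G.degree(node) - 1) * sum(G.degree(j) - 1 for j in frontier)

    def top_influencers(G, n, ell=2):
        # Greedily remove the highest-CI node n times, rescoring the
        # remaining network after each removal.
        G = G.copy()
        chosen = []
        for _ in range(n):
            best = max(G.nodes(), key=lambda i: collective_influence(G, i, ell))
            chosen.append(best)
            G.remove_node(best)
        return chosen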

Please find here the dataset of Twitter retweets used in the associated paper: Morone, F. & Makse, H. A. Influence maximization in complex networks through optimal percolation. Nature 524, 65–68 (2015).