A number of predictors have been suggested to detect the most influential spreaders of information in online social media across various domains such as Twitter or Facebook. In particular, degree, PageRank, k-core and other centralities have been adopted to rank the spreading capability of users in information dissemination media. So far, validation of the proposed predictors has been done by simulating the spreading dynamics rather than following real information flow in social networks. Consequently, only model-dependent contradictory results have been achieved so far for the best predictor. Here, we address this issue directly. We search for influential spreaders by following the real spreading dynamics in a wide range of networks. We find that there are plausible situations where the widely-used degree and PageRank fail in ranking users’ influence. We find that the best spreaders are consistently located in the k-core across dissimilar social platforms such as Twitter, Facebook, Livejournal and scientific publishing. Furthermore, when the complete global network structure is unavailable, we find that the sum of the nearest neighbors’ degree is a reliable local proxy for user’s influence. Our analysis provides practical instructions for optimal design of strategies for “viral” information dissemination in relevant applications.
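The two structural measures named in the abstract, the k-core (k-shell) index and the sum of the nearest neighbors' degrees, can be illustrated with a minimal sketch. This is the standard iterative pruning procedure written from scratch, not the authors' code; the function names `build_adjacency`, `k_shell_index` and `neighbor_degree_sum` are hypothetical, and the graph is assumed to be undirected and given as a list of node pairs.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build undirected adjacency sets from an iterable of (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # self-loops are ignored in this sketch
            adj[u].add(v)
            adj[v].add(u)
    return dict(adj)


def k_shell_index(adj):
    """Assign each node its k-shell index by repeatedly pruning low-degree nodes."""
    degree = {node: len(nbrs) for node, nbrs in adj.items()}
    remaining = set(adj)
    shell = {}
    k = 0
    while remaining:
        # Nodes whose remaining degree is at most k belong to shell k.
        prune = [node for node in remaining if degree[node] <= k]
        if not prune:
            k += 1
            continue
        while prune:
            node = prune.pop()
            if node not in remaining:
                continue
            remaining.discard(node)
            shell[node] = k
            # Removing a node lowers the remaining degree of its neighbors.
            for nbr in adj[node]:
                if nbr in remaining:
                    degree[nbr] -= 1
                    if degree[nbr] <= k:
                        prune.append(nbr)
    return shell


def neighbor_degree_sum(adj):
    """For each node, sum the degrees of its nearest neighbors."""
    return {node: sum(len(adj[nbr]) for nbr in nbrs) for node, nbrs in adj.items()}


# Toy graph: a triangle (1, 2, 3) with a pendant node 4 attached to node 3.
toy = build_adjacency([(1, 2), (2, 3), (1, 3), (3, 4)])
shells = k_shell_index(toy)
local_proxy = neighbor_degree_sum(toy)
```

The neighbor-degree sum only needs each node's immediate neighborhood, which is why it can serve as a local measure when the full network is not available.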
The data that we used in this study can be downloaded here, compressed in rar format:
(1). APS Dataset This file contains the coauthorship and citations of all scientific papers published in American Physical Society (APS) journals until 2005, including Physical Review A, B, C, D, E and Physical Review Letters. Each node represents an author of scientific papers. If two authors coauthor a paper, we construct a link between them. Also, if one author cites another author’s paper, we put a directed diffusion link from the cited author to the citing author.
Coauthorship: apscoauthor.csv This dataset contains the authorship relations for authors of scientific papers in Physical Review A, B, C, D, E and Physical Review Letters.
Papers and authors are represented by numeric integer IDs.
Paper 1 is written by authors 1 and 2. Paper 2 is written by author 3.
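As a rough illustration of how coauthorship links could be derived from the authorship relations, the sketch below groups authors by paper and links every pair of authors on the same paper. The column order (paper ID first, author ID second), the presence of a header row, and the helper names `read_two_column_csv` and `coauthor_links` are assumptions made for this example; check them against the actual file.

```python
import csv
from collections import defaultdict
from itertools import combinations


def read_two_column_csv(path, has_header=True):
    """Read the first two columns of a CSV file (header row is an assumption)."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        if has_header:
            next(reader, None)
        return [(row[0], row[1]) for row in reader if len(row) >= 2]


def coauthor_links(rows):
    """Turn (paper_id, author_id) rows into a set of undirected author pairs."""
    authors_by_paper = defaultdict(set)
    for paper_id, author_id in rows:
        authors_by_paper[int(paper_id)].add(int(author_id))
    links = set()
    for authors in authors_by_paper.values():
        # Every pair of authors on the same paper gets one link.
        for a, b in combinations(sorted(authors), 2):
            links.add((a, b))
    return links


# The example above: paper 1 by authors 1 and 2, paper 2 by author 3.
example_links = coauthor_links([(1, 1), (1, 2), (2, 3)])
```

With the real file, one might call `coauthor_links(read_two_column_csv("apscoauthor.csv"))`. A single-author paper such as paper 2 contributes no link.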
Citations: apscitation.csv This dataset records the citation relations of authors in APS journals.
Authors’ IDs are the same as those in apscoauthor.csv. Each row records a citation instance. The author with CitingID has cited a paper of the author with CitedID in his/her own paper. There may be duplicated rows, since one author can cite other authors’ papers many times.
Author 183697 has cited the papers of authors 24, 25 and 26 in his/her papers.
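Because duplicated rows carry information (repeated citations), one reasonable way to use them is as weights on the directed diffusion links, which point from the cited author to the citing author. The sketch below assumes each row is ordered as (CitingID, CitedID); the function name `diffusion_link_weights` is hypothetical.

```python
from collections import Counter


def diffusion_link_weights(citation_rows):
    """Count citation instances per directed diffusion link (cited -> citing)."""
    weights = Counter()
    for citing_id, cited_id in citation_rows:
        # Information flows from the cited author to the citing author.
        weights[(int(cited_id), int(citing_id))] += 1
    return weights


# The example above, with one duplicated row added for illustration.
example_weights = diffusion_link_weights(
    [(183697, 24), (183697, 25), (183697, 26), (183697, 24)]
)
```

Dropping the counts and keeping only the keys gives the unweighted diffusion network.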
(2). Facebook Dataset This dataset is available online at http://socialnetworks.mpi-sws.org/data-wosn2009.html. It contains the friend relations of the New Orleans Facebook social network as well as the wall post records of users during a period of nearly two years. The social network has 63731 nodes with average degree 24.3. The total number of wall posts is 876992.
List of links
These files contain a list of all of the user-to-user links from the Facebook New Orleans networks. All links are treated as directed, even though they are undirected on Facebook.
Format: Gzipped ASCII. Each line contains two anonymized user identifiers, meaning the second user appeared in the first user’s friend list. The third column is a UNIX timestamp with the time of link establishment (if it could be determined, otherwise it is ‘\N’).
Data: Facebook Links (10.4MB)
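A minimal sketch of reading the links file as described: each line is split on whitespace, and a third column of `\N` is mapped to `None`. The whitespace delimiter, the helper names `parse_link_line` and `read_links`, and the file path passed to `read_links` are assumptions for illustration.

```python
import gzip


def parse_link_line(line):
    """Parse one line into (user, friend, timestamp or None)."""
    fields = line.split()
    user, friend = fields[0], fields[1]
    raw_time = fields[2] if len(fields) > 2 else r"\N"
    # '\N' marks a link whose establishment time could not be determined.
    timestamp = None if raw_time == r"\N" else int(raw_time)
    return user, friend, timestamp


def read_links(path):
    """Read every non-empty line of a gzipped ASCII links file."""
    with gzip.open(path, "rt", encoding="ascii") as handle:
        return [parse_link_line(line) for line in handle if line.strip()]
```

User identifiers are kept as strings here because they are anonymized labels rather than quantities.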
List of wall posts
These files contain a list of all of the wall posts from the Facebook New Orleans networks.
Format: Gzipped ASCII. Each line contains two anonymized user identifiers, meaning the second user posted on the first user’s wall. The third column is a UNIX timestamp with the time of the wall post.
Data: Facebook Wall Posts (6.8MB)
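Wall posts can be read in much the same way. The sketch below converts the UNIX timestamp to a UTC datetime and counts posts per (poster, wall owner) pair. It assumes whitespace-separated columns; `parse_wall_post` and `posts_per_pair` are hypothetical names.

```python
from collections import Counter
from datetime import datetime, timezone


def parse_wall_post(line):
    """Parse one line into (wall_owner, poster, UTC datetime of the post)."""
    owner, poster, raw_time = line.split()[:3]
    return owner, poster, datetime.fromtimestamp(int(raw_time), tz=timezone.utc)


def posts_per_pair(lines):
    """Count wall posts for each (poster, wall_owner) pair."""
    counts = Counter()
    for line in lines:
        if line.strip():
            owner, poster, _ = parse_wall_post(line)
            # The second column is the user who wrote the post.
            counts[(poster, owner)] += 1
    return counts
```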
(3). Twitter Dataset This file contains the mention network and retweet relations extracted from the tweets sampled between January 23rd and February 8th, 2011, provided by Twitter (http://trec.nist.gov/data/tweets/). We are not allowed to distribute any private information about the Twitter users, so in the dataset each user is represented by an anonymized ID. The largest component of the mention network has 2870418 nodes with average in-degree 1.7. The number of retweet instances is 1000221.
Twitter mention network: twitternet.csv This dataset contains the mention network constructed from the mention relations between Twitter users extracted from the tweets.
Users are represented by numeric integer IDs. Each line represents a single mention record. The user with MentionerID has mentioned the user with MentionedID at least once in his/her tweets. There are no duplicated rows in this dataset: even though one user may mention another user many times, only one link exists between them.
User 1 has mentioned users 2 and 3 in his/her tweets.
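To make the direction of the mention links concrete, the sketch below computes each user's in-degree (number of distinct users who mentioned them) and the average in-degree, which for a directed network equals the number of links divided by the number of nodes. It assumes rows ordered as (MentionerID, MentionedID) and does not extract the largest component; `mention_in_degree` is a hypothetical name.

```python
from collections import Counter


def mention_in_degree(rows):
    """Return (in-degree per user, average in-degree) for mention rows."""
    # Deduplicate defensively, although the file is described as duplicate-free.
    edges = {(int(mentioner), int(mentioned)) for mentioner, mentioned in rows}
    in_degree = Counter(mentioned for _, mentioned in edges)
    nodes = {node for edge in edges for node in edge}
    average = len(edges) / len(nodes) if nodes else 0.0
    return in_degree, average


# The example above: user 1 has mentioned users 2 and 3.
example_in_degree, example_average = mention_in_degree([(1, 2), (1, 3)])
```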
Retweet: retweet.csv This dataset records the retweet relations of users.
Users’ IDs are the same as those in twitternet.csv. Each row records a retweet instance. The user with RetwitterID has retweeted a tweet of the user with TwitterID in his/her own tweet. There may be duplicated rows, since one user can retweet other users’ tweets many times.
Users 2 and 3 have retweeted tweets of user 1.
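One simple way to summarize the retweet records is to count, for each original author, the total number of retweet instances and the number of distinct retweeters. This is only an illustrative summary, not necessarily the influence measure used in the study. Rows are assumed to be ordered as (RetwitterID, TwitterID), and `retweet_stats` is a hypothetical name.

```python
from collections import Counter, defaultdict


def retweet_stats(rows):
    """Return (retweet instances per author, distinct retweeters per author)."""
    received = Counter()
    retweeters = defaultdict(set)
    for retweeter_id, author_id in rows:
        author = int(author_id)
        received[author] += 1  # duplicated rows count as separate instances
        retweeters[author].add(int(retweeter_id))
    distinct = {author: len(users) for author, users in retweeters.items()}
    return received, distinct


# The example above, with one repeated retweet by user 2 added for illustration.
example_received, example_distinct = retweet_stats([(2, 1), (3, 1), (2, 1)])
```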
For further information or help with the files, please contact Hernan Makse.