Exporting Twitter data to Gephi using R

Submitted by 梦想与她 on 2019-12-21 02:49:08

Question


I have compiled a dataset consisting of thousands of tweets using R.

The dataset basically looks like this:

Data <- data.frame(
  X = c(1,2),
  text = c("Hello @User1 #hashtag1, hello @User2 and @User3, #hashtag2", "Hello @User2 #hashtag3, hello @User1 and @User3, #hashtag4"),
  screenname = c("author1", "author2")
)

Now I want to export this dataset to a graph format that Gephi supports (see Supported Graph Formats - Gephi).

Whenever an author mentions a @user in the text, there should be a directed link from the author to that user. In the case above, the result should look like this:

author1 -> @User1

author1 -> @User2

author1 -> @User3

author2 -> @User1

author2 -> @User2

author2 -> @User3

How can I transform my dataset and export it to a graph format that Gephi supports?

If possible, I would prefer GEXF or GraphML format. If that is not possible, I can also work with CSV or a spreadsheet.
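
For the CSV fallback, the author-to-mention links can be sketched in base R as a two-column edge list. This is a minimal sketch, not a definitive solution: the column names Source and Target are an assumption about what Gephi's spreadsheet import expects for an edge table, the output path is hypothetical, and the handle pattern is the same simple one used elsewhere in this question (it ignores underscores).

```r
# Example data from the question (strings kept as character, not factor).
Data <- data.frame(
  X = c(1, 2),
  text = c("Hello @User1 #hashtag1, hello @User2 and @User3, #hashtag2",
           "Hello @User2 #hashtag3, hello @User1 and @User3, #hashtag4"),
  screenname = c("author1", "author2"),
  stringsAsFactors = FALSE
)

# Extract every @mention per tweet; the result is a list with one
# character vector per row of Data.
mentions <- regmatches(Data$text, gregexpr("@[A-Za-z0-9]+", Data$text))

# Repeat each author once per mention so the two columns line up.
edge_list <- data.frame(
  Source = rep(Data$screenname, lengths(mentions)),
  Target = unlist(mentions),
  stringsAsFactors = FALSE
)

# Hypothetical output path; write.csv is part of base R.
csv_path <- file.path(tempdir(), "mentions_edges.csv")
write.csv(edge_list, csv_path, row.names = FALSE)
```

Tweets without any mention contribute zero rows, because `lengths(mentions)` is 0 for them.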


I spent the whole night thinking about this problem and made a few steps in the right direction (at least I hope so). But I need your help.

I want to export this dataset to GEXF format so that I can use it in Gephi.

There is an R package for exporting R data to GEXF, called rgexf (see https://bitbucket.org/gvegayon/rgexf/wiki/Installation). To use the package's write.gexf function, I need at least two things:

1) a matrix of all the nodes in the network (in my case authors, users and hashtags)

2) a matrix of all the edges between these nodes (i.e., the connections between authors and users as well as hashtags).

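In my Twitter data, authors are never written with "@", although they can also appear as mentioned users. So I first have to prefix the author names with "@" to avoid duplicate nodes:

# R is case-sensitive: the data frame is Data, not data.
# "^@?" makes the prefixing safe to run more than once.
Data$screenname <- sub("^@?", "@", Data$screenname)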

Then I need a matrix of all the nodes in my network (i.e., authors, users and hashtags). According to this example, the output should look like this:

people <- data.frame(matrix(c(1:9, '@author1', '@author2', '@user1', '@user2', '@user3', '#hashtag1', '#hashtag2', '#hashtag3', '#hashtag4'),ncol=2))

Then I need a matrix of all the edges between these nodes. According to this example, the output should look like this:

relations <- data.frame(matrix(c(1,3,1,4,1,5,1,6,1,7,2,4,2,3,2,5,2,8,2,9), ncol=2, byrow=TRUE))

Finally, I only have to put these two things together:

write.gexf(people, relations)

to get the following file:

<?xml version="1.0" encoding="UTF-8"?>
<gexf xmlns="http://www.gexf.net/1.2draft" xmlns:viz="http://www.gexf.net/1.1draft/viz" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.gexf.net/1.2draft http://www.gexf.net/1.2draft/gexf.xsd" version="1.2">
  <meta lastmodifieddate="2015-02-04">
    <creator>NodosChile</creator>
    <description>A graph file writing in R using "rgexf"</description>
    <keywords>gexf graph, NodosChile, R, rgexf</keywords>
  </meta>
  <graph mode="static" defaultedgetype="undirected">
    <nodes>
      <node id="1" label="@author1"/>
      <node id="2" label="@author2"/>
      <node id="3" label="@user1"/>
      <node id="4" label="@user2"/>
      <node id="5" label="@user3"/>
      <node id="6" label="#hashtag1"/>
      <node id="7" label="#hashtag2"/>
      <node id="8" label="#hashtag3"/>
      <node id="9" label="#hashtag4"/>
    </nodes>
    <edges>
      <edge id="0" source="1" target="3" weight="1"/>
      <edge id="1" source="1" target="4" weight="1"/>
      <edge id="2" source="1" target="5" weight="1"/>
      <edge id="3" source="1" target="6" weight="1"/>
      <edge id="4" source="1" target="7" weight="1"/>
      <edge id="5" source="2" target="4" weight="1"/>
      <edge id="6" source="2" target="3" weight="1"/>
      <edge id="7" source="2" target="5" weight="1"/>
      <edge id="8" source="2" target="8" weight="1"/>
      <edge id="9" source="2" target="9" weight="1"/>
    </edges>
  </graph>
</gexf>
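
Note that the file above declares defaultedgetype="undirected", whereas the links described in this question run from an author to a mentioned user. A hedged sketch of requesting directed edges and writing the result to disk follows. It assumes the rgexf package is installed and that write.gexf accepts the defaultedgetype and output arguments as described in the package documentation; the small node and edge tables and the output path are made up for illustration.

```r
# Tiny illustrative node and edge tables (not the full example network).
people <- data.frame(
  id = 1:3,
  label = c("@author1", "@User1", "@User2"),
  stringsAsFactors = FALSE
)
relations <- data.frame(source = c(1, 1), target = c(2, 3))

# Hypothetical output path.
gexf_path <- file.path(tempdir(), "twitter.gexf")

# Only attempt the export when rgexf is available.
if (requireNamespace("rgexf", quietly = TRUE)) {
  rgexf::write.gexf(
    nodes = people,
    edges = relations,
    defaultedgetype = "directed",  # edges run from source to target
    output = gexf_path             # write to a file
  )
}
```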

But how can I automatically extract the nodes and the relations between these nodes (the edges) from the example above and write them to two matrices?

Does nobody know how to solve my problem?

I tried to figure out how to extract the nodes from my example (i.e., the authors, users and hashtags) and save them to a data.frame (I am sure there is a shorter and more elegant way to do it!):

library(stringi)  # provides stri_extract_all

# extract users and hashtags from text, authors from screenname
Users <- stri_extract_all(Data$text, regex = "@[A-Za-z0-9]+")
Hash <- stri_extract_all(Data$text, regex = "#[A-Za-z0-9]+")
# add "@" to the author names ("^@?" avoids a double "@" if it is already there)
Data$screenname <- sub("^@?", "@", Data$screenname)
Authors <- stri_extract_all(Data$screenname, regex = "@[A-Za-z0-9]+")
# convert the lists to vectors
Users <- unlist(Users)
Hash <- unlist(Hash)
Authors <- unlist(Authors)
# drop the NAs returned for texts without a match
Users <- Users[!is.na(Users)]
Hash <- Hash[!is.na(Hash)]
# merge the vectors into a single vector and drop the duplicates
nodes <- unique(c(Authors, Users, Hash))
# store the labels in a data.frame and give each node a unique integer ID
nodes <- data.frame(ID = seq_along(nodes), label = nodes, stringsAsFactors = FALSE)

But how can I build a data.frame for the edges?

There must be a way to write a function that checks, row by row, whether an author has mentioned a user and/or a hashtag, and writes the result into a new data.frame using the IDs of the authors, users and hashtags. Every connection should be stored in two columns: source and target (e.g., 1, 2).
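
One possible way to build both tables without an explicit row-by-row loop is sketched below in base R. It is a sketch under assumptions, not a definitive answer: labels are treated as case-sensitive (so @User1 and @user1 would be different nodes), the handle and hashtag patterns are the simple ones used above, and the variable names are illustrative.

```r
# Example data from the question (strings kept as character, not factor).
Data <- data.frame(
  X = c(1, 2),
  text = c("Hello @User1 #hashtag1, hello @User2 and @User3, #hashtag2",
           "Hello @User2 #hashtag3, hello @User1 and @User3, #hashtag4"),
  screenname = c("author1", "author2"),
  stringsAsFactors = FALSE
)

# Prefix authors with "@" so they share a namespace with mentioned users.
authors <- sub("^@?", "@", Data$screenname)

# One character vector of matches per tweet.
users    <- regmatches(Data$text, gregexpr("@[A-Za-z0-9]+", Data$text))
hashtags <- regmatches(Data$text, gregexpr("#[A-Za-z0-9]+", Data$text))

# Per tweet: everything the author links to (users first, then hashtags).
targets <- Map(c, users, hashtags)

# Node table: authors, then users, then hashtags, without duplicates.
node_labels <- unique(c(authors, unlist(users), unlist(hashtags)))
nodes <- data.frame(
  ID = seq_along(node_labels),
  label = node_labels,
  stringsAsFactors = FALSE
)

# Edge table: repeat each author once per target, then translate
# labels to node IDs with match().
edges <- data.frame(
  source = match(rep(authors, lengths(targets)), nodes$label),
  target = match(unlist(targets), nodes$label)
)
```

For the example data this yields nine nodes and the same ten source/target pairs as the hand-written relations matrix earlier in the question. The two data frames could then be passed to write.gexf(nodes, edges), assuming rgexf is installed.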

Source: https://stackoverflow.com/questions/28302705/exporting-twitter-data-to-gephi-using-r
