I have compiled a dataset consisting of thousands of tweets using R.
The dataset basically looks like this:
Data <- data.frame(
X = c(1,2),
text = c("Hello @User1 #hashtag1, hello @User2 and @User3, #hashtag2", "Hello @User2 #hashtag3, hello @User1 and @User3, #hashtag4"),
screenname = c("author1", "author2")
)
Now I want to export this dataset to a Gephi-supported graph format (see Supported Graph Formats - Gephi).
Whenever an "author" mentions a @user in the text, there should be a directed link from the author to the user. In the case above, the results should look like this:
author1 -> @User1
author1 -> @User2
author1 -> @User3
author2 -> @User1
author2 -> @User2
author2 -> @User3
How can I manipulate my dataset and export it to a Gephi-supported graph format?
If possible, I would prefer the GEXF or GraphML format. If that is not possible, I can also work with CSV or a spreadsheet.
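For the CSV fallback, one possible approach is to build a two-column edge list with base R and write it to a file. The following is only a sketch under a few assumptions: the regular expression treats letters, digits and underscores as part of a user name, the column names `Source` and `Target` are chosen because Gephi's spreadsheet importer looks for them, and `out_file` is a hypothetical output path.

```r
# Example data from the question (stringsAsFactors = FALSE keeps the text as character)
Data <- data.frame(
  X = c(1, 2),
  text = c("Hello @User1 #hashtag1, hello @User2 and @User3, #hashtag2",
           "Hello @User2 #hashtag3, hello @User1 and @User3, #hashtag4"),
  screenname = c("author1", "author2"),
  stringsAsFactors = FALSE
)

# One character vector of mentions per tweet (character(0) if a tweet has none)
mentions <- regmatches(Data$text, gregexpr("@[A-Za-z0-9_]+", Data$text))

# Repeat each author once per mention so that both columns line up
edges <- data.frame(
  Source = rep(Data$screenname, lengths(mentions)),
  Target = unlist(mentions),
  stringsAsFactors = FALSE
)

# Hypothetical output location; adjust the path as needed
out_file <- file.path(tempdir(), "edges.csv")
write.csv(edges, out_file, row.names = FALSE)
```

Each row of `edges` is one author-to-user link, so an author who mentions the same user in several tweets appears several times.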
I thought about this problem all night and made a few steps in the right direction (at least I hope so), but I still need your help.
As mentioned above, I have basically the following dataset:
Data <- data.frame(
X = c(1,2),
text = c("Hello @User1 #hashtag1, hello @User2 and @User3, #hashtag2", "Hello @User2 #hashtag3, hello @User1 and @User3, #hashtag4"),
screenname = c("author1", "author2")
)
I want to export it to the GEXF format to use it in Gephi.
There is an R package for exporting R data to GEXF, called rgexf (see https://bitbucket.org/gvegayon/rgexf/wiki/Installation). To use the package's write.gexf function, I need at least two things:
1) a matrix of all the nodes in the network (in my case authors, users and hashtags)
2) a matrix of all the edges between these nodes (i.e., the connections between authors and users as well as hashtags).
In my Twitter data, authors are never printed with "@", although they can also be "users". So I first have to add "@" to the author names to avoid duplicate nodes.
Data$screenname <- sub("^", "@", Data$screenname)
Then I need a matrix, consisting of all the nodes in my network (i.e., authors, users and hashtags). According to this example, the output should look like this:
people <- data.frame(matrix(c(1:9, '@author1', '@author2', '@user1', '@user2', '@user3', '#hashtag1', '#hashtag2', '#hashtag3', '#hashtag4'),ncol=2))
Then I need a matrix of all the edges between these nodes. According to this example, the output should look like this:
relations <- data.frame(matrix(c(1,3,1,4,1,5,1,6,1,7,2,4,2,3,2,5,2,8,2,9), ncol=2, byrow=TRUE))
Finally, I only have to put these two things together:
library(rgexf)  # provides write.gexf()
write.gexf(people, relations)
to get the following file:
<?xml version="1.0" encoding="UTF-8"?>
<gexf xmlns="http://www.gexf.net/1.2draft" xmlns:viz="http://www.gexf.net/1.1draft/viz" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.gexf.net/1.2draft http://www.gexf.net/1.2draft/gexf.xsd" version="1.2">
<meta lastmodifieddate="2015-02-04">
<creator>NodosChile</creator>
<description>A graph file writing in R using "rgexf"</description>
<keywords>gexf graph, NodosChile, R, rgexf</keywords>
</meta>
<graph mode="static" defaultedgetype="undirected">
<nodes>
<node id="1" label="@author1"/>
<node id="2" label="@author2"/>
<node id="3" label="@user1"/>
<node id="4" label="@user2"/>
<node id="5" label="@user3"/>
<node id="6" label="#hashtag1"/>
<node id="7" label="#hashtag2"/>
<node id="8" label="#hashtag3"/>
<node id="9" label="#hashtag4"/>
</nodes>
<edges>
<edge id="0" source="1" target="3" weight="1"/>
<edge id="1" source="1" target="4" weight="1"/>
<edge id="2" source="1" target="5" weight="1"/>
<edge id="3" source="1" target="6" weight="1"/>
<edge id="4" source="1" target="7" weight="1"/>
<edge id="5" source="2" target="4" weight="1"/>
<edge id="6" source="2" target="3" weight="1"/>
<edge id="7" source="2" target="5" weight="1"/>
<edge id="8" source="2" target="8" weight="1"/>
<edge id="9" source="2" target="9" weight="1"/>
</edges>
</graph>
</gexf>
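The file above declares defaultedgetype="undirected", whereas the links described in the question run from author to user. To see what a directed version of the same structure could look like, the XML can also be written by hand. This is a minimal base R sketch, not the output of rgexf: it writes only the core elements (no meta block), and `xml_escape` and `out_file` are hypothetical names introduced here.

```r
# Node and edge tables with the same IDs as in the example above
nodes <- data.frame(
  id = 1:9,
  label = c("@author1", "@author2", "@user1", "@user2", "@user3",
            "#hashtag1", "#hashtag2", "#hashtag3", "#hashtag4"),
  stringsAsFactors = FALSE
)
edges <- data.frame(
  source = as.integer(c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2)),
  target = as.integer(c(3, 4, 5, 6, 7, 4, 3, 5, 8, 9))
)

# Escape the characters that must not appear raw in an XML attribute value
xml_escape <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  x <- gsub(">", "&gt;", x, fixed = TRUE)
  gsub("\"", "&quot;", x, fixed = TRUE)
}

# sprintf() is vectorised, so each call yields one line per node or edge
gexf_lines <- c(
  '<?xml version="1.0" encoding="UTF-8"?>',
  '<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">',
  '<graph mode="static" defaultedgetype="directed">',
  '<nodes>',
  sprintf('<node id="%d" label="%s"/>', nodes$id, xml_escape(nodes$label)),
  '</nodes>',
  '<edges>',
  sprintf('<edge id="%d" source="%d" target="%d"/>',
          seq_len(nrow(edges)) - 1L, edges$source, edges$target),
  '</edges>',
  '</graph>',
  '</gexf>'
)

# Hypothetical output location; adjust the path as needed
out_file <- file.path(tempdir(), "tweets.gexf")
writeLines(gexf_lines, out_file)
```

Writing the file by hand keeps the example free of package dependencies, at the cost of skipping everything a dedicated writer would add, such as metadata and attributes.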
But how can I automatically extract the nodes and the relations between these nodes (the edges) from the example above and write them to two matrices?
Does anybody know how to solve my problem?
I tried to figure out how to extract the nodes from my example (i.e., the authors, users and hashtags) and save them to a data.frame (I am sure there is a shorter and more elegant way to do it!):
library(stringi)  # provides stri_extract_all()
# extract Users and Hashtags from text, Authors from screenname (and add @ to Author names)
# Twitter handles and hashtags may contain underscores, so "_" is part of the character class
Users <- stri_extract_all(Data$text, regex = "@[A-Za-z0-9_]+")
Hash <- stri_extract_all(Data$text, regex = "#[A-Za-z0-9_]+")
Data$screenname <- sub("^", "@", Data$screenname)
Authors <- stri_extract_all(Data$screenname, regex = "@[A-Za-z0-9_]+")
# delete NAs
Users <- Users[!is.na(Users)]
Hash <- Hash[!is.na(Hash)]
# converting lists to vectors
Users <- unlist(Users)
Hash <- unlist(Hash)
Authors <- unlist(Authors)
# merging the vectors to a single vector and deleting the duplicates
nodes <- unique(c(Authors, Users, Hash))
# saving the vectors in a data.frame and giving each node a unique ID
# (building the data.frame directly keeps ID numeric; going through matrix() would turn it into text)
nodes <- data.frame(ID = seq_along(nodes), label = nodes, stringsAsFactors = FALSE)
But how can I build a data.frame for the edges?
There must be a way to write a function that checks, row by row, whether an author has mentioned a user and/or a hashtag, and writes the result into a new data.frame using the IDs of the authors, users and hashtags. Every connection should be stored in two columns (source and target).
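One way this could be approached without an explicit loop is to extract all mentions and hashtags per tweet, repeat each author once per extracted item, and translate the labels into IDs with match(). The following is a sketch using base R only, not a tested solution for real Twitter data. It assumes that user names and hashtags consist of letters, digits and underscores, and it treats labels as case-sensitive, so "@User1" and "@user1" would become two different nodes.

```r
# Example data from the question
Data <- data.frame(
  X = c(1, 2),
  text = c("Hello @User1 #hashtag1, hello @User2 and @User3, #hashtag2",
           "Hello @User2 #hashtag3, hello @User1 and @User3, #hashtag4"),
  screenname = c("author1", "author2"),
  stringsAsFactors = FALSE
)
authors <- paste0("@", Data$screenname)

# Helper: all matches of a pattern, one character vector per tweet
extract_all <- function(pattern) regmatches(Data$text, gregexpr(pattern, Data$text))
users   <- extract_all("@[A-Za-z0-9_]+")
hash    <- extract_all("#[A-Za-z0-9_]+")
targets <- extract_all("[@#][A-Za-z0-9_]+")  # users and hashtags in order of appearance

# Node table: authors first, then users, then hashtags, each with a unique ID
labels <- unique(c(authors, unlist(users), unlist(hash)))
nodes <- data.frame(ID = seq_along(labels), label = labels, stringsAsFactors = FALSE)

# Edge table: repeat each author once per target, then look up the IDs
edges <- data.frame(
  source = match(rep(authors, lengths(targets)), nodes$label),
  target = match(unlist(targets), nodes$label)
)
```

Here `nodes` and `edges` have the same two-column shape as the `people` and `relations` tables built by hand earlier, so they could be tried as the inputs to write.gexf in their place.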
Source: https://stackoverflow.com/questions/28302705/exporting-twitter-data-to-gephi-using-r