问题
I have a data frame that looks like:
id fromuserid touserid from_country to_country length
1 1 54525953 47195889 US US 2
2 2 54525953 54361607 US US 1
3 3 54525953 53571081 US US 2
4 4 41943048 55379244 US US 1
5 5 47185938 53140304 US PR 1
6 6 47185938 54121387 US US 1
7 7 54525974 50928645 GB GB 1
8 8 54525974 53495302 GB GB 1
9 9 51380247 45214216 SG SG 2
10 10 51380247 43972484 SG US 2
Each row describes a number of messages (length) sent from one user to another user.
What I would like to do is create a visualization (via a chord diagram in D3) of the messages sent between each country.
There are almost 200 countries. I use the function dcast as follows:
countries <- dcast(chats,from_country ~ to_country,drop=FALSE,fill=0)
This worked before for me when I had a smaller data set and fewer variables, but this data set is over 3M rows, and not easy to debug, so to speak.
At any rate, what I am getting now is a matrix that is not square, and I can't figure out why not. What I am expecting to get is essentially a matrix where the (i,j)th
cell represents the messages sent from country i
to country j
. What I end up with is something very close to this, but with some rows and columns obviously missing, which is easy to spot because US->US messages show up shifted by one row or column.
So here's my question. Is there anything I'm doing that is obviously wrong? If not, is there something "strange" I should be looking for in the data set to sort this out?
回答1:
Be sure that your "from_country" and "to_country" variables are factors, and that they share the same levels. Using the example data you shared:
chats$from_country <- factor(chats$from_country,
levels = unique(c(chats$from_country,
chats$to_country)))
chats$to_country <- factor(chats$to_country,
levels = levels(chats$from_country))
dcast(chats,from_country ~ to_country, drop = FALSE, fill = 0)
# Using length as value column: use value.var to override.
# Aggregation function missing: defaulting to length
# from_country US GB SG PR
# 1 US 5 0 0 1
# 2 GB 0 2 0 0
# 3 SG 1 0 1 0
# 4 PR 0 0 0 0
If your "from_country" and "to_country" variables are already factors, but not with the same levels, you can do something like this for the first step:
chats$from_country <- factor(chats$from_country,
levels = unique(c(levels(chats$from_country),
levels(chats$to_country)))
Why is this necessary? If they are already factors, then c(chats$from_country, chats$to_country)
will coerce the factors to numeric, and since that doesn't match with any of the character values of the factors, it will result in <NA>
.
来源:https://stackoverflow.com/questions/15392550/not-sure-why-dcast-this-data-set-results-in-dropping-variables