问题
I'm a rookie in R and currently working with collaboration data in the form of an edge list with 32 columns and around 200.000 rows. I want to create a (co-)occurrence matrix based on the interaction between countries. However, I want to count the number of interactions by the total number of an object.
Basic Example of Aspired Outcome
If in one row "England" occurs three times and "China" only one time, the result should be the following matrix.
England China
England 3 3
China 3 1
Reproducible example
df <- data.frame(ID = c(1,2,3,4),
V1 = c("England", "England", "China", "England"),
V2 = c("Greece", "England", "Greece", "England"),
V32 = c("USA", "China", "Greece", "England"))
Accordingly, an example data frame currently looks like this:
ID V1 V2 ... V32
1 England Greece USA
2 England England China
3 China Greece Greece
4 England England England
.
.
.
Aspired outcome
I want to count (co-)occurrences row-wise and independent of order to get a (co-)occurrence matrix that accounts for low frequencies of edge loops (e.g. England-England), which leads to the following result:
China England Greece USA
China 2 2 2 0
England 2 6 1 1
Greece 2 1 3 1
USA 0 1 1 1
What has been tried so far
I've used igraph
to get an adjacency matrix with co-occurrences. However, it calculates - as supposed to - not more than two interactions of the same two objects, leaving me with values far below actual frequency of objects by row/publication in some cases.
df <- data.frame(ID = c(1,2,3,4),
V1 = c("England", "England", "China", "England"),
V2 = c("Greece", "England", "Greece", "England"),
V32 = c("USA", "China", "Greece", "England"))
# remove ID column
df[1] <- list(NULL)
# calculate co-occurrences and return as dataframe
library(igraph)
library(Matrix)
countrydf <- graph.data.frame(df)
countrydf2 <- as_adjacency_matrix(countrydf, type = "both", edges = FALSE)
countrydf3 <- as.data.frame(as.matrix(forceSymmetric(countrydf2)))
China England Greece USA
China 0 0 1 0
England 0 2 1 0
Greece 1 1 0 0
USA 0 0 0 0
I assume there has to be an easy solution using base
and/or dplyr
and /or table
and/or reshape2
similar to [1], [2], [3], [4] or [5] but nothing has done the trick so far and I was not able to adjust the code to my needs. I've also tried to use [6] as a basis, however, the same issue applies here, too.
library(tidry)
library(dplyr)
library(stringr)
# collapse observations into one column
df2 <- df %>% unite(concat, V1:V32, sep = ",")
# calculate weights
df3 <- df2$concat %>%
str_split(",") %>%
lapply(function(x){
expand.grid(x,x,x,x, w = length(x), stringsAsFactors = FALSE)
}) %>%
bind_rows
df4 <- apply(df3[, -5], 1, sort) %>%
t %>%
data.frame(stringsAsFactors = FALSE) %>%
mutate(w = df3$w)
I'd be glad if someone could point me in the right direction.
回答1:
There may be better ways to do this, but try:
library(tidyverse)
df1 <- df %>%
pivot_longer(-ID, names_to = "Category", values_to = "Country") %>%
xtabs(~ID + Country, data = ., sparse = FALSE) %>%
crossprod(., .)
df_diag <- df %>%
pivot_longer(-ID, names_to = "Category", values_to = "Country") %>%
mutate(Country2 = Country) %>%
xtabs(~Country + Country2, data = ., sparse = FALSE) %>%
diag()
diag(df1) <- df_diag
df1
Country China England Greece USA
China 2 2 2 0
England 2 6 1 1
Greece 2 1 3 1
USA 0 1 1 1
回答2:
Here is a way using dplyr and tidyr packages, the whole idea lies in creating a dataframe with row-wise occurrence of each country then joining it on itself.
library(dplyr)
# Create dataframe sammple
df <- data.frame(ID = c(1,2,3,4),
V1 = c("England", "England", "China", "England"),
V2 = c("Greece", "England", "Greece", "England"),
V32 = c("USA", "China", "Greece", "England"),
stringsAsFactors = FALSE)
# Get the occurance of each country in every row.
row_occurance <-
df %>%
tidyr::gather(key = "identifier", value = "country", -ID) %>%
group_by(ID, country) %>%
count()
row_occurance %>%
# Join row_occurance on itself to simulate the matrix
left_join(row_occurance, by = "ID") %>%
# Get the highest occurance row wise, this to handle when country
# name is repeated within same row
mutate(Occurance = pmax(n.x, n.y)) %>%
# Group by 2 countries
group_by(country.x, country.y) %>%
# Sum the occurance of 2 countries together
summarise(Occurance = sum(Occurance)) %>%
# Spread the data to make it in matrix format
tidyr::spread(key = "country.y", value = "Occurance", fill = 0)
# # A tibble: 4 x 5
# # Groups: country.x [4]
# country.x China England Greece USA
# <chr> <dbl> <dbl> <dbl> <dbl>
# China 2 2 2 0
# England 2 6 1 1
# Greece 2 1 3 1
# USA 0 1 1 1
回答3:
An option using base::table
:
df <- data.frame(ID = c(1,2,3,4),
V1 = c("England", "England", "China", "England"),
V2 = c("Greece", "England", "Greece", "England"),
V3 = c("USA", "China", "Greece", "England"))
#get paired combi and remove those from same country
pairs <- as.data.frame(do.call(rbind,
by(df, df$ID, function(x) t(combn(as.character(x[-1L]), 2L)))))
pairs <- pairs[pairs$V1!=pairs$V2, ]
#repeat data frame with columns swap so that
#upper and lower tri have same numbers and all countries are shown
pairs <- rbind(pairs, data.frame(V1=pairs$V2, V2=pairs$V1))
#tabulate pairs
tab <- table(pairs)
#set diagonals to be the count of countries
cnt <- c(table(unlist(df[-1L])))
diag(tab) <- cnt[names(diag(tab))]
tab
output:
V2
V1 China England Greece USA
China 2 2 2 0
England 2 6 1 1
Greece 2 1 3 1
USA 0 1 1 1
来源:https://stackoverflow.com/questions/59632794/how-to-calculate-a-co-occurrence-matrix-from-a-data-frame-with-several-columns