How to create a co-occurrence matrix calculated from combinations by ID/row in R?

♀尐吖头ヾ submitted on 2021-02-05 07:00:14



Thanks to @jazzurro for his answer. It made me realize that the duplicates may just complicate things. I hope that keeping only unique values per row simplifies the task.*

df <- data.frame(ID = c(1,2,3,4,5), 
                  CTR1 = c("England", "England", "England", "China", "Sweden"),
                  CTR2 = c("China", "China", "China", "England", NA),
                  CTR3 = c("USA", "USA", "USA", "USA", NA),
                  CTR4 = c(NA, NA, NA, NA, NA),
                  CTR5 = c(NA, NA, NA, NA, NA),
                  CTR6 = c(NA, NA, NA, NA, NA))

1  England China   USA
2  England China   USA
3  England China   USA
4  China   England USA
5  Sweden

It is still the goal to create a co-occurrence matrix (now) based on the following four conditions:

  1. Single observations without additional observations by ID/row are not considered, i.e. a row containing only a single country, once, is counted as 0.

  2. A combination/co-occurrence should be counted as 1.

  3. Being in a combination results in counting as a self-combination as well (USA-USA), i.e. a value of 1 is assigned.

  4. There is no value over 1 assigned to a combination by row/ID.

Aspired Result

         China   England   USA   Sweden
China    4       4         4     0
England  4       4         4     0
USA      4       4         4     0
Sweden   0       0         0     0

*I've used the code from here to remove all non-unique observations.
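The linked code is not reproduced here. Purely as an illustration, a per-row de-duplication could look roughly like the following base R sketch; the toy data and the names `ctr`, `dedup` and `out` are made up for the example and are not the code referred to above.

```r
# Toy data (hypothetical): each row may repeat a country.
df <- data.frame(ID   = 1:2,
                 CTR1 = c("England", "China"),
                 CTR2 = c("England", "England"),
                 CTR3 = c("USA", "China"),
                 stringsAsFactors = FALSE)

ctr <- as.matrix(df[-1])

# Per row: keep the distinct non-missing countries in order of first
# appearance, then pad with NA so every row keeps the same length.
dedup <- t(apply(ctr, 1, function(x) {
  u <- unique(x[!is.na(x)])
  c(u, rep(NA_character_, length(x) - length(u)))
}))
colnames(dedup) <- colnames(ctr)

out <- data.frame(ID = df$ID, dedup, stringsAsFactors = FALSE)
out
```

Row 1 becomes England, USA, NA and row 2 becomes China, England, NA.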

Original Post

Assume I have a data set with a small two-digit number of columns (some NA/empty) and more than 100,000 rows, represented by the following example data frame:

df <- data.frame(ID = c(1,2,3,4,5), 
                  CTR1 = c("England", "England", "England", "China", "England"),
                  CTR2 = c("England", "China", "China", "England", NA),
                  CTR3 = c("England", "China", "China", "England", NA),
                  CTR4 = c("China", "USA", "USA", "China", NA),
                  CTR5 = c("USA", "England", "USA", "USA", NA),
                  CTR6 = c("England", "China", "USA", "England", NA))


ID   CTR1    CTR2    CTR3    CTR4   CTR5    CTR6         
1    England England England China  USA     England 
2    England China   China   USA    England China
3    England China   China   USA    USA     USA  
4    China   England England China  USA     England
5    England 

I want to count the co-occurrences by ID/row and get a co-occurrence matrix in which each row contributes to a given combination at most once. That is, a combination is assigned a value of 1 if it occurs in a row, regardless of in-row frequency and order, and a value of 0 if there is no co-occurrence/combination in that row:

1 England-England-England => 1
2 England-England => 1
3 England-China => 1
4 England- => 0

Another important aspect concerns observations that appear only once in a row but in combination with others, e.g. USA in row 1. They should get a value of 1 for their own co-occurrence (they are part of a combination, even though not with themselves), so that the combination USA-USA is also assigned a value of 1.

1    England England England China  USA  England 
USA-USA => 1
China-China => 1
USA-China => 1
England-England => 1
England-USA => 1
England-China => 1

Because the count for a combination must not exceed 1 per row/ID, this row results in:

         China   England   USA
China    1       1         1
England  1       1         1
USA      1       1         1

For the example data frame, this should lead to the following result, where a value of 4 is assigned to each combination because each combination has occurred in four rows and each string is part of a combination in the original data frame:

         China   England   USA
China    4       4         4
England  4       4         4
USA      4       4         4

So there are five conditions for counting:

  1. Single observations without additional observations by ID/row are not considered, i.e. a row containing only a single country, once, is not counted.
  2. A combination should be counted as 1.
  3. Observations occurring more than once do not contribute to a higher value for the interaction, i.e. several occurrences of the same country do not matter.
  4. Being in a combination (even in the case the same country does not appear twice in a row) results in counting as a self-combination, i.e. a value of 1 is assigned.
  5. There is no value over 1 assigned to a combination by row/ID.
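Read literally, these conditions can be sketched in base R as follows. This is only an illustration under my own reading of the rules, not one of the answers further down; `row_countries`, `sets`, `inc` and `cooc` are names made up for the sketch.

```r
# Example data from the original post.
df <- data.frame(
  ID   = 1:5,
  CTR1 = c("England", "England", "England", "China", "England"),
  CTR2 = c("England", "China", "China", "England", NA),
  CTR3 = c("England", "China", "China", "England", NA),
  CTR4 = c("China", "USA", "USA", "China", NA),
  CTR5 = c("USA", "England", "USA", "USA", NA),
  CTR6 = c("England", "China", "USA", "England", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: row index -> distinct, non-missing countries of that row
# (conditions 3 and 5: repeats within a row do not matter).
row_countries <- function(i) {
  x <- unlist(df[i, -1], use.names = FALSE)
  unique(x[!is.na(x) & nzchar(x)])
}
sets <- lapply(seq_len(nrow(df)), row_countries)

# Condition 1: rows with fewer than two distinct countries contribute nothing.
sets <- sets[lengths(sets) >= 2]

# 0/1 incidence matrix: one row per remaining ID, one column per country.
countries <- sort(unique(unlist(sets)))
inc <- t(vapply(sets, function(s) as.integer(countries %in% s),
                integer(length(countries))))
colnames(inc) <- countries

# Conditions 2 and 4: each remaining row adds 1 to every pair of its
# countries, including each country paired with itself.
cooc <- crossprod(inc)
cooc
#         China England USA
# China       4       4   4
# England     4       4   4
# USA         4       4   4
```

Dropping the single-country rows before building the matrix means the diagonal needs no correction afterwards.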

I've tried to implement this with dplyr, data.table, base aggregate and plyr, adapting code from [1], [2], [3], [4], [5] and [6]. However, since I don't care about order within a row and also don't want to sum up all combinations within a row, I haven't obtained the desired result so far.

I'm a novice in R. Any help is very much appreciated.



I modified your data so that it better represents your actual situation.

#   ID    CTR1    CTR2    CTR3  CTR4    CTR5    CTR6
#1:  1 England England England China     USA England
#2:  2 England   China   China   USA England   China
#3:  3 England   China   China   USA     USA     USA
#4:  4   China England England China     USA England
#5:  5  Sweden    <NA>    <NA>  <NA>            <NA>

df <- structure(list(ID = c(1, 2, 3, 4, 5), CTR1 = c("England", "England", 
"England", "China", "Sweden"), CTR2 = c("England", "China", "China", 
"England", NA), CTR3 = c("England", "China", "China", "England", 
NA), CTR4 = c("China", "USA", "USA", "China", NA), CTR5 = c("USA", 
"England", "USA", "USA", ""), CTR6 = c("England", "China", "USA", 
"England", NA)), class = c("data.table", "data.frame"), row.names = c(NA, 
-5L))


After seeing the OP's previous question, I got a clear picture in my mind. I think this is what you want, Seb.

library(data.table)

# Transform the data to long format. Remove rows whose value is an empty string ("") or NA.

melt(setDT(df), id.vars = "ID", measure = patterns("^CTR"))[nchar(value) > 0 & complete.cases(value)] -> foo

# Get distinct value (country) in each ID group (each row)
unique(foo, by = c("ID", "value")) -> foo2

# Seeing this question, you want to create a matrix with crossprod().

crossprod(table(foo2[, c(1,3)])) -> mymat

# Finally, you need to change diagonal values. If a value is equal to one,
# change it to zero. Otherwise, keep the original value.

diag(mymat) <- ifelse(diag(mymat) <= 1, 0, diag(mymat))

#value     China England Sweden USA
#China       4       4      0   4
#England     4       4      0   4
#Sweden      0       0      0   0
#USA         4       4      0   4
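To see why crossprod() on an ID-by-country table gives co-occurrence counts, here is a minimal base R sketch on a hand-made 0/1 matrix. The toy countries "A", "B", "C" and the names `inc` and `cooc` are made up for the illustration.

```r
# Toy 0/1 incidence matrix: rows are IDs, columns are countries.
inc <- matrix(c(1, 1, 0,
                1, 0, 1,
                1, 1, 1),
              nrow = 3, byrow = TRUE,
              dimnames = list(NULL, c("A", "B", "C")))

# crossprod(inc) is t(inc) %*% inc: cell [i, j] counts the rows in which
# both country i and country j occur; the diagonal counts rows per country.
cooc <- crossprod(inc)

cooc["A", "B"]  # rows 1 and 3 contain both A and B, so 2
cooc["B", "C"]  # only row 3 contains both B and C, so 1
diag(cooc)      # A occurs in 3 rows, B in 2, C in 2
```

Because each cell counts rows rather than occurrences, the values must be de-duplicated per ID first (as unique() does above); otherwise repeats within a row would inflate the counts.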


Here is an option using base::table:

#get paired combi and remove those from same country
pairsDF <- as.data.frame(do.call(rbind, 
    by(df, df$ID, function(x) t(combn(unlist(x[-1L]), 2L)))))

#tabulate pairs
duppairs <- rbind(pairsDF, data.frame(V1=pairsDF$V2, V2=pairsDF$V1))
tab <- table(duppairs, useNA="no")

#set diagonals to be the count of countries if count is at least 2
cnt <- c(table(unlist(df[-1L])))
cnt[cnt==1L] <- 0L
diag(tab) <- cnt[names(diag(tab))]


V1        China England Sweden USA
  China       4       4      0   4
  England     4       4      0   4
  Sweden      0       0      0   0
  USA         4       4      0   4


df <- data.frame(ID = c(1,2,3,4,5), 
    CTR1 = c("England", "England", "England", "China", "Sweden"),
    CTR2 = c("China", "China", "China", "England", NA),
    CTR3 = c("USA", "USA", "USA", "USA", NA),
    CTR4 = c(NA, NA, NA, NA, NA),
    CTR5 = c(NA, NA, NA, NA, NA),
    CTR6 = c(NA, NA, NA, NA, NA))
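The pairing step of the base::table approach can be looked at in isolation for a single row. A small sketch, with `x`, `pairs`, `both` and `tab` as illustrative names only:

```r
# One row's countries, already de-duplicated (as in row 1 of the data above).
x <- c("England", "China", "USA")

# combn() returns one pair per column; t() turns each pair into a row.
pairs <- t(combn(x, 2L))
pairs

# Appending the swapped pairs makes the resulting table symmetric.
both <- rbind(pairs, pairs[, 2:1])
tab <- table(V1 = both[, 1], V2 = both[, 2])
tab
```

For this row the off-diagonal cells are 1 and the diagonal is 0, which is why the diagonal has to be filled in separately afterwards.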

