问题
Update
Thanks to @jazzurro for his anwer. It made me realize that the duplicates may just complicate things. I hope by keeping only unique values/row simplifies the task.*
df <- data.frame(ID = c(1,2,3,4,5),
CTR1 = c("England", "England", "England", "China", "Sweden"),
CTR2 = c("England", "China", "China", "England", NA),
CTR3 = c("USA", "USA", "USA", "USA", NA),
CTR4 = c(NA, NA, NA, NA, NA),
CTR5 = c(NA, NA, NA, NA, NA),
CTR6 = c(NA, NA, NA, NA, NA))
ID CTR1 CTR2 CTR3 CTR4 CTR5 CTR6
1 England China USA
2 England China USA
3 England China USA
4 China England USA
5 Sweden
It is still the goal to create a co-occurrence matrix (now) based on the following four conditions:
Single observations without additional observations by ID/row are not considered, i.e. a row with only a single country once is counted as 0.
A combination/co-occurrence should be counted as 1.
Being in a combination results in counting as a self-combination as well (USA-USA), i.e. a value of 1 is assigned.
There is no value over 1 assigned to a combination by row/ID.
Aspired Result
China England USA Sweden
China 4 4 4 0
England 4 4 4 0
USA 4 4 4 0
Sweden 0 0 0 0
*I've used the code from here to remove all non-unique observations.
Original Post
Assume I have a data set with a low two digit number of columns (some NA/empty) and more than 100.000 rows, represented by the following example dataframe
df <- data.frame(ID = c(1,2,3,4,5),
CTR1 = c("England", "England", "England", "China", "England"),
CTR2 = c("England", "China", "China", "England", NA),
CTR3 = c("England", "China", "China", "England", NA),
CTR4 = c("China", "USA", "USA", "China", NA),
CTR5 = c("USA", "England", "USA", "USA", NA),
CTR6 = c("England", "China", "USA", "England", NA))
df
ID CTR1 CTR2 CTR3 CTR4 CTR5 CTR6
1 England England England China USA England
2 England China China USA England China
3 England China China USA USA USA
4 China England England China USA England
5 England
and I want to count the co-occurrences by ID/row to get a co-occurrence matrix that sums up the co-occurence by ID/row only once, meaning that no value over 1 will be allocated to a combination (i.e. assign a value of 1 for the existence of a co-occurrence independent of in-row frequencies and order, assign a value of 0 for no co-occurrence/combination by ID/row),
1 England-England-England => 1
2 England-England => 1
3 England-China => 1
4 England- => 0
Another important aspects regards the counting of observations that appear once in a row but in combination with others, e.g. USA in row 1. They should get a value of 1 for their own co-occurrence (as they are in a combination even though not with themselves) so that the combination USA-USA also gets a value of 1 assigned.
1 England England England China USA England
USA-USA => 1
China-China => 1
USA-China => 1
England-England => 1
England-USA => 1
England-China => 1
Due to the fact that row count should not >1 for a combination by row/ID, this results to:
China England USA
China 1 1 1
England 1 1 1
USA 1 1 1
This should lead to the following result based on the example dataframe, where a value of 4 is assigned to each combination based on the fact that each combination has occured at least in four rows and each string is part of a combination of the original dataframe:
China England USA
China 4 4 4
England 4 4 4
USA 4 4 4
So there are five conditions for counting:
- Single observations without additional observations by ID/row are not considered, i.e. a row with only a single country once is not counted.
- A combination should be counted as 1.
- Observations occuring more than once do not contribute to a higher value for the interaction, i.e. several occurrences of the same country do not matter.
- Being in a combination (even in the case the same country does not appear twice in a row) results in counting as a self-combination, i.e. a value of 1 is assigned.
- There is no value over 1 assigned to a combination by row/ID.
I've tried to implement this by using dplyr
, data.table
, base aggregate
or plyr
adjusting code from [1], [2], [3], [4], [5] and [6] but as I don't care about order within a row but I also don't want to sum up all combinations within a row, I haven't got the aspired result so far.
I'm a novice in R. Any help is very much appreciated.
回答1:
DATA
I modified your data so that data can represent your actual situation.
# ID CTR1 CTR2 CTR3 CTR4 CTR5 CTR6
#1: 1 England England England China USA England
#2: 2 England China China USA England China
#3: 3 England China China USA USA USA
#4: 4 China England England China USA England
#5: 5 Sweden <NA> <NA> <NA> <NA>
df <- structure(list(ID = c(1, 2, 3, 4, 5), CTR1 = c("England", "England",
"England", "China", "Sweden"), CTR2 = c("England", "China", "China",
"England", NA), CTR3 = c("England", "China", "China", "England",
NA), CTR4 = c("China", "USA", "USA", "China", NA), CTR5 = c("USA",
"England", "USA", "USA", ""), CTR6 = c("England", "China", "USA",
"England", NA)), class = c("data.table", "data.frame"), row.names = c(NA,
-5L))
UPDATE
After seeing the OP's previous question, I got a clear picture in my mind. I think this is what you want, Seb.
# Transform the data to long-format data. Remove rows that have zero character (i.e, "") or NA.
melt(setDT(df), id.vars = "ID", measure = patterns("^CTR"))[nchar(value) > 0 & complete.cases(value)] -> foo
# Get distinct value (country) in each ID group (each row)
unique(foo, by = c("ID", "value")) -> foo2
# https://stackoverflow.com/questions/13281303/creating-co-occurrence-matrix
# Seeing this question, you want to create a matrix with crossprod().
crossprod(table(foo2[, c(1,3)])) -> mymat
# Finally, you need to change diagonal values. If a value is equal to one,
# change it to zero. Otherwise, keep the original value.
diag(mymat) <- ifelse(diag(mymat) <= 1, 0, mymat)
#value
#value China England Sweden USA
#China 4 4 0 4
#England 4 4 0 4
#Sweden 0 0 0 0
#USA 4 4 0 4
回答2:
Here is an option using base::table
:
#get paired combi and remove those from same country
pairsDF <- as.data.frame(do.call(rbind,
by(df, df$ID, function(x) t(combn(unlist(x[-1L]), 2L)))))
#tabulate pairs
duppairs <- rbind(pairsDF, data.frame(V1=pairsDF$V2, V2=pairsDF$V1))
tab <- table(duppairs, useNA="no")
#set diagonals to be the count of countries if count is at least 2
cnt <- c(table(unlist(df[-1L])))
cnt[cnt==1L] <- 0L
diag(tab) <- cnt[names(diag(tab))]
output:
V2
V1 China England Sweden USA
China 4 4 0 4
England 4 4 0 4
Sweden 0 0 0 0
USA 4 4 0 4
data:
df <- data.frame(ID = c(1,2,3,4,5),
CTR1 = c("England", "England", "England", "China", "Sweden"),
CTR2 = c("China", "China", "China", "England", NA),
CTR3 = c("USA", "USA", "USA", "USA", NA),
CTR4 = c(NA, NA, NA, NA, NA),
CTR5 = c(NA, NA, NA, NA, NA),
CTR6 = c(NA, NA, NA, NA, NA))
来源:https://stackoverflow.com/questions/59662687/how-to-create-a-co-occurrence-matrix-calculated-from-combinations-by-id-row-in-r