Question
I have two data sets extracted from a data frame called babies2009 (three columns: count, name, gender).
One is girls2009, containing all the girls, and the other is boys2009. I want to find out which names are shared between boys and girls.
I tried this
common.names = (boys2009$name %in% girls2009$name)
When I try
babies2009[common.names, ] [1:10, ]
all I get is the girl names not the common names.
I have confirmed that the two data sets do contain boys and girls respectively by looking at the first 10 rows of each:
boys2009 [1:10,]
girls2009 [1:10,]
How else can I compare the two data sets and determine which values they share? Thanks.
Answer 1:
common.names = (boys2009$name %in% girls2009$name)
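gives you a logical vector of length length(boys2009$name). When you use it to select rows from the much longer data.frame, as in babies2009[common.names, ] [1:10, ], the index is recycled and its TRUE/FALSE values no longer line up with the rows they were computed from, so you wind up with nonsense.
Solution: use that logical vector on the data.frame it was computed from: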
# Toy data; the column is called name to match the question
boys2009 <- data.frame( name=c("Billy","Bob"),data=runif(2), gender="M" , stringsAsFactors=FALSE)
girls2009 <- data.frame( name=c("Billy","Mae","Sue"),data=runif(3), gender="F" , stringsAsFactors=FALSE)
babies2009 <- rbind(boys2009,girls2009)
common.names <- (boys2009$name %in% girls2009$name)
> boys2009[common.names, ]$name
[1] "Billy"
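If the goal is to pull the matching rows out of the combined babies2009 data frame, the logical index has to be computed against babies2009 itself. The following is a minimal sketch on made-up toy data; the count values and the helper names shared, in.both and common.rows are assumptions for illustration only.

```r
# Toy data; the counts are invented for illustration
boys2009 <- data.frame(name = c("Billy", "Bob"), count = c(10, 20),
                       gender = "M", stringsAsFactors = FALSE)
girls2009 <- data.frame(name = c("Billy", "Mae", "Sue"), count = c(1, 30, 40),
                        gender = "F", stringsAsFactors = FALSE)
babies2009 <- rbind(boys2009, girls2009)

# Names that occur in both subsets
shared <- intersect(boys2009$name, girls2009$name)

# Index computed on babies2009, so it has one entry per row being selected
in.both <- babies2009$name %in% shared
common.rows <- babies2009[in.both, ]
print(common.rows)
```

Because in.both has the same length as nrow(babies2009), no recycling takes place and each TRUE refers to the row it was computed from.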
Answer 2:
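Since you asked for similar names and did not say the matches have to be exact, you could consider approximate matching with agrep:
sapply(boys2009$name , agrep, girls2009$name, max.distance = 0.1)
This returns, for each boy name, the indices of the girl names that approximately match it. You can adjust the max.distance argument to suit your needs.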
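To see the matched names rather than their positions, agrep can be called with value = TRUE. This is a small sketch on invented names; the vectors boys and girls and the result name close.matches are assumptions, not part of the original data.

```r
# Invented example names
boys  <- c("Jordan", "Alex", "Bob")
girls <- c("Jordyn", "Alexa", "Sue")

# For each boy name, collect the girl names that approximately contain it.
# value = TRUE returns the matching strings instead of their indices;
# simplify = FALSE keeps the result as a named list, one entry per boy name.
close.matches <- sapply(
  boys,
  function(b) agrep(b, girls, max.distance = 0.1, value = TRUE),
  simplify = FALSE
)
print(close.matches)
```

Note that agrep looks for approximate substring matches, so a short name can match inside a longer one; how loose that is depends on max.distance.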
Answer 3:
How about using set functions:
list(
`only boys` = setdiff(boys2009$name, girls2009$name),
`common` = intersect(boys2009$name, girls2009$name),
`only girls` = setdiff(girls2009$name, boys2009$name)
)
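If the counts for the shared names are also of interest, an inner join with merge is one possible alternative to the set functions. This is only a sketch on made-up data; the count values and the suffixes .boy and .girl are illustrative assumptions.

```r
# Toy data; the counts are invented for illustration
boys2009 <- data.frame(name = c("Billy", "Bob"), count = c(10, 20),
                       stringsAsFactors = FALSE)
girls2009 <- data.frame(name = c("Billy", "Mae", "Sue"), count = c(1, 30, 40),
                        stringsAsFactors = FALSE)

# Inner join on name: keeps only names present in both data frames and
# puts the two count columns side by side
both <- merge(boys2009, girls2009, by = "name", suffixes = c(".boy", ".girl"))
print(both)
```

The result has one row per shared name, with columns name, count.boy and count.girl.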
Source: https://stackoverflow.com/questions/7459138/comparing-2-datasets-in-r