I have a big dataset df (354903 rows) with two columns named df$CompleteName and df$CompleteName.1:
> head(df)
       CompleteName       CompleteName.1
1   Lefebvre Arnaud Lefebvre Schuhl Anne
1.1 Lefebvre Arnaud              Abe Lyu
1.2 Lefebvre Arnaud              Abe Lyu
1.3 Lefebvre Arnaud       Louvet Nicolas
1.4 Lefebvre Arnaud   Muller Jean Michel
1.5 Lefebvre Arnaud  De Dinechin Florent
I am trying to create labels to see whether the names in the two columns are the same or not. When I try a small subset it works [1 if they are the same, 0 if not]:
> match(df$CompleteName[1], df$CompleteName.1[1], nomatch = 0)
[1] 0
> match(df$CompleteName[1:10], df$CompleteName.1[1:10], nomatch = 0)
[1] 0 0 0 0 0 0 0 0 0 0
But as soon as I pass the complete columns, it gives me completely different values, which make no sense to me:
> match(df$CompleteName, df$CompleteName.1, nomatch = 0)
[1] 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101
[23] 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101
[45] 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101
Should I use sapply? I could not figure it out; I tried this and got an error:
sapply(df, function(x) match(x$CompleteName, x$CompleteName.1, nomatch = 0))
Please help!!!
From the man page of match,
‘match’ returns a vector of the positions of (first) matches of its first argument in its second.
So your data seem to indicate that the first match of "Lefebvre Arnaud" (the first position in the first argument) is in row 101. I believe what you intended to do is a simple comparison, so that's just the equality operator ==.
Some sample data:
> a <- rep("Lefebvre Arnaud", 6)
> b <- c("Abe Lyu", "Abe Lyu", "Lefebvre Arnaud", rep("De Dinechin Florent", 3))
> x <- data.frame(a, b, stringsAsFactors = FALSE)
> x
                a                   b
1 Lefebvre Arnaud             Abe Lyu
2 Lefebvre Arnaud             Abe Lyu
3 Lefebvre Arnaud     Lefebvre Arnaud
4 Lefebvre Arnaud De Dinechin Florent
5 Lefebvre Arnaud De Dinechin Florent
6 Lefebvre Arnaud De Dinechin Florent
> x$a == x$b
[1] FALSE FALSE TRUE FALSE FALSE FALSE
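To see why match produced a constant such as 101 on the full columns, here is a minimal sketch on the same made-up vectors a and b (not the real dataset). match looks up each element of its first argument anywhere in the second, so every "Lefebvre Arnaud" reports the position of the first hit, whereas == compares row by row.

```r
a <- rep("Lefebvre Arnaud", 6)
b <- c("Abe Lyu", "Abe Lyu", "Lefebvre Arnaud", rep("De Dinechin Florent", 3))

# match(): position of the first occurrence of each element of a within b
pos <- match(a, b, nomatch = 0)
print(pos)   # 3 3 3 3 3 3

# ==: element-wise comparison, one result per row
same <- a == b
print(same)  # FALSE FALSE TRUE FALSE FALSE FALSE
```

On the short subsets in the question, the first hit simply did not fall inside the first 10 rows, which is why match returned all zeros there.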
EDIT: Also, you need to make sure that you are comparing apples to apples, so double-check the data type of your columns. Use str(df) to see whether the columns are strings or factors. You can either construct the data frame with stringsAsFactors = FALSE, or convert from factor to character. There are several ways to do that; see "Convert data.frame columns from factors to characters".
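One possible way to do the factor-to-character conversion, as a sketch only: the small data frame below is invented for illustration, and stringsAsFactors = TRUE is set explicitly so the columns start out as factors regardless of R version.

```r
# Hypothetical example data; factors are forced so the conversion has something to do
df <- data.frame(
  CompleteName   = rep("Lefebvre Arnaud", 3),
  CompleteName.1 = c("Abe Lyu", "Lefebvre Arnaud", "Louvet Nicolas"),
  stringsAsFactors = TRUE
)

# Find the factor columns and convert only those to character
is_fac <- vapply(df, is.factor, logical(1))
df[is_fac] <- lapply(df[is_fac], as.character)

str(df)  # both columns are now chr

# The element-wise comparison now works on plain strings
same <- df$CompleteName == df$CompleteName.1
print(same)  # FALSE TRUE FALSE
```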
As others have pointed out, match isn't right here. What you want is equality, which you can test with ==; this gives you TRUE/FALSE. Then as.numeric will give you the desired 1/0, or which will give you the indices.
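A short sketch of both options on plain character vectors (the vectors are the made-up sample data used elsewhere in this thread, and the variable names same, labels and idx are only illustrative):

```r
a <- rep("Lefebvre Arnaud", 6)
b <- c("Abe Lyu", "Abe Lyu", "Lefebvre Arnaud", rep("De Dinechin Florent", 3))

same <- a == b              # logical vector, one entry per row

labels <- as.numeric(same)  # 1 where the names are equal, 0 otherwise
print(labels)               # 0 0 1 0 0 0

idx <- which(same)          # row numbers where the names are equal
print(idx)                  # 3
```

Note that if either column contains NA, == returns NA for that row, so the label will be NA as well and which will skip it.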
But you may still have an issue with factors!
# making up some similar data (adapted from the earlier answer)
a <- rep("Lefebvre Arnaud", 6)
b <- c("Abe Lyu", "Abe Lyu", "Lefebvre Arnaud", rep("De Dinechin Florent", 3))
df <- data.frame(CompleteName = a, CompleteName.1 = b)
which(df$CompleteName == df$CompleteName.1)
# With factor columns (the data.frame default before R 4.0.0) this fails with:
# Error in Ops.factor(df$CompleteName, df$CompleteName.1) :
#   level sets of factors are different
str(df)
# 'data.frame': 6 obs. of 2 variables:
# $ CompleteName : Factor w/ 1 level "Lefebvre Arnaud": 1 1 1 1 1 1
# $ CompleteName.1: Factor w/ 3 levels "Abe Lyu","De Dinechin Florent",..: 1 1 3 2 2 2
stringsAsFactors
Above, the data.frame wasn't constructed with stringsAsFactors = FALSE, and that caused an error. Unfortunately, out of the box, R versions before 4.0.0 coerce strings to factors when loading a csv or creating a data.frame (since R 4.0.0 the default is stringsAsFactors = FALSE). This can be fixed when creating the data.frame by explicitly specifying stringsAsFactors = FALSE:
df <- data.frame(CompleteName = a, CompleteName.1 = b, stringsAsFactors = FALSE)
df[which(df$CompleteName == df$CompleteName.1), ]
##      CompleteName  CompleteName.1
## 3 Lefebvre Arnaud Lefebvre Arnaud
To avoid the issue in the future, run options(stringsAsFactors = FALSE) at the beginning of your R session (or put it at the top of your .R script).
Here's a solution using a data.table, with a performance comparison to the data.frame solution based on the same number of records as in your case.
col1 = sample(x = letters, size = 354903, replace = TRUE)
col2 = sample(x = letters, size = 354903, replace = TRUE)
library(data.table)
dt = data.table(col1 = col1, col2 = col2)
df = data.frame(col1 = col1, col2 = col2)
# comparing the 2 columns
system.time(dt$col1==dt$col2)
system.time(df$col1==df$col2)
# storing the comparison in the table/frame itself
system.time(dt[, col3:= (col1==col2)])
system.time({df$col3 = (df$col1 == df$col2)})
The data.table approach offers a noticeable speedup on my machine: from 0.020s to 0.008s.
Try it for yourself and see. I know this is not really significant with such a small number of rows, but multiply that by 1000 and you'll see a major difference!
Source: https://stackoverflow.com/questions/36345915/matching-two-columns-in-r