问题
I found some similar questions such as this one (about comparing attributes in XML files), this one (about a case where the compared values are numeric) and this one (about getting a number of columns that differ between two files) but nothing about this particular problem.
I have two CSV text files on which many, but not all, rows are equal. The files have the same amount of columns with same data type on the columns but they do not have the same amount of rows. The amount of rows on both files is around 120K and both files have some rows that are not on the other.
Simplified versions of these files would look as shown below.
File 1:
PROFILE.ID,CITY,STATE,USERID
2265,Miami,Florida,EL4950
4350,Nashville,Tennessee,GW7420
5486,Durango,Colorado,BH9012
R719,Flagstaff,Arizona,YT7460
Z551,Flagstaff,Arizona,ML1451
File 2:
PROFILE.ID,CITY,STATE,USERID
1173,Nashville,Tennessee,GW7420
2265,Miami,Florida,EL4950
R540,Flagstaff,Arizona,YT7460
T216,Durango,Colorado,BH9012
In the actual files many of the USERID
values in the first file can also be found in the second file (some may not be present however). Also while the USERID
values are unchanged for all users, their PROFILE.ID
may have changed.
The problem is that I would have to locate the rows where the PROFILE.ID
has changed.
I am thinking that I would have to use the following sequence of steps to analyze it in R:
- Load both files to R Studio as data frames
- Loop through the
USERID
column on the first file (which has more rows) - Search the second file for each
USERID
found in the first file - Return the corresponding
PROFILE.ID
from second file - Compare the returned value with what is in the first file
- Output the rows where the
PROFILE.ID
values differ
I was thinking of writing something like the code shown below but am not sure if there are better ways to accomplish this.
library(tidyverse)
con1 <- file("file1.csv", open = "r")
con2 <- file("file2.csv", open = "r")
file1 <- read.csv(con1, fill = F, colClasses = "character")
file2 <- read.csv(con2, fill = F, colClasses = "character")
for (i in seq(nrow(file1))) {
profIDFile1 <- file1$PROFILE.ID[i]
userIDFile1 <- file1$USERID[i]
profIDRowFile2 <- filter(file2, USERID == userIDFile1)
profIDFile2 <- profIDRowFile2$PROFILE.ID
if (profIDFile1 != profIDFile2) {
output < - profIDRowFile2
}
}
write.csv(output, file='result.csv', row.names=FALSE, quote=FALSE)
close(con1)
close(con2)
Question: Is there a package in R that can do this kind of comparison or what would be a good way to accomplish this in R script?
回答1:
I think you can do this with a simple join:
library(dplyr)
full_join(file1, file2, by = "USERID") %>%
filter(PROFILE.ID.x != PROFILE.ID.y)
# PROFILE.ID.x CITY.x STATE.x USERID PROFILE.ID.y CITY.y STATE.y
# 1 4350 Nashville Tennessee GW7420 1173 Nashville Tennessee
# 2 5486 Durango Colorado BH9012 T216 Durango Colorado
# 3 R719 Flagstaff Arizona YT7460 R540 Flagstaff Arizona
This shows that those three USERID
rows have differeing PROFILE.ID
fields. (The .x
are from file1
, .y
from file2
.)
That test does not deal very well with IDs that are missing in one, so you might add logic such as:
full_join(file1, file2, by = "USERID") %>%
filter(is.na(PROFILE.ID.x) | is.na(PROFILE.ID.y) |
PROFILE.ID.x != PROFILE.ID.y)
# PROFILE.ID.x CITY.x STATE.x USERID PROFILE.ID.y CITY.y STATE.y
# 1 4350 Nashville Tennessee GW7420 1173 Nashville Tennessee
# 2 5486 Durango Colorado BH9012 T216 Durango Colorado
# 3 R719 Flagstaff Arizona YT7460 R540 Flagstaff Arizona
# 4 Z551 Flagstaff Arizona ML1451 <NA> <NA> <NA>
The fourth row indicates an ID missing in file2
. This here is likely an artifact of a small sample dataset (which is good on SO :-), I'm not certain if this is interesting or meaningful to you.
回答2:
We can do this with base R
subset(merge(file, file2, by = 'USERID'), PROFILE.ID.x != PROFILE.ID.y)
来源:https://stackoverflow.com/questions/60021175/r-how-to-locate-and-compare-a-particular-element-between-two-csv-text-files