Merge multiple CSV files and remove duplicates in R

Asked by 無奈伤痛 on 2021-01-03 03:45 · backend · 1 answer · 559 views

I have almost 3,000 CSV files (containing tweets) with the same format. I want to merge these files into one new file and remove the duplicate tweets. I have come across var

1 Answer
  • 2021-01-03 04:41

    First, simplify matters by working in the folder that contains the files, and set the pattern so that list.files() only matches files ending in '.csv', something like

    filenames <- list.files(path = ".", pattern='^.*\\.csv$')
    my.df <- do.call("rbind", lapply(filenames, read.csv, header = TRUE))
    

    This should get you a data.frame with the contents of all the tweets.

    A separate issue is the header row in each CSV file. Thankfully you know that all files have the same format, so I'd handle the headers something like this:

    read.csv('fred.csv', header=FALSE, skip=1, sep=';',
        col.names=c('ID','tweet','author','local.time'),
        colClasses=rep('character', 4))
    

    N.B. this reads all columns as character, and treats the files as ';'-separated.

    I'd parse out the time later if it was needed...
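    If the timestamp is eventually needed, a minimal sketch of that parsing step (this assumes local.time holds strings like '2021-01-03 03:45:00'; adjust the format string to whatever is actually in your data):

    ```r
    # parse the character timestamps into POSIXct; the format string is an
    # assumption -- change it to match what is really in local.time
    my.df$local.time <- as.POSIXct(my.df$local.time,
                                   format = '%Y-%m-%d %H:%M:%S')
    ```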

    A further separate issue is the uniqueness of the tweets within the data.frame - but I'm not clear if you want them to be unique to a user or globally unique. For globally unique tweets, something like

    my.new.df <- my.df[!duplicated(my.df$tweet),]
    

    For unique by author, I'd append the two fields - hard to know what works without the real data though!

    my.new.df <- my.df[!duplicated(paste(my.df$tweet, my.df$author)),]
    
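    One caveat with the paste() approach: concatenated fields can collide (e.g. tweet 'a b' with author 'c' and tweet 'a' with author 'b c' both paste to 'a b c'). A variant that sidesteps this, sketched here, is to call duplicated() on the two columns directly:

    ```r
    # duplicated() on a two-column data.frame keys on the (tweet, author)
    # pair itself, so no separator collisions are possible
    my.new.df <- my.df[!duplicated(my.df[, c('tweet', 'author')]), ]
    ```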

    So bringing it all together and assuming a few things along the way...

    # grab our list of filenames
    filenames <- list.files(path = ".", pattern='^.*\\.csv$')
    # write a special little read.csv function to do exactly what we want
    my.read.csv <- function(fnam) { read.csv(fnam, header=FALSE, skip=1, sep=';',
        col.names=c('ID','tweet','author','local.time'),
        colClasses=rep('character', 4)) }
    # read in all those files into one giant data.frame
    my.df <- do.call("rbind", lapply(filenames, my.read.csv))
    # remove the duplicate tweets
    my.new.df <- my.df[!duplicated(my.df$tweet),]
    
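    Once my.new.df looks right, you will probably want to write the merged result back out; a small sketch (the output filename is just an example) that keeps the ';' separator of the inputs:

    ```r
    # write the de-duplicated tweets to one CSV; row.names = FALSE drops
    # R's automatic row index from the output
    write.table(my.new.df, file = 'all_tweets_deduped.csv',
                sep = ';', row.names = FALSE)
    ```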

    Based on the revised warnings after line 3, it's a problem with files that have different numbers of columns. This is not easy to fix in general, except, as you have suggested, by allowing for too many columns in the specification. If you remove the specification entirely, you will run into problems when you try to rbind() the data.frames together...

    Here is some code using a for() loop and some debugging cat() statements to make it more explicit which files are broken, so that you can fix things:

    filenames <- list.files(path = ".", pattern='^.*\\.csv$')
    
    n.files.processed <- 0 # how many files did we process?
    for (fnam in filenames) {
      cat('about to read from file:', fnam, '\n')
      if (exists('tmp.df')) rm(tmp.df)
      tmp.df <- read.csv(fnam, header=FALSE, skip=1, sep=';',
                 col.names=c('ID','tweet','author','local.time','extra'),
                 colClasses=rep('character', 5)) 
      if (exists('tmp.df') && nrow(tmp.df) > 0) {
        cat('  successfully read:', nrow(tmp.df), ' rows from ', fnam, '\n')
        n.files.processed <- n.files.processed + 1
        # now lets append a column containing the originating file name
        # so that debugging the file contents is easier
        tmp.df$fnam <- fnam
    
        # now lets rbind everything together
        if (exists('my.df')) {
          my.df <- rbind(my.df, tmp.df)
        } else {
          my.df <- tmp.df
        }
      } else {
        cat('  read NO rows from ', fnam, '\n')
      }
    }
    cat('processed ', n.files.processed, ' files\n')
    my.new.df <- my.df[!duplicated(my.df$tweet),]
    
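    To spot the broken files before reading anything in, count.fields() reports how many ';'-separated fields each line of a file has, so files with a different column count stand out; a diagnostic sketch (assuming 4 is the expected field count):

    ```r
    # flag any file whose data lines do not all contain exactly 4 fields
    filenames <- list.files(path = ".", pattern = '^.*\\.csv$')
    for (fnam in filenames) {
      n.fields <- count.fields(fnam, sep = ';', skip = 1)
      if (any(n.fields != 4)) {
        cat(fnam, 'has lines with', paste(unique(n.fields), collapse = ', '),
            'fields\n')
      }
    }
    ```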