How to read data when some numbers contain commas as thousand separator?

情书的邮戳 2020-11-22 02:29

I have a CSV file in which some of the numeric values are stored as strings with a comma as the thousands separator, e.g. "1,513" instead of 1513. Wh

11 answers
  • 2020-11-22 02:56

    I think preprocessing is the way to go. You could use Notepad++ which has a regular expression replace option.

    For example, if your file were like this:

    "1,234","123","1,234"
    "234","123","1,234"
    123,456,789
    

    Then you could search for the regular expression "([0-9]+),([0-9]+)" (the quotation marks are part of the pattern) and replace each match with \1\2 to get:

    1234,"123",1234
    "234","123",1234
    123,456,789
    

    Then you could use x <- read.csv(file="x.csv",header=FALSE) to read the file.
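
    The same substitution can also be done inside R instead of Notepad++. This is only a sketch: the sample text is copied from the example above, the variable names raw and cleaned are made up for illustration, and the pattern handles a single comma per quoted field, so a value with two or more commas would need a more general pattern.

    ```r
    # Sample data from the example above (for a real file: raw <- readLines("x.csv"))
    raw <- c('"1,234","123","1,234"',
             '"234","123","1,234"',
             '123,456,789')

    # The quotes are part of the pattern, so only quoted fields that
    # contain a comma are rewritten; unquoted separators are left alone.
    cleaned <- gsub('"([0-9]+),([0-9]+)"', '\\1\\2', raw)

    # Parse the cleaned lines as CSV without writing them back to disk
    x <- read.csv(text = cleaned, header = FALSE)
    ```

    Passing the cleaned lines through the text argument avoids creating an intermediate file.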

  • 2020-11-22 02:57

    Not sure about how to have read.csv interpret it properly, but you can use gsub to replace "," with "", and then convert the string to numeric using as.numeric:

    y <- c("1,200","20,000","100","12,111")
    as.numeric(gsub(",", "", y))
    # [1]  1200 20000   100 12111
    

    This was also answered previously on R-Help (and in Q2 here).

    Alternatively, you can pre-process the file, for instance with sed in unix.

  • 2020-11-22 02:58

    "Preprocess" in R:

    lines <- "www, rrr, 1,234, ttt \n rrr,zzz, 1,234,567,987, rrr"
    

    You can use readLines on a textConnection, then remove only the commas that are between digits:

    gsub("([0-9]+)\\,([0-9])", "\\1\\2", lines)
    
    ## [1] "www, rrr, 1234, ttt \n rrr,zzz, 1234567987, rrr"
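
    The readLines step mentioned above is not shown, so here is a rough sketch of the whole round trip. The names con, raw, cleaned and df are made up for illustration. Note the assumption that no two genuinely separate numeric fields sit next to each other without a space: the pattern would merge something like 123,456 into one number.

    ```r
    lines <- "www, rrr, 1,234, ttt \n rrr,zzz, 1,234,567,987, rrr"

    # readLines splits the text into one element per line
    con <- textConnection(lines)
    raw <- readLines(con)
    close(con)

    # Drop every comma that sits directly between two digits
    cleaned <- gsub("([0-9]+),([0-9])", "\\1\\2", raw)

    # strip.white removes the spaces around each field
    df <- read.csv(text = cleaned, header = FALSE, strip.white = TRUE)
    ```

    After this, the third column of df holds the two numbers with their commas removed.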
    

    It's also useful to know, though not directly relevant to this question, that commas used as decimal separators can be handled by read.csv2 (automagically) or by read.table (by setting the 'dec' parameter).

    Edit: Later I discovered how to use colClasses by designing a new class. See:

    How to load df with 1000 separator in R as numeric class?
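
    A minimal sketch of that colClasses idea, assuming a helper class whose name (num.with.commas) and sample data are invented here for illustration: read.csv converts any column whose colClasses entry is not a built-in type through methods::as, so a setAs coercion from character can strip the commas on the way in.

    ```r
    library(methods)

    # Hypothetical class name; it only serves as a label for colClasses
    setClass("num.with.commas")

    # Coercion used by read.csv: strip commas, then convert to numeric
    setAs("character", "num.with.commas",
          function(from) as.numeric(gsub(",", "", from, fixed = TRUE)))

    # Made-up sample data with quoted, comma-formatted numbers
    csv_text <- 'id,amount\na,"1,513"\nb,"20,000"\nc,"100"'

    df <- read.csv(text = csv_text,
                   colClasses = c("character", "num.with.commas"))
    ```

    The conversion then happens while the file is read, so no separate clean-up pass over the columns is needed.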

  • 2020-11-22 02:59

    Using the read_delim function from the readr package, you can specify an additional parameter:

    locale = locale(decimal_mark = ",")
    
    read_delim("filetoread.csv", ";", locale = locale(decimal_mark = ","))
    

    *The semicolon in the second line tells read_delim that the values in the file are semicolon-separated.

    This reads numbers that use a comma as the decimal mark as proper numbers.

    Regards

    Mateusz Kania

  • 2020-11-22 03:00

    I want to use R rather than pre-processing the data as it makes it easier when the data are revised. Following Shane's suggestion of using gsub, I think this is about as neat as I can do:

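    # Read every column as character so values like "1,513" stay intact
    x <- read.csv("file.csv", header = TRUE, colClasses = "character")
    col2cvt <- 15:41  # columns that hold comma-formatted numbers
    # Strip the commas, then convert each selected column to numeric
    x[, col2cvt] <- lapply(x[, col2cvt], function(v) as.numeric(gsub(",", "", v)))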
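
    To try this without a file, here is a small self-contained variant. The column names and values are invented for illustration, and the columns are picked by name rather than by position, which may be easier to keep correct when the data are revised.

    ```r
    # Made-up sample data; the real file and its columns will differ
    csv_text <- 'name,sales,units\nnorth,"1,513","12"\nsouth,"20,000","1,200"'

    # Read everything as character first
    x <- read.csv(text = csv_text, header = TRUE, colClasses = "character")

    # Hypothetical names of the columns that need converting
    col2cvt <- c("sales", "units")
    x[col2cvt] <- lapply(x[col2cvt],
                         function(v) as.numeric(gsub(",", "", v, fixed = TRUE)))
    ```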
    