How to read data when some numbers contain commas as thousand separator?

情书的邮戳 2020-11-22 02:29

I have a CSV file where some of the numerical values are expressed as strings with commas as the thousands separator, e.g. "1,513" instead of 1513. Wh

11 Answers
  • 2020-11-22 02:40

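    If thousands are separated by "." and decimals by "," (e.g. 1.200.000,00), you must set fixed=TRUE when calling gsub, because an unescaped "." is otherwise treated as a regular expression that matches any character: as.numeric(gsub(".","",y,fixed=TRUE)). Note that this call removes only the dots; the decimal comma still has to be replaced with "." before as.numeric can parse the value.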
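
    A minimal base R sketch of the full conversion, assuming the values arrive as a character vector; the sample vector y below is made up for illustration:

```r
# Hypothetical sample values in "1.200.000,00" style
y <- c("1.200.000,00", "3,5", "12.345,67")

# Without fixed = TRUE the pattern "." is a regex wildcard and deletes everything
wildcard <- gsub(".", "", "1.200")

# Step 1: drop the thousands separators (literal dots)
no_dots <- gsub(".", "", y, fixed = TRUE)

# Step 2: turn the decimal comma into a decimal point, then convert
x <- as.numeric(sub(",", ".", no_dots, fixed = TRUE))
print(x)
```

    The two steps are kept separate on purpose, so the intermediate no_dots vector can be inspected if a value fails to convert.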

  • 2020-11-22 02:42

    A dplyr solution using pipes.

    Say you have the following:

    > dft
    Source: local data frame [11 x 5]
    
       Bureau.Name Account.Code   X2014   X2015   X2016
    1       Senate          110 158,000 211,000 186,000
    2       Senate          115       0       0       0
    3       Senate          123  15,000  71,000  21,000
    4       Senate          126   6,000  14,000   8,000
    5       Senate          127 110,000 234,000 134,000
    6       Senate          128 120,000 159,000 134,000
    7       Senate          129       0       0       0
    8       Senate          130 368,000 465,000 441,000
    9       Senate          132       0       0       0
    10      Senate          140       0       0       0
    11      Senate          140       0       0       0
    

    and you want to remove the commas from the year variables X2014-X2016 and convert them to numeric. Also, let's say X2014-X2016 were read in as factors (the default before R 4.0):

    library(dplyr)

    dft %>%
        mutate_at(vars(X2014:X2016), as.character) %>%        # factor -> character
        mutate_at(vars(X2014:X2016), ~ gsub(",", "", .)) %>%  # strip the commas
        mutate_at(vars(X2014:X2016), as.numeric)              # character -> numeric
    

    mutate_at applies the given function to the columns selected with vars(). (mutate_all, by contrast, takes no column selection and transforms every column.)

    I did it sequentially, one function at a time (if you supply multiple functions in a single call, you create additional, unnecessary columns).
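
    With dplyr 1.0 or later, the same conversion can be sketched in a single step with across(). This is an alternative to the approach above, not part of it; the small dft below reuses the first three rows of the table and is rebuilt here only so the snippet runs on its own:

```r
library(dplyr)

# First three rows of the example table, with the year columns as factors
dft <- data.frame(
  Bureau.Name  = "Senate",
  Account.Code = c(110, 115, 123),
  X2014 = c("158,000", "0", "15,000"),
  X2015 = c("211,000", "0", "71,000"),
  X2016 = c("186,000", "0", "21,000"),
  stringsAsFactors = TRUE
)

# Convert factor -> character, strip commas, then convert to numeric
cleaned <- dft %>%
  mutate(across(X2014:X2016,
                ~ as.numeric(gsub(",", "", as.character(.x), fixed = TRUE))))

str(cleaned)
```

    The explicit as.character() guards against as.numeric() being applied to a factor, which would return the internal level codes rather than the printed values.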

  • 2020-11-22 02:42

    A very convenient way is the readr::read_delim family of functions. Taking the example from the question "Importing csv with multiple separators into R", you can do it as follows:

    txt <- 'OBJECTID,District_N,ZONE_CODE,COUNT,AREA,SUM
    1,Bagamoyo,1,"136,227","8,514,187,500.000000000000000","352,678.813105723350000"
    2,Bariadi,2,"88,350","5,521,875,000.000000000000000","526,307.288878142830000"
    3,Chunya,3,"483,059","30,191,187,500.000000000000000","352,444.699742995200000"'
    
    require(readr)
    read_csv(txt) # = read_delim(txt, delim = ",")
    

    This gives the expected result:

    # A tibble: 3 × 6
      OBJECTID District_N ZONE_CODE  COUNT        AREA      SUM
         <int>      <chr>     <int>  <dbl>       <dbl>    <dbl>
    1        1   Bagamoyo         1 136227  8514187500 352678.8
    2        2    Bariadi         2  88350  5521875000 526307.3
    3        3     Chunya         3 483059 30191187500 352444.7
    
  • 2020-11-22 02:45

    You can have read.table or read.csv do this conversion for you semi-automatically. First create a new class definition, then create a conversion function and set it as an "as" method using the setAs function like so:

    setClass("num.with.commas")
    setAs("character", "num.with.commas", 
            function(from) as.numeric(gsub(",", "", from) ) )
    

    Then run read.csv like:

    DF <- read.csv('your.file.here', 
       colClasses=c('num.with.commas','factor','character','numeric','num.with.commas'))
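
    As a self-contained sketch of the same setAs approach, the snippet below reads a small inline CSV instead of a file. The csv_text sample and its column names are invented for illustration:

```r
library(methods)  # provides setClass/setAs; attached by default in interactive R

# Register a virtual class plus a character -> numeric coercion that drops commas
setClass("num.with.commas")
setAs("character", "num.with.commas",
      function(from) as.numeric(gsub(",", "", from, fixed = TRUE)))

# Hypothetical sample data: the quoted fields contain thousands separators
csv_text <- 'id,amount
1,"1,513"
2,"27"
3,"1,234,567.5"'

# read.csv looks up the registered coercion for the custom class name
DF <- read.csv(text = csv_text,
               colClasses = c("integer", "num.with.commas"))
str(DF)
```

    Because the conversion happens while reading, the column never exists as a character or factor column that has to be repaired afterwards.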
    
  • 2020-11-22 02:45

    This question is several years old, but I stumbled upon it, which means maybe others will.

    The readr library / package has some nice features to it. One of them is a nice way to interpret "messy" columns, like these.

    library(readr)
    read_csv("numbers\n800\n\"1,800\"\n\"3500\"\n6.5",
              col_types = list(col_number())  # col_number() replaces the older col_numeric()
            )
    

    This yields

    Source: local data frame [4 x 1]

      numbers
        (dbl)
    1   800.0
    2  1800.0
    3  3500.0
    4     6.5
    

    An important point when reading in files: you either have to pre-process, like the comment above regarding sed, or you have to process while reading. Often, if you try to fix things after the fact, there are some dangerous assumptions made that are hard to find. (Which is why flat files are so evil in the first place.)

    For instance, if I had not flagged the col_types, I would have gotten this:

    > read_csv("numbers\n800\n\"1,800\"\n\"3500\"\n6.5")
    Source: local data frame [4 x 1]
    
      numbers
        (chr)
    1     800
    2   1,800
    3    3500
    4     6.5
    

    (Notice that it is now a chr (character) instead of a numeric.)

    Or, more dangerously, if it were long enough and most of the early elements did not contain commas:

    > set.seed(1)
    > tmp <- as.character(sample(c(1:10), 100, replace=TRUE))
    > tmp <- c(tmp, "1,003")
    > tmp <- paste(tmp, collapse="\"\n\"")
    

    (such that the last few elements look like:)

    \"5\"\n\"9\"\n\"7\"\n\"1,003"
    

    Then you'll find trouble reading that comma at all!

    > tail(read_csv(tmp))
    Source: local data frame [6 x 1]
    
         3"
      (dbl)
    1 8.000
    2 5.000
    3 5.000
    4 9.000
    5 7.000
    6 1.003
    Warning message:
    1 problems parsing literal data. See problems(...) for more details. 
    
  • 2020-11-22 02:46

    We can also use readr::parse_number, although the columns must be character vectors. To apply it to multiple columns, we can loop over them with lapply:

    df[2:3] <- lapply(df[2:3], readr::parse_number)
    df
    
    #  a        b        c
    #1 a    12234       12
    #2 b      123  1234123
    #3 c     1234     1234
    #4 d 13456234    15342
    #5 e    12312 12334512
    

    Or use mutate_at from dplyr to apply it to specific variables.

    library(dplyr)
    df %>% mutate_at(2:3, readr::parse_number)
    #Or
    df %>% mutate_at(vars(b:c), readr::parse_number)
    

    data

    df <- data.frame(a = letters[1:5], 
                     b = c("12,234", "123", "1,234", "13,456,234", "123,12"),
                     c = c("12", "1,234,123","1234", "15,342", "123,345,12"), 
                     stringsAsFactors = FALSE)
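
    parse_number also accepts a locale argument, which may help when the grouping and decimal marks are swapped. A short sketch, with sample strings made up for illustration:

```r
library(readr)

# Default locale: "," is the grouping mark; surrounding non-numeric text is dropped
us <- parse_number(c("$1,234.50", "12,234", "15%"))

# Hypothetical European-style input: "." groups thousands, "," marks decimals
eu <- parse_number(c("1.200.000,00", "3,5"),
                   locale = locale(decimal_mark = ",", grouping_mark = "."))

print(us)
print(eu)
```

    Note that parse_number simply ignores grouping marks wherever they occur, which is why an input such as "123,12" in the data above is read as 12312.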
    