What can R do about a messy data format?

前端 未结 6 1806
时光说笑
时光说笑 2020-11-28 08:27

Sometimes I see data posted in a Stack Overflow question formatted like in this question. This is not the first time, so I have decided to ask a question about it, and

相关标签:
6条回答
  • 2020-11-28 08:54
    md_table <- scan(text = "
    +------------+------+------+----------+--------------------------+
    |    Date    | Emp1 | Case | Priority | PriorityCountinLast7days |
    +------------+------+------+----------+--------------------------+
    | 2018-06-01 | A    | A1   |        0 |                        0 |
    | 2018-06-03 | A    | A2   |        0 |                        1 |
    | 2018-06-03 | A    | A3   |        0 |                        2 |
    | 2018-06-03 | A    | A4   |        1 |                        1 |
    | 2018-06-03 | A    | A5   |        2 |                        1 |
    | 2018-06-04 | A    | A6   |        0 |                        3 |
    | 2018-06-01 | B    | B1   |        0 |                        1 |
    | 2018-06-02 | B    | B2   |        0 |                        2 |
    | 2018-06-03 | B    | B3   |        0 |                        3 |
    +------------+------+------+----------+--------------------------+",
    what = "", sep = "", comment.char = "+", quiet = TRUE)
    
    ## it is clear that there are 5 columns
    mat <- matrix(md_table[md_table != "|"], ncol = 5, byrow = TRUE)
    #      [,1]         [,2]   [,3]   [,4]       [,5]                      
    # [1,] "Date"       "Emp1" "Case" "Priority" "PriorityCountinLast7days"
    # [2,] "2018-06-01" "A"    "A1"   "0"        "0"                       
    # [3,] "2018-06-03" "A"    "A2"   "0"        "1"                       
    # [4,] "2018-06-03" "A"    "A3"   "0"        "2"                       
    # [5,] "2018-06-03" "A"    "A4"   "1"        "1"                       
    # [6,] "2018-06-03" "A"    "A5"   "2"        "1"                       
    # [7,] "2018-06-04" "A"    "A6"   "0"        "3"                       
    # [8,] "2018-06-01" "B"    "B1"   "0"        "1"                       
    # [9,] "2018-06-02" "B"    "B2"   "0"        "2"                       
    #[10,] "2018-06-03" "B"    "B3"   "0"        "3"
    

    ## a data frame with all character columns
    dat <- setNames(data.frame(mat[-1, ], stringsAsFactors = FALSE), mat[1, ])
    #        Date Emp1 Case Priority PriorityCountinLast7days
    #1 2018-06-01    A   A1        0                        0
    #2 2018-06-03    A   A2        0                        1
    #3 2018-06-03    A   A3        0                        2
    #4 2018-06-03    A   A4        1                        1
    #5 2018-06-03    A   A5        2                        1
    #6 2018-06-04    A   A6        0                        3
    #7 2018-06-01    B   B1        0                        1
    #8 2018-06-02    B   B2        0                        2
    #9 2018-06-03    B   B3        0                        3
    

    ## or maybe just use `type.convert` on some columns?
    dat[] <- lapply(dat, type.convert)
    
    0 讨论(0)
  • 2020-11-28 08:58

    The short answer to the question is yes, R code can solve that mess and no, it doesn't take that much trouble.

    The first step after copying & pasting the table into an R session is to read it in with read.table setting the header, sep, comment.char and strip.white arguments.

    Credits for reminding me of arguments comment.char and strip.white go to @nicola, and his comment.

    dat <- read.table(text = "
    +------------+------+------+----------+--------------------------+
    |    Date    | Emp1 | Case | Priority | PriorityCountinLast7days |
    +------------+------+------+----------+--------------------------+
    | 2018-06-01 | A    | A1   |        0 |                        0 |
    | 2018-06-03 | A    | A2   |        0 |                        1 |
    | 2018-06-03 | A    | A3   |        0 |                        2 |
    | 2018-06-03 | A    | A4   |        1 |                        1 |
    | 2018-06-03 | A    | A5   |        2 |                        1 |
    | 2018-06-04 | A    | A6   |        0 |                        3 |
    | 2018-06-01 | B    | B1   |        0 |                        1 |
    | 2018-06-02 | B    | B2   |        0 |                        2 |
    | 2018-06-03 | B    | B3   |        0 |                        3 |
    +------------+------+------+----------+--------------------------+
    ", header = TRUE, sep = "|", comment.char = "+", strip.white = TRUE)
    

    But as you can see there are some issues with the result.

    dat
       X       Date Emp1 Case Priority PriorityCountinLast7days X.1
    1 NA 2018-06-01    A   A1        0                        0  NA
    2 NA 2018-06-03    A   A2        0                        1  NA
    3 NA 2018-06-03    A   A3        0                        2  NA
    4 NA 2018-06-03    A   A4        1                        1  NA
    5 NA 2018-06-03    A   A5        2                        1  NA
    6 NA 2018-06-04    A   A6        0                        3  NA
    7 NA 2018-06-01    B   B1        0                        1  NA
    8 NA 2018-06-02    B   B2        0                        2  NA
    9 NA 2018-06-03    B   B3        0                        3  NA
    

    To have separators start and end each data row made R believe those separators mark extra columns, which is not what is meant by the original question's OP.

    So the second step is to keep only the real columns. I will do this subsetting the columns by their numbers, easily done, they usually are the first and last columns.

    dat <- dat[-c(1, ncol(dat))]
    dat
              Date   Emp1   Case Priority PriorityCountinLast7days
    1  2018-06-01   A      A1           0                        0
    2  2018-06-03   A      A2           0                        1
    3  2018-06-03   A      A3           0                        2
    4  2018-06-03   A      A4           1                        1
    5  2018-06-03   A      A5           2                        1
    6  2018-06-04   A      A6           0                        3
    7  2018-06-01   B      B1           0                        1
    8  2018-06-02   B      B2           0                        2
    9  2018-06-03   B      B3           0                        3
    

    That wasn't too hard, much better.
    In this case there is still a problem, to coerce column Date to class Date.

    dat$Date <- as.Date(dat$Date)
    

    And the result is satisfactory.

    str(dat)
    'data.frame':   9 obs. of  5 variables:
     $ Date                    : Date, format: "2018-06-01" "2018-06-03" ...
     $ Emp1                    : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 2 2 2
     $ Case                    : Factor w/ 9 levels "A1","A2","A3",..: 1 2 3 4 5 6 7 8 9
     $ Priority                : int  0 0 0 1 2 0 0 0 0
     $ PriorityCountinLast7days: int  0 1 2 1 1 3 1 2 3
    

    Note that I have not set the more or less standard argument stringsAsFactors = FALSE. If needed, this should be done when running read.table.

    The whole process took only 3 lines of base R code.

    Finally, the end result in dput format, like it should be in the first place.

    dat <-
    structure(list(Date = structure(c(17683, 17685, 17685, 17685, 
    17685, 17686, 17683, 17684, 17685), class = "Date"), Emp1 = c("A", 
    "A", "A", "A", "A", "A", "B", "B", "B"), Case = c("A1", "A2", 
    "A3", "A4", "A5", "A6", "B1", "B2", "B3"), Priority = c(0, 0, 
    0, 1, 2, 0, 0, 0, 0), PriorityCountinLast7days = c(0, 1, 2, 1, 
    1, 3, 1, 2, 3)), row.names = c(NA, -9L), class = "data.frame")
    
    0 讨论(0)
  • 2020-11-28 09:01

    Well, about this specific dataset I used the import feature in RStudio, but I took one additional step beforehand.

    1. Copy the dataset into the Notepad file.
    2. Replace all | characters with ,
    3. Import the Notepad file using read.csv to RStudio using this code (seperate columns by ,).

    But, if you mean use the R to fully understand it in one step, then I have no idea.

    0 讨论(0)
  • The issue isn't so much how many lines of code it takes, two or five, not much difference. The question is more whether it will work beyond the example you posted here.

    I haven't come across this sort of thing in the wild, but I had a go at constructing another example that I thought could conceivably exist.


    I've since come across a couple more cases and added them to the test suite.

    I've also included a table drawn using box-drawing characters. You don't come across this much these days, but for completeness' sake it's here.

    x1 <- "
    +------------+------+------+----------+--------------------------+
    |    Date    | Emp1 | Case | Priority | PriorityCountinLast7days |
    +------------+------+------+----------+--------------------------+
    | 2018-06-01 | A    | A1   |        0 |                        0 |
    | 2018-06-03 | A    | A2   |        0 |                        1 |
    | 2018-06-02 | B    | B2   |        0 |                        2 |
    | 2018-06-03 | B    | B3   |        0 |                        3 |
    +------------+------+------+----------+--------------------------+
    "
    
    x2 <- "
    ––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
        Date    | Emp1 | Case | Priority | PriorityCountinLast7days 
    ––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
     2018-06-01 | A    | A|1  |        0 |                        0 
     2018-06-03 | A    | A|2  |        0 |                        1 
     2018-06-02 | B    | B|2  |        0 |                        2 
     2018-06-03 | B    | B|3  |        0 |                        3 
    ––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
    "
    
    x3 <- "
     Maths | English | Science | History | Class
    
      0.1  |  0.2    |  0.3    |  0.2    |  Y2
    
      0.9  |  0.5    |  0.7    |  0.4    |  Y1
    
      0.2  |  0.4    |  0.6    |  0.2    |  Y2
    
      0.9  |  0.5    |  0.2    |  0.7    |  Y1
    "
    
    x4 <- "
           Season   |   Team  | W | AHWO
    -------------------------------------
    1  |  2017/2018 |  TeamA  | 2 | 1.75
    2  |  2017/2018 |  TeamB  | 1 | 1.85
    3  |  2017/2018 |  TeamC  | 1 | 1.70
    4  |  2016/2017 |  TeamA  | 1 | 1.49
    5  |  2016/2017 |  TeamB  | 3 | 1.51
    6  |  2016/2017 |  TeamC  | 2 | N/A
    "
    
    x5 <- "
        A   B   C
      ┌───┬───┬───┐
    A │ 5 │ 1 │ 4 │
      ├───┼───┼───┤
    B │ 2 │ 5 │ 3 │
      ├───┼───┼───┤
    C │ 3 │ 4 │ 4 │
      └───┴───┴───┘
    "
    
    x6 <- "
    ------------------------------------------------------------
    |date              |Material          |Description         |
    |----------------------------------------------------------|
    |10/04/2013        |WM.5597394        |PNEUMATIC           |
    |11/07/2013        |GB.D040790        |RING                |
    ------------------------------------------------------------
    ------------------------------------------------------------
    |date              |Material          |Description         |
    |----------------------------------------------------------|
    |08/06/2013        |WM.4M01004A05     |TOUCHEUR            |
    |08/06/2013        |WM.4M010108-1     |LEVER               |
    ------------------------------------------------------------
    "
    

    My go at a function

    f <- function(x=x6, header=TRUE, rem.dup.header=header, 
      na.strings=c("NA", "N/A"), stringsAsFactors=FALSE, ...) {
    
        # read each row as a character string
        x <- scan(text=x, what="character", sep="\n", quiet=TRUE)
    
        # keep only lines containing alphanumerics
        x <- x[grep("[[:alnum:]]", x)]
    
        # remove vertical bars with trailing or leading space
        x <- gsub("\\|? | \\|?", " ", x)
    
        # remove vertical bars at beginning and end of string
        x <- gsub("\\|?$|^\\|?", "", x)
    
        # remove vertical box-drawing characters
        x <- gsub("\U2502|\U2503|\U2505|\U2507|\U250A|\U250B", " ", x)
    
        if (rem.dup.header) {
            dup.header <- x == x[1]
            dup.header[1] <- FALSE
            x <- x[!dup.header]
        }
    
        # read the result as a table
        read.table(text=paste(x, collapse="\n"), header=header, 
          na.strings=na.strings, stringsAsFactors=stringsAsFactors, ...)    
    }
    
    
    lapply(c(x1, x2, x3, x4, x5, x6), f)
    

    Output

    [[1]]
            Date Emp1 Case Priority PriorityCountinLast7days
    1 2018-06-01    A   A1        0                        0
    2 2018-06-03    A   A2        0                        1
    3 2018-06-02    B   B2        0                        2
    4 2018-06-03    B   B3        0                        3
    
    [[2]]
            Date Emp1 Case Priority PriorityCountinLast7days
    1 2018-06-01    A  A|1        0                        0
    2 2018-06-03    A  A|2        0                        1
    3 2018-06-02    B  B|2        0                        2
    4 2018-06-03    B  B|3        0                        3
    
    [[3]]
      Maths English Science History Class
    1   0.1     0.2     0.3     0.2    Y2
    2   0.9     0.5     0.7     0.4    Y1
    3   0.2     0.4     0.6     0.2    Y2
    4   0.9     0.5     0.2     0.7    Y1
    
    [[4]]
         Season  Team W AHWO
    1 2017/2018 TeamA 2 1.75
    2 2017/2018 TeamB 1 1.85
    3 2017/2018 TeamC 1 1.70
    4 2016/2017 TeamA 1 1.49
    5 2016/2017 TeamB 3 1.51
    6 2016/2017 TeamC 2   NA
    
    [[5]]
      A B C
    A 5 1 4
    B 2 5 3
    C 3 4 4
    
    [[6]]
            date      Material Description
    1 10/04/2013    WM.5597394   PNEUMATIC
    2 11/07/2013    GB.D040790        RING
    3 08/06/2013 WM.4M01004A05    TOUCHEUR
    4 08/06/2013 WM.4M010108-1       LEVER
    

    x3 is from here (will have to look at the edit history).
    x4 is from here
    x6 is from here

    0 讨论(0)
  • 2020-11-28 09:19

    Using data.table::fread:

    x = '
    +------------+------+------+----------+--------------------------+
    |    Date    | Emp1 | Case | Priority | PriorityCountinLast7days |
    +------------+------+------+----------+--------------------------+
    | 2018-06-01 | A    | A1   |        0 |                        0 |
    | 2018-06-03 | A    | A2   |        0 |                        1 |
    | 2018-06-03 | A    | A3   |        0 |                        2 |
    | 2018-06-03 | A    | A4   |        1 |                        1 |
    | 2018-06-03 | A    | A5   |        2 |                        1 |
    | 2018-06-04 | A    | A6   |        0 |                        3 |
    | 2018-06-01 | B    | B1   |        0 |                        1 |
    | 2018-06-02 | B    | B2   |        0 |                        2 |
    | 2018-06-03 | B    | B3   |        0 |                        3 |
    +------------+------+------+----------+--------------------------+
    '
    
    fread(gsub('\\+.+\\n' ,'', x, perl = T), drop=c(1,7))
    
    #          Date Emp1 Case Priority PriorityCountinLast7days
    # 1: 2018-06-01    A   A1        0                        0
    # 2: 2018-06-03    A   A2        0                        1
    # 3: 2018-06-03    A   A3        0                        2
    # 4: 2018-06-03    A   A4        1                        1
    # 5: 2018-06-03    A   A5        2                        1
    # 6: 2018-06-04    A   A6        0                        3
    # 7: 2018-06-01    B   B1        0                        1
    # 8: 2018-06-02    B   B2        0                        2
    # 9: 2018-06-03    B   B3        0                        3
    

    The gsub part removes the horizontal rules. drop removes the extra columns caused by delimiters at the line ends.

    0 讨论(0)
  • 2020-11-28 09:20

    As it was suggested, you could use dput to save the content of a dataframe to a file, open the file in a text editor and paste its content. An example of mtcar's dataset limited to first 10 rows:

    dput(mtcars  %>% head(10), file = 'reproducible.txt')
    

    The content of reproducible.txt can be used to make a dataframe/tibble as shown below. In such a case data the format is machine readable, but it is hard to be undestood by human at first glance (without pasting into R).

    df <- structure(list(mpg = c(21, 21, 22.8, 21.4, 18.7, 18.1, 14.3,
    24.4, 22.8, 19.2), cyl = c(6, 6, 4, 6, 8, 6, 8, 4, 4, 6), disp = c(160,
    160, 108, 258, 360, 225, 360, 146.7, 140.8, 167.6), hp = c(110,
    110, 93, 110, 175, 105, 245, 62, 95, 123), drat = c(3.9, 3.9,
    3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92), wt = c(2.62,
    2.875, 2.32, 3.215, 3.44, 3.46, 3.57, 3.19, 3.15, 3.44), qsec = c(16.46,
    17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20, 22.9, 18.3), vs = c(0,
    0, 1, 1, 0, 1, 0, 1, 1, 1), am = c(1, 1, 1, 0, 0, 0, 0, 0, 0,
    0), gear = c(4, 4, 4, 3, 3, 3, 3, 4, 4, 4), carb = c(4, 4, 1,
    1, 2, 1, 4, 2, 2, 4)), .Names = c("mpg", "cyl", "disp", "hp",
    "drat", "wt", "qsec", "vs", "am", "gear", "carb"), row.names = c("Mazda RX4",
    "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive", "Hornet Sportabout",
    "Valiant", "Duster 360", "Merc 240D", "Merc 230", "Merc 280"), class = "data.frame")
    
    0 讨论(0)
提交回复
热议问题