Tidying datasets with multiple sections/headers at variable positions

轻奢々 2021-01-22 16:33

Context

I am trying to read in and tidy an Excel file with multiple headers/sections placed at variable positions. The content of these headers needs to

4 answers
  • 2021-01-22 17:09

    For completeness' sake, here's a base R solution that also depends on the expectation that you can make a vector of the elements of col1 that are not city names and use it for reference:

    # make your vector of non-city elements of col1 for reference
    types <- c("Diesel","Gasoline","LPG","Electric")
    
    # use that reference vector to flag city names
    df$city <- ifelse(!df$col1 %in% types, 1, 0)
    # use cumsum with that flag to create a group id
    df$group <- cumsum(df$city)
    
    # use the split/apply/combine approach, splitting on that group id, restructuring
    # each element of the resulting list as desired through lapply, then recombining 
    # the results with do.call and rbind
    newdf <- do.call(rbind, lapply(split(df, df$group), function(x) {
    
      data.frame(city = x$col1[1], type = x$col1, value = x$col2, stringsAsFactors = FALSE)[-1,]
    
    }))
    

    Result:

    > newdf
           city     type value
    1.2 Seattle   Diesel    80
    1.3 Seattle Gasoline    NA
    1.4 Seattle      LPG    10
    1.5 Seattle Electric    10
    2.2  Boston   Diesel    65
    2.3  Boston Gasoline    25
    2.4  Boston Electric    10
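
    To see why the flag plus cumsum step works, here is a minimal sketch on a small data frame. The input is a reconstruction inferred from the result printed above, so the NA values in col2 on the city rows are an assumption, as are the names is_city and group.

    ```r
    # Hypothetical reconstruction of `df`, inferred from the printed result;
    # the col2 values on the city rows are assumed to be NA.
    df <- data.frame(
      col1 = c("Seattle", "Diesel", "Gasoline", "LPG", "Electric",
               "Boston", "Diesel", "Gasoline", "Electric"),
      col2 = c(NA, 80, NA, 10, 10, NA, 65, 25, 10),
      stringsAsFactors = FALSE
    )
    types <- c("Diesel", "Gasoline", "LPG", "Electric")

    # TRUE on header (city) rows, FALSE on measurement rows
    is_city <- !df$col1 %in% types

    # cumsum() goes up by one at every header row, so all rows of a
    # section share the same group id
    group <- cumsum(is_city)

    print(data.frame(col1 = df$col1, is_city = is_city, group = group))
    ```

    Because the group id only changes at a header row, it does not matter how many rows each section has or where the headers sit.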
    
  • 2021-01-22 17:14

    Here is an option that uses the us.cities dataset from the maps package. Matching the elements of 'col1' against the first word of the 'name' column in 'us.cities' flags the city rows, and the cumulative sum of that flag creates a group. Within each group, take the first element of 'col1' as 'city', then drop the first row (slice(-1)).

    library(maps)
    library(dplyr)
    library(stringr)
    df %>% 
       group_by(grp = cumsum(str_detect(col1,str_c("\\b(", 
            str_c(word(us.cities$name, 1), collapse="|"), ")\\b")))) %>% 
       mutate(city = first(col1)) %>% 
       slice(-1) %>% 
       ungroup %>% 
       select(city, type = col1, value = col2)
    # A tibble: 7 x 3
    #  city    type     value
    #  <fct>   <fct>    <dbl>
    #1 Seattle Diesel      80
    #2 Seattle Gasoline    NA
    #3 Seattle LPG         10
    #4 Seattle Electric    10
    #5 Boston  Diesel      65
    #6 Boston  Gasoline    25
    #7 Boston  Electric    10
    

    Another option is to use str_extract to pull out the city directly instead of grouping, and then fill it down as in the other post:

    library(tidyr)  # provides fill()
    df %>% 
       mutate(city = str_extract(col1, str_c("\\b(", 
         str_c(word(us.cities$name, 1), collapse="|"), ")\\b"))) %>% 
       fill(city) %>% 
       filter(col1 != city) %>% 
       select(city, type = col1, value = col2)
    

    NOTE: This also works if 'col1' contains hundreds of other elements besides the cities. Only US cities are considered here; if the data also includes cities from other countries, use the world.cities data from the same package.
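
    For readers unsure what the nested str_c calls build, here is a rough base R sketch of the same idea. The cities vector is a hypothetical stand-in for word(us.cities$name, 1), and paste0/grepl are swapped in for str_c/str_detect so that it runs without any packages.

    ```r
    # Hypothetical stand-in for word(us.cities$name, 1) from the maps package
    cities <- c("Seattle", "Boston", "Denver")

    # One alternation wrapped in word boundaries
    pattern <- paste0("\\b(", paste(cities, collapse = "|"), ")\\b")

    col1 <- c("Seattle", "Diesel", "Gasoline", "Boston", "Electric")

    # TRUE wherever col1 contains a known city name
    is_city <- grepl(pattern, col1, perl = TRUE)

    # Group id that increments at each city row
    grp <- cumsum(is_city)
    print(data.frame(col1, is_city, grp))
    ```

    The word boundaries keep a city name from matching inside a longer word, but a measure that happens to share its name with a city would still be flagged as a header.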

  • 2021-01-22 17:15

    A data.table option.

    Similar to @camille's answer, I assume you can make some vector of measures and if the col1 value isn't in that list it's a city. This groups by the cumsum of not (!) col1 %in% meas, i.e. a group number which increments by 1 each time col1 is not found in meas. Within each group, city is set as the first value of col1 and col1/col2 are renamed appropriately. Then I filter to only rows where city doesn't equal col1 (now renamed type) and remove the grouping variable g.

    library(data.table)
    setDT(df)
    
    meas <- c("Diesel", "Gasoline", "LPG", "Electric")
    
    df[, .(city = first(col1), type = col1, value = col2), 
       by = .(g = cumsum(!col1 %in% meas))
      ][city != type, -'g']
    
    #       city     type value
    # 1: Seattle   Diesel    80
    # 2: Seattle Gasoline    NA
    # 3: Seattle      LPG    10
    # 4: Seattle Electric    10
    # 5:  Boston   Diesel    65
    # 6:  Boston Gasoline    25
    # 7:  Boston Electric    10
    
  • 2021-01-22 17:29

    Assuming you have a finite list of measures (diesel, electric, etc), you can make a list to check against. Any value of col1 not in that set of measures is presumably a city. Extract those (note that it's currently a factor, so I used as.character), fill down, and remove any heading rows.

    library(dplyr)
    
    meas <- c("Diesel", "Gasoline", "LPG", "Electric")
    
    df %>%
      mutate(city = ifelse(!col1 %in% meas, as.character(col1), NA)) %>%
      tidyr::fill(city) %>%
      filter(col1 != city)
    #>       col1 col2    city
    #> 1   Diesel   80 Seattle
    #> 2 Gasoline   NA Seattle
    #> 3      LPG   10 Seattle
    #> 4 Electric   10 Seattle
    #> 5   Diesel   65  Boston
    #> 6 Gasoline   25  Boston
    #> 7 Electric   10  Boston
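
    The as.character call matters only when col1 is a factor, which was the default for data.frame before R 4.0.0. A small sketch of what goes wrong without it, on a made-up vector (the names codes and labels are just for illustration):

    ```r
    # Made-up factor column for illustration
    col1 <- factor(c("Seattle", "Diesel", "Boston", "Electric"))
    meas <- c("Diesel", "Gasoline", "LPG", "Electric")

    # Without as.character(), ifelse() drops the factor attributes and
    # returns the underlying integer codes rather than the labels
    codes <- ifelse(!col1 %in% meas, col1, NA)

    # With as.character(), the city names survive
    labels <- ifelse(!col1 %in% meas, as.character(col1), NA)

    print(codes)
    print(labels)
    ```

    If col1 is already a character vector, as.character is a harmless no-op, so keeping it makes the pipeline work for either input type.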
    