Context
I am trying to read in and tidy an Excel file with multiple headers/sections placed at variable positions. The content of these headers needs to be carried down into its own column so that every row is labelled with the section (here, a city name) it belongs to.
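For reference, here is a minimal sketch of the kind of input the answers below assume; the readxl call and the file name are placeholders, and the values are reconstructed from the outputs shown further down:

library(readxl)
# assumed read step; "cities.xlsx" and the column names are placeholders
# df <- read_excel("cities.xlsx", col_names = c("col1", "col2"))

# reconstructed example input: city rows act as section headers above their measures
df <- data.frame(
  col1 = c("Seattle", "Diesel", "Gasoline", "LPG", "Electric",
           "Boston", "Diesel", "Gasoline", "Electric"),
  col2 = c(NA, 80, NA, 10, 10, NA, 65, 25, 10),
  stringsAsFactors = TRUE  # col1 appears as a factor in the answers' outputs
)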
For completeness' sake, here's a base R solution that also relies on being able to build a reference vector of the elements of col1 that are not city names:
# make your vector of non-city elements of col1 for reference
types <- c("Diesel", "Gasoline", "LPG", "Electric")
# use that reference vector to flag city names
df$city <- ifelse(!df$col1 %in% types, 1, 0)
# use cumsum with that flag to create a group id
df$group <- cumsum(df$city)
# use the split/apply/combine approach: split on that group id, restructure
# each element of the resulting list as desired through lapply, then recombine
# the results with do.call and rbind
newdf <- do.call(rbind, lapply(split(df, df$group), function(x) {
  data.frame(city = x$col1[1], type = x$col1, value = x$col2,
             stringsAsFactors = FALSE)[-1, ]
}))
Result:
> newdf
city type value
1.2 Seattle Diesel 80
1.3 Seattle Gasoline NA
1.4 Seattle LPG 10
1.5 Seattle Electric 10
2.2 Boston Diesel 65
2.3 Boston Gasoline 25
2.4 Boston Electric 10
Here is an option based on creating a group from the us.cities dataset in the maps package: match the elements of 'col1' against the 'name' column of 'us.cities' to create a group id, set the first element of 'col1' within each group as 'city', then drop the first row of each group with slice(-1).
library(maps)
library(dplyr)
library(stringr)
df %>%
  group_by(grp = cumsum(str_detect(col1, str_c("\\b(",
    str_c(word(us.cities$name, 1), collapse = "|"), ")\\b")))) %>%
  mutate(city = first(col1)) %>%
  slice(-1) %>%
  ungroup %>%
  select(city, type = col1, value = col2)
# A tibble: 7 x 3
# city type value
# <fct> <fct> <dbl>
#1 Seattle Diesel 80
#2 Seattle Gasoline NA
#3 Seattle LPG 10
#4 Seattle Electric 10
#5 Boston Diesel 65
#6 Boston Gasoline 25
#7 Boston Electric 10
Or another option is to use str_extract instead of grouping and then fill, as in the other post:
df %>%
  mutate(city = str_extract(col1, str_c("\\b(",
    str_c(word(us.cities$name, 1), collapse = "|"), ")\\b"))) %>%
  tidyr::fill(city) %>%
  filter(col1 != city) %>%
  select(city, type = col1, value = col2)
NOTE: This would also work if there are hundreds of other elements in 'col1' besides the cities. Here we considered only US cities; if the data also includes cities from other countries, use the world.cities dataset from the same package.
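A minimal sketch of that substitution, reusing the str_extract variant above with world.cities$name swapped in. Note that world.cities is much larger than us.cities, so the generated alternation regex gets very long; in practice you may want to filter it down to the relevant countries first:

library(maps)
library(dplyr)
library(stringr)

# same pipeline as above, only the lookup table changes
df %>%
  mutate(city = str_extract(col1, str_c("\\b(",
    str_c(word(world.cities$name, 1), collapse = "|"), ")\\b"))) %>%
  tidyr::fill(city) %>%
  filter(col1 != city) %>%
  select(city, type = col1, value = col2)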
A data.table option.
Similar to @camille's answer, I assume you can make some vector of measures, and if the col1 value isn't in that list it's a city. This groups by the cumsum of not (!) col1 %in% meas, i.e. a group number which increments by 1 each time col1 is not found in meas. Within each group, city is set as the first value of col1, and col1/col2 are renamed appropriately. Then I filter to only the rows where city doesn't equal col1 (now renamed type) and remove the grouping variable g.
library(data.table)
setDT(df)
meas <- c("Diesel", "Gasoline", "LPG", "Electric")
# g increments each time col1 is not in meas, i.e. at every city row
df[, .(city = first(col1), type = col1, value = col2),
   by = .(g = cumsum(!col1 %in% meas))
   ][city != type, -'g']
# city type value
# 1: Seattle Diesel 80
# 2: Seattle Gasoline NA
# 3: Seattle LPG 10
# 4: Seattle Electric 10
# 5: Boston Diesel 65
# 6: Boston Gasoline 25
# 7: Boston Electric 10
Assuming you have a finite list of measures (diesel, electric, etc.), you can make a list to check against. Any value of col1 not in that set of measures is presumably a city. Extract those (note that col1 is currently a factor, so I used as.character), fill down, and remove any heading rows.
library(dplyr)
meas <- c("Diesel", "Gasoline", "LPG", "Electric")
df %>%
  mutate(city = ifelse(!col1 %in% meas, as.character(col1), NA)) %>%
  tidyr::fill(city) %>%
  filter(col1 != city)
#> col1 col2 city
#> 1 Diesel 80 Seattle
#> 2 Gasoline NA Seattle
#> 3 LPG 10 Seattle
#> 4 Electric 10 Seattle
#> 5 Diesel 65 Boston
#> 6 Gasoline 25 Boston
#> 7 Electric 10 Boston