How to split a string into different variables?

Question

I'm trying to analyze a large data set of Airbnb listings. The amenities column lists the amenities that each listing has.

For example,

{"Wireless Internet","Air conditioning",Kitchen,Heating,"Fire extinguisher",Essentials,Shampoo,Hangers}

and

{TV,"Wireless Internet","Air conditioning",Kitchen,"Elevator in building",Heating,"Suitable for events","Smoke detector","Carbon monoxide detector","First aid kit",Essentials,Shampoo,"Lock on bedroom door",Hangers,"Hair dryer",Iron,"Laptop friendly workspace","translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50"}

I have two questions:

1. I would like to split the string into separate columns, e.g. a column titled TV. If the string contains TV, the entry in the corresponding cell should be 1, and 0 otherwise. How can I do this?
2. How can I remove the entries that contain translation missing: ...?

Answer 1:

I believe this would be a fast solution to the problem:

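library(data.table)

# convert df to a data.table by reference
setDT(df)

# assumes df is in long format, with one row per listing_id/amenity pair;
# counting the pairs gives 1 if the listing has the amenity and 0 otherwise
dcast(df, listing_id ~ amenities, fun.aggregate = length, value.var = "listing_id")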
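
Note that dcast() only produces one column per amenity if the data is already in long format, i.e. one row per listing/amenity pair rather than one string per listing. A minimal base R sketch of that shape, using made-up toy values and table() as a stand-in for the count that dcast() would compute:

```r
# Toy long-format data (hypothetical values): one row per listing/amenity pair
long <- data.frame(
  listing_id = c(1, 1, 2, 2, 2),
  amenities  = c("Kitchen", "Heating", "TV", "Kitchen", "Heating"),
  stringsAsFactors = FALSE
)

# Cross-tabulate: rows are listings, columns are amenities, cells are counts
wide <- as.data.frame.matrix(table(long$listing_id, long$amenities))
wide
```

Listing 1 has no TV row in the toy data, so its TV cell is 0.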

Answer 2:

Is it the Boston Airbnb Open Data from Kaggle?
Here is one way. It is not exactly pretty, but it seems to work.

The idea is to remove { and }, then use read_csv() to parse each string as a one-line CSV header.

Then list the unique amenities and make a column for each:

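library(dplyr)
library(readr)

# read the listings file (adjust the path to your copy of the data)
listings <- read_csv(file = "../data/boston-airbnb-open-data/listings.csv")

# strip the braces and append a newline so that read_csv() treats each string
# as literal one-line CSV data; the parsed header holds the amenity names
parsed_amenities <-
  listings %>% 
  .$amenities %>% 
  sub("^\\{(.*)\\}$", "\\1\n", x = .) %>% 
  lapply(function(x) names(read_csv(x)))

# one logical column per unique amenity, dropping "translation missing" entries
df <-
  unique(unlist(parsed_amenities)) %>% 
  .[!grepl("translation missing", .)] %>% 
  setNames(., .) %>% 
  lapply(function(x) vapply(parsed_amenities, "%in%", logical(1), x = x)) %>% 
  as_tibble()  # replaces the deprecated as_data_frame()
df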

# # A tibble: 3,585 × 43
#       TV `Wireless Internet` Kitchen `Free Parking on Premises` `Pets live on this property` `Dog(s)` Heating
#    <lgl>               <lgl>   <lgl>                      <lgl>                        <lgl>    <lgl>   <lgl>
# 1   TRUE                TRUE    TRUE                       TRUE                         TRUE     TRUE    TRUE
# 2   TRUE                TRUE    TRUE                      FALSE                         TRUE     TRUE    TRUE
# 3   TRUE                TRUE    TRUE                       TRUE                        FALSE    FALSE    TRUE
# 4   TRUE                TRUE    TRUE                       TRUE                        FALSE    FALSE    TRUE
# 5  FALSE                TRUE    TRUE                      FALSE                        FALSE    FALSE    TRUE
# 6  FALSE                TRUE    TRUE                       TRUE                         TRUE    FALSE    TRUE
# 7   TRUE                TRUE    TRUE                       TRUE                        FALSE    FALSE    TRUE
# 8   TRUE                TRUE   FALSE                       TRUE                         TRUE     TRUE    TRUE
# 9  FALSE                TRUE   FALSE                      FALSE                         TRUE    FALSE    TRUE
# 10  TRUE                TRUE    TRUE                       TRUE                        FALSE    FALSE    TRUE
# # ... with 3,575 more rows, and 36 more variables: `Family/Kid Friendly` <lgl>, Washer <lgl>, Dryer <lgl>, `Smoke
# #   Detector` <lgl>, `Fire Extinguisher` <lgl>, Essentials <lgl>, Shampoo <lgl>, `Laptop Friendly Workspace` <lgl>,
# #   Internet <lgl>, `Air Conditioning` <lgl>, `Pets Allowed` <lgl>, `Carbon Monoxide Detector` <lgl>, `Lock on Bedroom
# #   Door` <lgl>, Hangers <lgl>, `Hair Dryer` <lgl>, Iron <lgl>, `Cable TV` <lgl>, `First Aid Kit` <lgl>, `Safety
# #   Card` <lgl>, Gym <lgl>, Breakfast <lgl>, `Indoor Fireplace` <lgl>, `Cat(s)` <lgl>, `24-Hour Check-in` <lgl>, `Hot
# #   Tub` <lgl>, `Buzzer/Wireless Intercom` <lgl>, `Other pet(s)` <lgl>, `Washer / Dryer` <lgl>, `Smoking
# #   Allowed` <lgl>, `Suitable for Events` <lgl>, `Wheelchair Accessible` <lgl>, `Elevator in Building` <lgl>,
# #   Pool <lgl>, Doorman <lgl>, `Paid Parking Off Premises` <lgl>, `Free Parking on Street` <lgl>
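
For readers without the Kaggle file, the same idea can be tried on the sample strings from the question. This is only a sketch in base R: it swaps read_csv() for scan(), the second sample is abbreviated, and the names parse_one and indicators are made up for illustration.

```r
# First sample from the question, plus an abbreviated version of the second
amenities <- c(
  '{"Wireless Internet","Air conditioning",Kitchen,Heating,"Fire extinguisher",Essentials,Shampoo,Hangers}',
  '{TV,"Wireless Internet",Kitchen,"Elevator in building","translation missing: en.hosting_amenity_49"}'
)

# Strip the surrounding braces
inner <- sub("^\\{(.*)\\}$", "\\1", amenities)

# Parse one string as comma-separated, optionally quoted fields
parse_one <- function(x) {
  scan(text = x, what = character(), sep = ",", quote = "\"", quiet = TRUE)
}
parsed <- lapply(inner, parse_one)

# Unique amenity names, without the "translation missing" placeholders
all_names <- unique(unlist(parsed))
all_names <- all_names[!grepl("^translation missing", all_names)]

# One 0/1 column per amenity, one row per input string
indicators <- vapply(
  all_names,
  function(a) vapply(parsed, function(p) as.integer(a %in% p), integer(1)),
  integer(length(parsed))
)
indicators <- as.data.frame(indicators)
indicators
```

The result uses 1/0 as asked in the question rather than TRUE/FALSE; dropping as.integer() and using logical(1) instead would give logical columns like the tibble above.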

Answer 3:

Here is an approach which also uses dcast() from the data.table package, as another answer does, but which also addresses the tedious but important details of data cleaning.

library(data.table)

# read data file, returning one column
raw <- fread("AirBnB.csv", header = FALSE, sep = "\n", col.names = "amenities")
# add column with row numbers
raw[, rn := seq_len(.N)]
# remove opening and closing curly braces
raw[, amenities := stringr::str_replace_all(amenities, "^\\{|\\}$", "")]

# split amenities, thereby reshaping from wide to long format
long <- raw[, strsplit(amenities, ",", fixed = TRUE), by = rn]
# remove double quotes and leading and trailing whitespace
long[, V1 := stringr::str_trim(stringr::str_replace_all(V1, '["]', ""))]

# reshape from long to wide format, omitting rows which contain "translation missing..."
dcast(long[!V1 %like% "^translation missing"], rn ~ V1, length, value.var = "rn", fill = 0)
#   rn Air conditioning Carbon monoxide detector Elevator in building Essentials
#1:  1                1                        0                    0          1
#2:  2                1                        1                    1          1
#   Fire extinguisher First aid kit Hair dryer Hangers Heating Iron Kitchen
#1:                 1             0          0       1       1    0       1
#2:                 0             1          1       1       1    1       1
#   Laptop friendly workspace Lock on bedroom door Shampoo Smoke detector
#1:                         0                    0       1              0
#2:                         1                    1       1              1
#   Suitable for events TV Wireless Internet
#1:                   0  0                 1
#2:                   1  1                 1
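
One caveat worth checking on the full data: the strsplit(amenities, ",", fixed = TRUE) step splits on every comma, including any that sit inside a quoted amenity name. Whether the real data contains such names is not known from the samples. The record below is hypothetical and only illustrates the difference between a plain split and a quote-aware parse with scan():

```r
# Hypothetical record: the first amenity contains a comma inside the quotes
x <- '"Washer, dryer",TV,"Hair dryer"'

# Splitting on every comma cuts the quoted field in two
naive <- strsplit(x, ",", fixed = TRUE)[[1]]

# scan() honours the quotes and keeps the field whole
quoted <- scan(text = x, what = character(), sep = ",", quote = "\"", quiet = TRUE)

length(naive)   # 4 pieces
length(quoted)  # 3 fields
```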

Data file

The OP has only provided two data samples which have been copied into a data file called "AirBnB.csv":

{"Wireless Internet","Air conditioning",Kitchen,Heating,"Fire extinguisher",Essentials,Shampoo,Hangers}
{TV,"Wireless Internet","Air conditioning",Kitchen,"Elevator in building",Heating,"Suitable for events","Smoke detector","Carbon monoxide detector","First aid kit",Essentials,Shampoo,"Lock on bedroom door",Hangers,"Hair dryer",Iron,"Laptop friendly workspace","translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50"}
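
To recreate this file without copying by hand, something along these lines should work. It is a sketch only: it writes to a temporary directory, so the path variable (a name chosen here for illustration) would be passed to fread() in place of "AirBnB.csv".

```r
# The two sample records, one string per line of the file
samples <- c(
  '{"Wireless Internet","Air conditioning",Kitchen,Heating,"Fire extinguisher",Essentials,Shampoo,Hangers}',
  '{TV,"Wireless Internet","Air conditioning",Kitchen,"Elevator in building",Heating,"Suitable for events","Smoke detector","Carbon monoxide detector","First aid kit",Essentials,Shampoo,"Lock on bedroom door",Hangers,"Hair dryer",Iron,"Laptop friendly workspace","translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50"}'
)

# Write them to a file in the session's temporary directory
path <- file.path(tempdir(), "AirBnB.csv")
writeLines(samples, path)

# Read the file back to confirm that each record occupies exactly one line
check <- readLines(path)
length(check)
```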

Source: https://stackoverflow.com/questions/42907589/how-to-split-a-string-into-different-variables

Tags

data-analysis

data-cleaning