Question
I'm trying to analyze a large data set of Airbnb listings. The amenities column lists the amenities that each listing has.
For example,
{"Wireless Internet","Air conditioning",Kitchen,Heating,"Fire extinguisher",Essentials,Shampoo,Hangers}
and
{TV,"Wireless Internet","Air conditioning",Kitchen,"Elevator in building",Heating,"Suitable for events","Smoke detector","Carbon monoxide detector","First aid kit",Essentials,Shampoo,"Lock on bedroom door",Hangers,"Hair dryer",Iron,"Laptop friendly workspace","translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50"}
I have two questions:
1. I would like to split the string into separate columns, e.g. there will be a column titled TV. If the string contains TV, the entry in the corresponding cell will be 1, and 0 otherwise. How can I do this?
2. How can I remove the variables that contain translation missing: ...?
Answer 1:
I believe this would be a fast solution to the problem:
library(data.table)
setDT(df)
dcast(df, listing_id~amenities)
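Since dcast() reshapes long data, the amenities string most likely has to be split into one row per listing and amenity before the call above gives one column per amenity. The following is only a minimal sketch of that step. The toy data and the names listing_id, amenity, long and wide are assumptions made for illustration, not taken from the question's data set.

```r
library(data.table)

# Hypothetical toy data; the column names listing_id and amenities are assumed
df <- data.table(
  listing_id = c(101L, 102L),
  amenities  = c('{"Wireless Internet",Kitchen,Heating}',
                 '{TV,"Wireless Internet","translation missing: en.hosting_amenity_49"}')
)

# Strip braces and double quotes, then split into one row per listing/amenity pair
long <- df[, .(amenity = unlist(strsplit(gsub('[{}"]', "", amenities), ",", fixed = TRUE))),
           by = listing_id]

# Drop the "translation missing" placeholders
long <- long[!grepl("^translation missing", amenity)]

# Count occurrences per listing and amenity; absent pairs are filled with 0
wide <- dcast(long, listing_id ~ amenity,
              fun.aggregate = length, value.var = "amenity", fill = 0L)
wide
```

Note that this plain split on "," would cut an amenity in two if its quoted name ever contained a comma; none of the sample strings in the question do.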
Answer 2:
Is it the Boston Airbnb Open Data from Kaggle?
Here is one way. Not exactly pretty but seems to work:
The idea is to remove { and }, then use read_csv() to parse the strings.
Then, list the unique amenities and make a column for each:
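library(dplyr)
library(readr)
# read the listings file (path as used in the original answer)
listings <- read_csv(file = "../data/boston-airbnb-open-data/listings.csv")
# strip the braces and append a newline so read_csv() treats each string as
# literal data; the parsed header row then holds the amenity names
parsed_amenities <-
  listings %>%
  .$amenities %>%
  sub("^\\{(.*)\\}$", "\\1\n", x = .) %>%
  lapply(function(x) names(read_csv(x)))
# one logical column per unique amenity, skipping "translation missing" entries
df <-
  unique(unlist(parsed_amenities)) %>%
  .[!grepl("translation missing", .)] %>%
  setNames(., .) %>%
  lapply(function(x) vapply(parsed_amenities, "%in%", logical(1), x = x)) %>%
  as_tibble()
df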
# # A tibble: 3,585 × 43
# TV `Wireless Internet` Kitchen `Free Parking on Premises` `Pets live on this property` `Dog(s)` Heating
# <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
# 1 TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# 2 TRUE TRUE TRUE FALSE TRUE TRUE TRUE
# 3 TRUE TRUE TRUE TRUE FALSE FALSE TRUE
# 4 TRUE TRUE TRUE TRUE FALSE FALSE TRUE
# 5 FALSE TRUE TRUE FALSE FALSE FALSE TRUE
# 6 FALSE TRUE TRUE TRUE TRUE FALSE TRUE
# 7 TRUE TRUE TRUE TRUE FALSE FALSE TRUE
# 8 TRUE TRUE FALSE TRUE TRUE TRUE TRUE
# 9 FALSE TRUE FALSE FALSE TRUE FALSE TRUE
# 10 TRUE TRUE TRUE TRUE FALSE FALSE TRUE
# # ... with 3,575 more rows, and 36 more variables: `Family/Kid Friendly` <lgl>, Washer <lgl>, Dryer <lgl>, `Smoke
# # Detector` <lgl>, `Fire Extinguisher` <lgl>, Essentials <lgl>, Shampoo <lgl>, `Laptop Friendly Workspace` <lgl>,
# # Internet <lgl>, `Air Conditioning` <lgl>, `Pets Allowed` <lgl>, `Carbon Monoxide Detector` <lgl>, `Lock on Bedroom
# # Door` <lgl>, Hangers <lgl>, `Hair Dryer` <lgl>, Iron <lgl>, `Cable TV` <lgl>, `First Aid Kit` <lgl>, `Safety
# # Card` <lgl>, Gym <lgl>, Breakfast <lgl>, `Indoor Fireplace` <lgl>, `Cat(s)` <lgl>, `24-Hour Check-in` <lgl>, `Hot
# # Tub` <lgl>, `Buzzer/Wireless Intercom` <lgl>, `Other pet(s)` <lgl>, `Washer / Dryer` <lgl>, `Smoking
# # Allowed` <lgl>, `Suitable for Events` <lgl>, `Wheelchair Accessible` <lgl>, `Elevator in Building` <lgl>,
# # Pool <lgl>, Doorman <lgl>, `Paid Parking Off Premises` <lgl>, `Free Parking on Street` <lgl>
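The same idea can be tried on shortened versions of the two sample strings from the question without reading the full file. This is only a sketch in base R: it swaps read_csv() for scan(), which also splits on commas while respecting double quotes, and it returns 1/0 instead of TRUE/FALSE, as the question asked. The names parse_one, all_amenities and indicators are made up for illustration.

```r
# Shortened versions of the two sample strings from the question
amenities <- c(
  '{"Wireless Internet","Air conditioning",Kitchen,Heating}',
  '{TV,"Wireless Internet",Kitchen,"translation missing: en.hosting_amenity_49"}'
)

# Remove the surrounding braces, then split on commas while respecting quotes
parse_one <- function(x) {
  inner <- sub("^\\{(.*)\\}$", "\\1", x)
  scan(text = inner, what = character(), sep = ",", quote = '"', quiet = TRUE)
}
parsed <- lapply(amenities, parse_one)

# Unique amenity names, without the "translation missing" placeholders
all_amenities <- unique(unlist(parsed))
all_amenities <- all_amenities[!grepl("translation missing", all_amenities)]

# One integer 0/1 column per amenity, one row per listing
indicators <- vapply(
  all_amenities,
  function(a) vapply(parsed, function(p) as.integer(a %in% p), integer(1)),
  integer(length(parsed))
)
indicators
```

The result is an integer matrix whose column names are the amenity names; as.data.frame(indicators) would turn it into a data frame that can be bound to the original listings.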
Answer 3:
Here is an approach which also uses dcast() from the data.table package, as in this answer, but which also addresses the tedious but important details of data cleaning.
library(data.table)
# read data file, returning one column
raw <- fread("AirBnB.csv", header = FALSE, sep = "\n", col.names = "amenities")
# add column with row numbers
raw[, rn := seq_len(.N)]
# remove opening and closing curly braces
raw[, amenities := stringr::str_replace_all(amenities, "^\\{|\\}$", "")]
# split amenities, thereby reshaping from wide to long format
long <- raw[, strsplit(amenities, ",", fixed = TRUE), by = rn]
# remove double quotes and leading and trailing whitespace
long[, V1 := stringr::str_trim(stringr::str_replace_all(V1, '["]', ""))]
# reshape from long to wide format, omitting rows which contain "translation missing..."
dcast(long[!V1 %like% "^translation missing"], rn ~ V1, length, value.var = "rn", fill = 0)
# rn Air conditioning Carbon monoxide detector Elevator in building Essentials
#1: 1 1 0 0 1
#2: 2 1 1 1 1
# Fire extinguisher First aid kit Hair dryer Hangers Heating Iron Kitchen
#1: 1 0 0 1 1 0 1
#2: 0 1 1 1 1 1 1
# Laptop friendly workspace Lock on bedroom door Shampoo Smoke detector
#1: 0 0 1 0
#2: 1 1 1 1
# Suitable for events TV Wireless Internet
#1: 0 0 1
#2: 1 1 1
Data file
The OP has only provided two data samples, which have been copied into a data file called "AirBnB.csv":
{"Wireless Internet","Air conditioning",Kitchen,Heating,"Fire extinguisher",Essentials,Shampoo,Hangers}
{TV,"Wireless Internet","Air conditioning",Kitchen,"Elevator in building",Heating,"Suitable for events","Smoke detector","Carbon monoxide detector","First aid kit",Essentials,Shampoo,"Lock on bedroom door",Hangers,"Hair dryer",Iron,"Laptop friendly workspace","translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50"}
Source: https://stackoverflow.com/questions/42907589/how-to-split-a-string-into-different-variables