Question
I have a data set that contains lat/long information for different point locations, and I would like to know which city and state are associated with each point.
Following this example I used the revgeocode function from ggmap to obtain a street address for each location, producing the data frame below:
df <- structure(list(PointID = c(1787L, 2805L, 3025L, 3027L, 3028L,
3029L, 3030L, 3031L, 3033L), Latitude = c(38.36648102, 36.19548585,
43.419774, 43.437222, 43.454722, 43.452643, 43.411949, 43.255479,
43.261464), Longitude = c(-76.4802046, -94.21554661, -87.960399,
-88.018333, -87.974722, -87.978542, -87.94149, -87.986433, -87.968612
), Address = structure(c(2L, 8L, 5L, 3L, 9L, 7L, 4L, 1L, 6L), .Label = c("13004 N Thomas Dr, Mequon, WI 53097, USA",
"2160 Turner Rd, Lusby, MD 20657, USA", "2805 County Rd Y, Saukville, WI 53080, USA",
"3701-3739 County Hwy W, Saukville, WI 53080, USA", "3907 Echo Ln, Saukville, WI 53080, USA",
"4823 W Bonniwell Rd, Mequon, WI 53097, USA", "5100-5260 County Rd I, Saukville, WI 53080, USA",
"7948 W Gibbs Rd, Springdale, AR 72762, USA", "River Park Rd, Saukville, WI 53080, USA"
), class = "factor")), row.names = c(NA, -9L), class = "data.frame", .Names = c("PointID",
"Latitude", "Longitude", "Address"))
I would like to use R to extract the city/state information from the full street address, and create two columns to store this information ("City" and "State").
I'm assuming the stringr package is the way to go, but I'm not sure how to go about using it. The example above used the following code to extract the zip code (the address column is named "result" in that example). Their data set:
# ID Longitude Latitude result
# 1 311175 41.29844 -72.92918 16 Church Street South, New Haven, CT 06519, USA
# 2 292058 41.93694 -87.66984 1632 West Nelson Street, Chicago, IL 60657, USA
# 3 12979 37.58096 -77.47144 2077-2199 Seddon Way, Richmond, VA 23230, USA
And the code to extract the zip code:
library(stringr)
data$zipcode <- substr(str_extract(data$result," [0-9]{5}, .+"),2,6)
data[,-4]
Is it possible to easily modify the above code to get the city and state data?
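For reference, the zip-code snippet above works in two stages, which can be seen by running it on a single address. This is only a sketch: the names tail_part and zipcode are made up for illustration, and the address is copied from the sample data set.

```r
library(stringr)

# One address taken from the sample data set above
result <- "16 Church Street South, New Haven, CT 06519, USA"

# Stage 1: keep everything from the space before the five digits onward
tail_part <- str_extract(result, " [0-9]{5}, .+")

# Stage 2: drop the leading space and keep characters 2 to 6 (the five digits)
zipcode <- substr(tail_part, 2, 6)
print(zipcode)
```

Because the pattern keys on the five digits, the same approach does not transfer directly to the city and state, which have no fixed shape.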
Answer 1:
You can get the city and state using revgeocode() itself:
df <- cbind(df, do.call(rbind,
  lapply(1:nrow(df), function(i)
    # columns 3:2 pass each point as c(Longitude, Latitude)
    revgeocode(as.numeric(df[i, 3:2]), output = "more")[
      c("administrative_area_level_1", "locality")])))
df
# PointID Latitude Longitude Address
# 1 1787 38.36648 -76.48020 2160 Turner Rd, Lusby, MD 20657, USA
# 2 2805 36.19549 -94.21555 7948 W Gibbs Rd, Springdale, AR 72762, USA
# 3 3025 43.41977 -87.96040 3907 Echo Ln, Saukville, WI 53080, USA
# 4 3027 43.43722 -88.01833 2805 County Rd Y, Saukville, WI 53080, USA
# 5 3028 43.45472 -87.97472 River Park Rd, Saukville, WI 53080, USA
# 6 3029 43.45264 -87.97854 5100-5260 County Rd I, Saukville, WI 53080, USA
# 7 3030 43.41195 -87.94149 3701-3739 County Hwy W, Saukville, WI 53080, USA
# 8 3031 43.25548 -87.98643 13004 N Thomas Dr, Mequon, WI 53097, USA
# 9 3033 43.26146 -87.96861 4823 W Bonniwell Rd, Mequon, WI 53097, USA
# administrative_area_level_1 locality
# 1 Maryland Lusby
# 2 Arkansas Springdale
# 3 Wisconsin Saukville
# 4 Wisconsin Saukville
# 5 Wisconsin Saukville
# 6 Wisconsin Saukville
# 7 Wisconsin Saukville
# 8 Wisconsin Mequon
# 9 Wisconsin Mequon
P.S. You can do everything (including getting the address and/or zip code) in one step. Just add "address" and/or "postal_code" to c("administrative_area_level_1","locality"), which is the vector of variables that you want to extract.
Answer 2:
If you feel like using stringr, you can do this:
library(stringr)
library(data.table)
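# Split one address on commas and pull the pieces out by position
parse_address <- function(address) {
  address <- address %>%
    str_split(",") %>%
    .[[1]]
  # the third piece looks like " MD 20657": letters are the state, digits the zip
  state <- address %>%
    .[3] %>%
    str_replace_all("[^A-Z]", "")
  zip <- address %>%
    .[3] %>%
    str_replace_all("[^0-9]", "")
  city <- address %>%
    .[2] %>%
    str_trim()
  street <- address %>%
    .[1] %>%
    str_trim()
  data.table(street, city, state, zip)
}
# Address is a factor in df, so convert it to character before parsing
lapply(as.character(df$Address), parse_address) %>%
  rbindlist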
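If you would rather avoid the row-by-row lapply, a vectorised variant with str_match is one option. This is a minimal sketch, not part of the answer above: the pattern and the names m and parsed are my own, and it assumes every address ends in "City, ST 12345, USA" as in the question's data.

```r
library(stringr)

# Two addresses copied from the question's data frame
addresses <- c("2160 Turner Rd, Lusby, MD 20657, USA",
               "River Park Rd, Saukville, WI 53080, USA")

# Capture groups: city, two-letter state, five-digit zip.
# str_match returns a character matrix; column 1 is the full match.
m <- str_match(addresses, ", ([^,]+), ([A-Z]{2}) ([0-9]{5}), USA$")

parsed <- data.frame(City = m[, 2], State = m[, 3], Zip = m[, 4],
                     stringsAsFactors = FALSE)
print(parsed)
```

Addresses that do not fit the pattern come back as NA rather than raising an error, so it is worth checking for NA rows afterwards.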
Answer 3:
1) sub: Use sub like this. No packages needed.
The regular expression matches the start of the string (^), then the shortest string up to a comma and space (the street), then the shortest string up to another comma and space (the city, captured), then two characters (the state, captured), a space, five characters (the zip code, captured), a comma, a space, USA and the end of the string ($). The captured groups can be referenced as \1, \2 and \3, but inside a double-quoted R string each backslash must be doubled.
If your zip codes are not all 5 digits try pat <- "^.*?, (.*?), (..) (.*), USA$" instead.
pat <- "^.*?, (.*?), (..) (.....), USA$"
transform(df, City = sub(pat, "\\1", Address),
State = sub(pat, "\\2", Address),
Zip = sub(pat, "\\3", Address))
giving:
PointID Latitude Longitude Address City State Zip
1 1787 38.36648 -76.48020 2160 Turner Rd, Lusby, MD 20657, USA Lusby MD 20657
2 2805 36.19549 -94.21555 7948 W Gibbs Rd, Springdale, AR 72762, USA Springdale AR 72762
3 3025 43.41977 -87.96040 3907 Echo Ln, Saukville, WI 53080, USA Saukville WI 53080
4 3027 43.43722 -88.01833 2805 County Rd Y, Saukville, WI 53080, USA Saukville WI 53080
5 3028 43.45472 -87.97472 River Park Rd, Saukville, WI 53080, USA Saukville WI 53080
6 3029 43.45264 -87.97854 5100-5260 County Rd I, Saukville, WI 53080, USA Saukville WI 53080
7 3030 43.41195 -87.94149 3701-3739 County Hwy W, Saukville, WI 53080, USA Saukville WI 53080
8 3031 43.25548 -87.98643 13004 N Thomas Dr, Mequon, WI 53097, USA Mequon WI 53097
9 3033 43.26146 -87.96861 4823 W Bonniwell Rd, Mequon, WI 53097, USA Mequon WI 53097
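One thing to watch with sub: when the pattern does not match, it returns the input unchanged, so a non-matching address ends up in the new column in full. The sketch below compares the two patterns from this answer; the names pat_strict and pat_loose are mine, and the second address is a made-up ZIP+4 example, not part of the question's data.

```r
pat_strict <- "^.*?, (.*?), (..) (.....), USA$"  # exactly five characters for the zip
pat_loose  <- "^.*?, (.*?), (..) (.*), USA$"     # any length for the zip

x <- c("2160 Turner Rd, Lusby, MD 20657, USA",
       "16 Church Street South, New Haven, CT 06519-1234, USA")  # hypothetical ZIP+4

# Strict pattern: the second address does not match and is returned whole
zip_strict <- sub(pat_strict, "\\3", x)

# Loose pattern: both addresses match
zip_loose <- sub(pat_loose, "\\3", x)
city_loose <- sub(pat_loose, "\\1", x)
print(zip_loose)
```

Checking grepl(pat, Address) first is a simple way to spot rows that would otherwise pass through unparsed.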
2) read.pattern: Another possibility is read.pattern from the gsubfn package, with the same pat as above:
library(gsubfn)
cn <- c("City", "State", "Zip")
Address <- as.character(df$Address)
cbind(df, read.pattern(text = Address, pattern = pat, as.is = TRUE, col.names = cn))
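If installing gsubfn is not an option, base R's utils::strcapture (available since R 3.4.0) does a similar job of turning capture groups into columns. This is a swapped-in alternative rather than part of the answer above; the names proto and parts are illustrative, and perl = TRUE is my choice for handling the lazy quantifiers.

```r
# Two addresses copied from the question's data frame
Address <- c("2160 Turner Rd, Lusby, MD 20657, USA",
             "7948 W Gibbs Rd, Springdale, AR 72762, USA")

pat <- "^.*?, (.*?), (..) (.....), USA$"

# proto gives the name and type of the column built from each capture group
proto <- data.frame(City = character(), State = character(),
                    Zip = character(), stringsAsFactors = FALSE)

parts <- strcapture(pat, Address, proto, perl = TRUE)
print(parts)
```

The result can be bound to the original data with cbind(df, parts) once Address covers every row of df.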
Source: https://stackoverflow.com/questions/45723974/extracting-city-and-state-information-from-a-google-street-address