Extracting a number following specific text in R

Submitted by 主宰稳场 on 2019-12-04 04:12:59

Question


I have a data frame which contains a column full of text. I need to capture the number (any number of digits, most likely 1 to 4) that follows a certain phrase, namely 'Floor Area' or 'floor area'. My data will look something like the following:

"A beautiful flat on the 3rd floor with floor area: 50 sqm and a lift"
"Newbuild flat. Floor Area: 30 sq.m" 
"6 bed house with floor area 50 sqm, lot area 25 sqm"

If I try to extract just the number, or if I look back from sqm, I will sometimes get the lot area by mistake. If someone could help me with a lookahead regex or similar in stringr, I'd appreciate it. Regex is a weak point for me. Many thanks in advance.


Answer 1:


A common technique to extract a number before or after a word is to match the whole string with sub, capture the number in a group while matching everything around it, and replace the entire match with the captured substring:

# Extract the first number after a word:
as.integer(sub(".*?<WORD_OR_PATTERN_HERE>.*?(\\d+).*", "\\1", x))

# Extract the first number before a word:
as.integer(sub(".*?(\\d+)\\s*<WORD_OR_PATTERN_HERE>.*", "\\1", x))

NOTE: Replace \\d+ with \\d+(?:\\.\\d+)? to match int or float numbers. \\s* matches 0 or more whitespace in the second sub.
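
As a rough sketch of the NOTE above: the sample vector x below is hypothetical (the original data only has whole numbers), and perl = TRUE is added as a precaution so that the non-capturing group is handled by PCRE. With a float-aware pattern, as.numeric is used instead of as.integer:

```r
# Hypothetical sample data with a decimal floor area (an assumption for illustration)
x <- c("Newbuild flat. Floor Area: 30.5 sq.m",
       "6 bed house with floor area 50 sqm, lot area 25 sqm")

# \\d+(?:\\.\\d+)? matches an integer with an optional fractional part
pattern <- "(?i).*?\\bfloor area:?\\s*(\\d+(?:\\.\\d+)?).*"

# Replace each whole string with the captured number, then convert
areas <- as.numeric(sub(pattern, "\\1", x, perl = TRUE))
print(areas)
```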

For the current scenario, a possible solution will look like

v <- c("A beautiful flat on the 3rd floor with floor area: 50 sqm and a lift","Newbuild flat. Floor Area: 30 sq.m","6 bed house with floor area 50 sqm, lot area 25 sqm")
as.integer(sub("(?i).*?\\bfloor area:?\\s*(\\d+).*", "\\1", v))
# [1] 50 30 50
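
One thing to watch with the sub approach: when the pattern does not match, sub returns the input string unchanged. A minimal sketch of a guard follows, assuming some rows may lack the phrase. The second element of v is a made-up example, and perl = TRUE is an added precaution:

```r
v <- c("Newbuild flat. Floor Area: 30 sq.m",
       "Studio flat, 12 minutes from the station")  # hypothetical row without the phrase

pattern <- "(?i).*?\\bfloor area:?\\s*(\\d+).*"

# Only run the replacement on rows where the pattern actually matches
has_area <- grepl(pattern, v, perl = TRUE)
area <- rep(NA_integer_, length(v))
area[has_area] <- as.integer(sub(pattern, "\\1", v[has_area], perl = TRUE))
print(area)
```

Rows without a floor area come back as NA, rather than going through as.integer on the untouched text.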


You may also leverage a capturing mechanism with str_match from stringr and get the second column value ([,2]):

> library(stringr)
> v <- c("A beautiful flat on the 3rd floor with floor area: 50 sqm and a lift","Newbuild flat. Floor Area: 30 sq.m","6 bed house with floor area 50 sqm, lot area 25 sqm")
> as.integer(str_match(v, "(?i)\\bfloor area:?\\s*(\\d+)")[,2])
[1] 50 30 50
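
Since the question mentions a data frame, here is a sketch of the same capture written to a new column using only base R (regexec plus regmatches, which behave much like str_match here). The data frame df and the column names description and floor_area are assumptions for illustration, and the perl argument of regexec requires R 3.3.0 or later:

```r
# Hypothetical data frame built from the sample strings in the question
df <- data.frame(
  description = c("A beautiful flat on the 3rd floor with floor area: 50 sqm and a lift",
                  "Newbuild flat. Floor Area: 30 sq.m",
                  "6 bed house with floor area 50 sqm, lot area 25 sqm"),
  stringsAsFactors = FALSE
)

# regexec returns the full match plus each capture group for every row
m <- regmatches(df$description,
                regexec("(?i)\\bfloor area:?\\s*(\\d+)", df$description, perl = TRUE))

# Element 2 is capture group 1; rows with no match yield NA
df$floor_area <- vapply(m, function(g) as.integer(g[2]), integer(1))
print(df$floor_area)
```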


The regex matches:

  • (?i) - in a case-insensitive way
  • \\bfloor area:? - a whole word (\b is a word boundary) floor area followed by an optional : (one or zero occurrence, ?)
  • \\s* - zero or more whitespace
  • (\\d+) - Group 1 (will be in [,2]) capturing one or more digits
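
To see what each part of the pattern contributes, here is a small check against a few made-up strings (the cases vector is hypothetical, not from the question). It uses grepl with perl = TRUE only to test whether the pattern matches:

```r
pattern <- "(?i)\\bfloor area:?\\s*(\\d+)"

cases <- c("floor area: 50",    # colon and a space
           "FLOOR AREA 50",     # (?i) makes the match case-insensitive
           "Floor Area:50",     # \\s* also allows zero whitespace
           "subfloor area 50",  # \\b rejects "floor" inside a longer word
           "floor area: n/a")   # (\\d+) needs at least one digit

matched <- grepl(pattern, cases, perl = TRUE)
print(matched)
```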





Answer 2:


The following regex may get you started:

[Ff]loor\s+[Aa]rea:?\s+(\d{1,4})





Answer 3:


Use the following regex with case-insensitive matching:

floor\s*area:?\s*(\d{1,4})
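
The pattern above is written at the regex level. Inside an R string literal each backslash has to be doubled, and case-insensitive matching can be requested with ignore.case = TRUE. A minimal sketch of one possible way to apply it with base R sub follows. Wrapping the pattern in ".*" on both sides and adding perl = TRUE are choices made here, not part of the answer:

```r
x <- c("A beautiful flat on the 3rd floor with floor area: 50 sqm and a lift",
       "Newbuild flat. Floor Area: 30 sq.m",
       "6 bed house with floor area 50 sqm, lot area 25 sqm")

# Same regex as above, with every backslash doubled for the R string
pattern <- "floor\\s*area:?\\s*(\\d{1,4})"

# Surround the pattern with .* so sub replaces the whole string by group 1
res <- as.integer(sub(paste0(".*", pattern, ".*"), "\\1", x,
                      ignore.case = TRUE, perl = TRUE))
print(res)
```

Note that \\d{1,4} captures at most four digits, so a longer number would be cut off after the fourth digit.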



Answer 4:


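You need a lookbehind-style regex. The \K operator discards everything matched so far, so only the digits are returned. It is a PCRE feature, so use base R with perl=TRUE:

regmatches(x, regexpr("\\b[Ff]loor [Aa]rea:?\\s*\\K\\d+", x, perl=TRUE))

or

regmatches(x, regexpr("(?i)\\bfloor area:?\\s*\\K\\d+", x, perl=TRUE))

Note that stringr's str_extract_all does not accept a perl argument, and its ICU regex engine does not support \K, so these patterns do not work there. You may also use sub: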

> sub(".*\\b[Ff]loor\\s+[Aa]rea:?\\s*(\\d+).*", "\\1", x)
[1] "50" "30" "50"



Answer 5:


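text <- "A beautiful flat on the 3rd floor with floor area: 50 sqm and a lift"

unique(na.omit(as.numeric(unlist(strsplit(unlist(text), "[^0-9]+")))))
# [1]  3 50

Note that this returns every number in the string (here the 3 from "3rd" as well as 50), so you still have to pick out the floor area yourself.

Hope this helped.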



Source: https://stackoverflow.com/questions/35931351/extracting-a-number-following-specific-text-in-r
