Regular expression to select a sentence with particular length

I need to extract a sentence containing a particular word from a block of text. This is what I have so far:

[A-Z][^\\.;\\?\\!]*(word)[^\\.;\\?\\!]*

2 Answers

盖世英雄少女心

2021-01-29 11:15

I assume you are parsing English text.

You can use an NLP library to split the text into sentences, and then keep only those that contain the word and have the required length. I used an excerpt from Wikipedia's Ernest Hemingway biography, searched for the word "1940", and then applied a second grep to keep only the sentences within the length limits.

> require(tm)
> require(openNLP)
> text <- as.String("Ernest Hemingway wrote For Whom the Bell Tolls in Havana, Cuba; Key West, Florida; and Sun Valley, Idaho in 1939. In Cuba, he lived in the Hotel Ambos-Mundos where he worked on the manuscript. The novel was finished in July 1940 and published in October.It is based on Hemingway's experiences during the Spanish Civil War and features an American protagonist, named Robert Jordan, who fights with Spanish soldiers for the Republicans.[8] The characters in the novel include those who are purely fictional, those based on real people but fictionalized, and those who were actual figures in the war. Set in the Sierra de Guadarrama mountain range between Madrid and Segovia, the action takes place during four days and three nights. For Whom the Bell Tolls became a Book of the Month Club choice, sold half a million copies within months, was nominated for a Pulitzer Prize, and became a literary triumph for Hemingway. Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75.")
> sentence_token_annotator <- Maxent_Sent_Token_Annotator()
> sentence.boundaries <- annotate(text, sentence_token_annotator)
> sentences <- text[sentence.boundaries]
> sentences
[1] "Ernest Hemingway wrote For Whom the Bell Tolls in Havana, Cuba; Key West, Florida; and Sun Valley, Idaho in 1939."                                                                                                                                   
[2] "In Cuba, he lived in the Hotel Ambos-Mundos where he worked on the manuscript."                                                                                                                                                                      
[3] "The novel was finished in July 1940 and published in October.It is based on Hemingway's experiences during the Spanish Civil War and features an American protagonist, named Robert Jordan, who fights with Spanish soldiers for the Republicans.[8]"
[4] "The characters in the novel include those who are purely fictional, those based on real people but fictionalized, and those who were actual figures in the war."                                                                                     
[5] "Set in the Sierra de Guadarrama mountain range between Madrid and Segovia, the action takes place during four days and three nights."                                                                                                                
[6] "For Whom the Bell Tolls became a Book of the Month Club choice, sold half a million copies within months, was nominated for a Pulitzer Prize, and became a literary triumph for Hemingway."                                                          
[7] "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75."                                                                                                                                                        
> with_word = grep("1940", sentences, fixed = TRUE, value = TRUE)
> with_word
[1] "The novel was finished in July 1940 and published in October.It is based on Hemingway's experiences during the Spanish Civil War and features an American protagonist, named Robert Jordan, who fights with Spanish soldiers for the Republicans.[8]"
[2] "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75."                                                                                                                                                        
> with_word[grep("^.{30,100}$", with_word)]
[1] "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75."

In your case, use your own word and the {30,250} limiting quantifier to get just the sentences you need.
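If you would rather not express the length limit as a regex at all, the same two-step filter can be sketched in base R with `grepl()` and `nchar()`. This is only a minimal sketch: `filter_sentences` and the sample vector are hypothetical names made up for illustration, and in practice `sentences` would come from the sentence splitter above.

```r
# Hypothetical sample sentences; in the answer these come from openNLP.
sentences <- c(
  "Too short 1940.",
  "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75.",
  "In Cuba, he lived in the Hotel Ambos-Mundos where he worked on the manuscript."
)

# Keep sentences that contain the word literally (fixed = TRUE, so no regex
# escaping is needed) and whose length lies between min_len and max_len.
filter_sentences <- function(x, word, min_len = 30, max_len = 250) {
  has_word <- grepl(word, x, fixed = TRUE)
  len_ok <- nchar(x) >= min_len & nchar(x) <= max_len
  x[has_word & len_ok]
}

result <- filter_sentences(sentences, "1940", min_len = 30, max_len = 100)
print(result)
```

Because the word is matched with `fixed = TRUE`, characters such as `.` or `$` in the word are taken literally.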

Note that you can grep the sentences you need in one operation, but this requires a more complex PCRE regex with a lookahead:

> my_sent <- grep("(?s)^(?=.{30,100}$).*1940.*$", sentences, value = TRUE, perl = TRUE)
> my_sent
[1] "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75."

The "(?s)^(?=.{30,100}$).*1940.*$" regex requires the string to be 30 to 100 characters long from start to end (set your own limits) and to contain 1940. The leading ^ anchors the lookahead at the start of the string, so the length check covers the whole sentence. If your word contains special regex metacharacters, they must be escaped with \\.

Just tested with your data:

> with_word = grep("(?s)^(?=.{30,250}$).*\\bhosted\\b.*$", sentences, perl = TRUE, value = TRUE)
> with_word
[1] "proudly hosted by Media Temple!"
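When the word is not known in advance, the escaping mentioned above can be done programmatically before building the one-operation pattern. The following is a rough sketch, not a definitive implementation: `escape_regex` and `length_word_pattern` are hypothetical helper names, and the sample sentences are made up.

```r
# Hypothetical helper: backslash-escape PCRE metacharacters in a literal word.
escape_regex <- function(word) {
  gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", word, perl = TRUE)
}

# Hypothetical helper: build the lookahead pattern for a word and length range.
length_word_pattern <- function(word, min_len = 30, max_len = 250) {
  paste0("(?s)^(?=.{", min_len, ",", max_len, "}$).*", escape_regex(word), ".*$")
}

sentences <- c(
  "Priced at $2.75.",
  "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75.",
  "It cost 2475 dollars in total, which was a lot of money back then."
)

# Without escaping, the "." in "2.75" would also match the "4" in "2475".
pattern <- length_word_pattern("2.75", min_len = 30, max_len = 100)
hits <- grep(pattern, sentences, perl = TRUE, value = TRUE)
print(hits)
```

The sketch leaves out the `\\b` word boundaries used above, since they only behave as expected when the word starts and ends with a word character.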


灰色年华

2021-01-29 11:19
You can use a positive lookahead, anchored at both ends so that the upper limit is enforced as well:
```
^(?=[\p{Any}]{30,250}$)
```
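A lookahead of this shape limits the total length only when it is anchored with `^` and `$`. Without the anchors it merely checks that at least 30 characters are available somewhere in the string. Here is a quick sketch of the difference in R with `perl = TRUE`. It assumes `(?s).` as a stand-in for `[\p{Any}]`, and the test strings are made up.

```r
x <- c(
  strrep("a", 10),   # too short
  strrep("a", 100),  # within 30 to 250 characters
  strrep("a", 300)   # too long
)

# Unanchored: succeeds as soon as 30 characters are available somewhere,
# so the 300-character string passes too.
unanchored <- grepl("(?s)(?=.{30,250})", x, perl = TRUE)

# Anchored at both ends: the whole string must be 30 to 250 characters long.
anchored <- grepl("(?s)^(?=.{30,250}$)", x, perl = TRUE)

print(unanchored)
print(anchored)
```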