R stringr regex strategy for grouping like expressions without prior knowledge

Question

I've got a list of 50K+ part numbers that I need to group by Product Type. Related part numbers are typically near each other in sequence, although they're not perfectly sequential. The product descriptions within a type are always similar, but they don't follow consistent rules. Let me illustrate with the following table.

| PartNo | Description | ProductType |
|--------|-------------|-------------|
|A000443 |Water Bottle |    Water    |
|A000445 |Contain Water|    Water    |
|A000448 |WaterBotHold |    Water    |
|HRZ55   |Hershey_Bar  | Energy Bar  |
|RRB55   |Candy Energy | Energy Bar  |
|QMU55   |Bar Protein  | Energy Bar  |

I do not know the Product Types beforehand. The stringr regular expression has to be smart enough to generate a product type from the part's description. I'm a rookie just making my way through R for Data Science, and this seems achievable, although difficult.

How would you go about even starting this problem? What I'm actually working with is shown below. The expectation is that my stringr code will populate the ProductType column.

| PartNo | Description | ProductType |
|--------|-------------|-------------|
|A000443 |Water Bottle |             |
|A000445 |Contain Water|             |
|A000448 |WaterBotHold |             |
|HRZ55   |Hershey_Bar  |             |
|RRB55   |Candy Energy |             |
|QMU55   |Bar Protein  |             |

Here's the reproducible example to get the ball rolling.

library(tidyverse)
library(stringr)
df <- tribble(
  ~PartNo, ~Description, ~ProductType, 
  "A000443", "Water Bottle", "",
  "A000445", "Contain Water", "",
  "A000448", "WaterBotHold", "",
  "HRZ55", "Hershey_Bar", "",
  "RRB55", "Candy Energy", "",
  "QMU55", "Bar Protein", ""
)

Answer 1:

You can try stringr::str_extract. It accepts a regular expression in which several candidate words are separated by |, and it returns the first match found in each string.
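As a minimal sketch of that idea, assuming the keywords were already known: the `keywords` vector below is hypothetical and was picked by hand from the sample descriptions, which is exactly the knowledge the real problem lacks.

```r
library(stringr)

descriptions <- c("Water Bottle", "Contain Water", "WaterBotHold",
                  "Hershey_Bar", "Candy Energy", "Bar Protein")

# Hypothetical keyword list (an assumption for illustration only)
keywords <- c("water", "bar")

# Join the keywords into one alternation pattern: "water|bar"
pattern <- paste(keywords, collapse = "|")

# Lower-case the text first so the match is case-insensitive;
# strings that contain none of the keywords come back as NA
product_type <- str_extract(tolower(descriptions), pattern)
product_type
```

Because the pattern is matched anywhere in the string, "WaterBotHold" and "Hershey_Bar" are picked up even though the keyword is not a separate word there.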

Updated:

The OP pointed out that the words to use as ProductType are not known in advance; they should be chosen based on how frequently each word occurs in the Description column.

One option is to use the qdap package to find the frequency of each word and pick the top n words (say, 2) to serve as product types. The solution looks like this:

library(stringr)
library(qdap)

# Find the frequency of each word in the descriptions
freq <- freq_terms(df$Description)

# Select the top `n` words (2 here) and build a regex pattern from them
word_to_search <- paste0(freq$WORD[1:2], collapse = "|")

# Lower-case the descriptions so they match the lower-case words in the pattern
df$ProductType <- str_extract(tolower(df$Description), word_to_search)
df
#    PartNo   Description ProductType
# 1 A000443  Water Bottle       water
# 2 A000445 Contain Water       water
# 3 A000448  WaterBotHold       water
# 4   HRZ55   Hershey_Bar         bar
# 5   RRB55  Candy Energy        <NA>    #Didn't match with Water/Bar
# 6   QMU55   Bar Protein         bar
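If qdap is not available, the frequency step can be approximated with stringr and base R. This is only a sketch and not what qdap does internally: it assumes words are separated by any non-letter character, and the names `tokens`, `top_words` and `pattern` are made up for the illustration.

```r
library(stringr)

descriptions <- c("Water Bottle", "Contain Water", "WaterBotHold",
                  "Hershey_Bar", "Candy Energy", "Bar Protein")

# Split on anything that is not a letter (spaces, underscores, digits)
tokens <- unlist(str_split(tolower(descriptions), "[^a-z]+"))
tokens <- tokens[tokens != ""]

# Count the tokens and sort by decreasing frequency
freq <- sort(table(tokens), decreasing = TRUE)

# Keep the top n tokens; n = 2 is an arbitrary choice, as in the answer above
top_words <- names(freq)[1:2]
pattern <- paste(top_words, collapse = "|")

# Extract the first top word found in each description (NA if none)
product_type <- str_extract(tolower(descriptions), pattern)
product_type
```

On the six sample rows, "water" and "bar" are the only tokens that occur twice, so the result matches the table above, including the NA for "Candy Energy". With 50K+ rows, n would need tuning, and ties between equally frequent tokens are resolved by whatever order sort() returns.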

Data:

df <- read.table(text = 
"PartNo  Description 
A000443 'Water Bottle' 
A000445 'Contain Water'
A000448 WaterBotHold 
HRZ55   Hershey_Bar  
RRB55   'Candy Energy' 
QMU55   'Bar Protein'",
stringsAsFactors = FALSE, header = TRUE)

Source: https://stackoverflow.com/questions/50281816/r-stringr-regexp-strategy-for-grouping-like-expressions-without-prior-knowledge

Tags

regex

stringr