Question
I've got a list of 50K+ part numbers. I need to group them by their Product Type. Part numbers of the same type are typically near each other in sequence, although they're not perfectly sequential. Product descriptions within a type are always similar, but they do not follow consistent rules. Let me illustrate with the following table.
| PartNo | Description | ProductType |
|--------|-------------|-------------|
|A000443 |Water Bottle | Water |
|A000445 |Contain Water| Water |
|A000448 |WaterBotHold | Water |
|HRZ55 |Hershey_Bar | Energy Bar |
|RRB55 |Candy Energy | Energy Bar |
|QMU55 |Bar Protein | Energy Bar |
I do not know the Product Types beforehand. The stringr regular expression has to be smart enough to generate a product type from the part's description. I'm a rookie just making my way through R for Data Science, and this seems achievable, although difficult.
How would you go about even starting this problem? What I'm actually working with is shown below. The expectation is that my stringr syntax will populate the ProductType column.
| PartNo | Description | ProductType |
|--------|-------------|-------------|
|A000443 |Water Bottle | |
|A000445 |Contain Water| |
|A000448 |WaterBotHold | |
|HRZ55 |Hershey_Bar | |
|RRB55 |Candy Energy | |
|QMU55 |Bar Protein | |
Here's the reproducible example to get the ball rolling.
library(tidyverse)
library(stringr)
df <- tribble(
~PartNo, ~Description, ~ProductType,
"A000443", "Water Bottle", "",
"A000445", "Contain Water", "",
"A000448", "WaterBotHold", "",
"HRZ55", "Hershey_Bar", "",
"RRB55", "Candy Energy", "",
"QMU55", "Bar Protein", ""
)
Answer 1:
You can try `stringr::str_extract`. It works with multiple words when they are separated by `|` in the pattern.
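As a minimal sketch of that idea, assuming the keywords were already known: the `keywords` vector below is a hypothetical list chosen for illustration, not something given in the question. It joins the words with `|` and extracts the first match from each lower-cased description.

```r
library(stringr)

# Hypothetical keyword list, assumed to be known in advance for this sketch
keywords <- c("water", "bar")
pattern <- paste0(keywords, collapse = "|")  # builds "water|bar"

descriptions <- c("Water Bottle", "Contain Water", "WaterBotHold",
                  "Hershey_Bar", "Candy Energy", "Bar Protein")

# Lower-case first so matching ignores case; str_extract() returns the
# first match in each string, or NA when no keyword is found
product_type <- str_extract(tolower(descriptions), pattern)
product_type
```

Descriptions that contain none of the keywords, such as "Candy Energy", come back as `NA`, so they would still need a keyword of their own.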
Updated:
OP pointed out that the words to look up as `ProductType` are not known in advance and should be chosen based on the frequency of the different words in the `Description` column.
One option is to use the `qdap` package to find the frequency of each word and select the top `n` (say 2) words, which then determine the product type. The solution looks like this:
library(stringr)
library(qdap)
# Find the frequency of each word in the descriptions
freq <- freq_terms(df$Description)
# Select the top n words (here n = 2) and join them into a regex pattern
word_to_search <- paste0(freq$WORD[1:2],collapse = "|")
df$ProductType <- str_extract(tolower(df$Description), word_to_search)
df
# PartNo Description ProductType
# 1 A000443 Water Bottle water
# 2 A000445 Contain Water water
# 3 A000448 WaterBotHold water
# 4 HRZ55 Hershey_Bar bar
# 5 RRB55 Candy Energy <NA> #Didn't match with Water/Bar
# 6 QMU55 Bar Protein bar
Data:
df <- read.table(text =
"PartNo Description
A000443 'Water Bottle'
A000445 'Contain Water'
A000448 WaterBotHold
HRZ55 Hershey_Bar
RRB55 'Candy Energy'
QMU55 'Bar Protein'",
stringsAsFactors = FALSE, header = TRUE)
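If installing `qdap` is not an option, the same frequency idea can be sketched with `stringr` and base R alone. This is a swapped-in alternative rather than the method above: it assumes that splitting on non-letter characters is a reasonable tokenizer for these descriptions, and the names `tokens`, `top_words`, and `n_top` are illustrative only.

```r
library(stringr)

descriptions <- c("Water Bottle", "Contain Water", "WaterBotHold",
                  "Hershey_Bar", "Candy Energy", "Bar Protein")

# Split on any run of non-letters, so "Hershey_Bar" gives "hershey" and "bar"
tokens <- unlist(str_split(tolower(descriptions), "[^a-z]+"))
tokens <- tokens[tokens != ""]  # drop empty strings left by the split

# Count each token and sort with the most frequent first
freq <- sort(table(tokens), decreasing = TRUE)

# Assumption: the top n_top words are good enough to act as product types
n_top <- 2
top_words <- names(freq)[seq_len(n_top)]
pattern <- paste0(top_words, collapse = "|")

product_type <- str_extract(tolower(descriptions), pattern)
data.frame(Description = descriptions, ProductType = product_type)
```

As with the `qdap` version, "Candy Energy" is left as `NA` because neither of the two most frequent words appears in it. On 50K+ rows, `n_top` would need to be much larger, and words with tied counts are ordered arbitrarily, so the resulting list is worth reviewing by hand.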
Source: https://stackoverflow.com/questions/50281816/r-stringr-regexp-strategy-for-grouping-like-expressions-without-prior-knowledge