Question
I have a data set with medicine names in a single column, and I am trying to extract the name, strength, and unit of each medicine from it. The terms MG and ML mark the unit of the strength. For example, consider the following medicine names:
Medicine name
----------------------
FALCAN 150 MG tab
AUGMENTIN 500MG tab
PRE-13 0.5 ML PFS inj
NS.9%w/v 250 ML, Glass Bottle
I want to extract the following columns from this data set:
Name | Strength |Unit
---------| ---------|------
FALCAN | 150 |MG
AUGMENTIN| 500 |MG
PRE-13 | 0.5 |ML
NS.9%w/v | 250 |ML
I have tried grepl and similar commands but could not find a good solution. I have over 12,000 entries to process. The data does not follow a fixed pattern, and in a few places there is no space between the strength and MG, as in 300MG.
Answer 1:
You can achieve this with multiple regular expressions. Although I am not a regex champion, I use them for the same purpose as you describe here.
meds <- c('FALCAN 150 MG tab',
'AUGMENTIN 500MG tab',
'PRE-13 0.5 ML PFS inj',
'NS.9%w/v 250 ML, Glass Bottle')
library(stringr)
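# Name: everything before the last space that is followed by three digit/dot characters
trimws(str_extract(str_extract(meds, '.* [0-9.]{3}'),'.* '))
# Strength: three digit/dot characters followed by MG or ML, with or without a space
str_extract(str_extract(meds, '[0-9.]{3}( M|M)[GL]'),'[0-9.]*')
# Unit: MG or ML preceded by a space or a digit
str_extract(str_extract(meds, '( M|[0-9]M)[GL]'), 'M[GL]')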
I know that these medicine notations can vary a lot, so I prefer to extract each item with its own regular expression, in contrast to the solution presented by G. Grothendieck, which expects a fixed structure in the data (3 columns).
That way I can tweak each pattern by inspecting all the strings that generate NA values.
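The NA-inspection workflow described above can be sketched as follows. This is a minimal base-R sketch under stated assumptions, not the code from this answer: the single combined pattern, the names pattern, grab, parsed and unmatched, and the extra 'PARACETAMOL syrup' entry are all added here for illustration.

```r
# Sample data from the question plus one hypothetical entry without a strength
meds <- c('FALCAN 150 MG tab',
          'AUGMENTIN 500MG tab',
          'PRE-13 0.5 ML PFS inj',
          'NS.9%w/v 250 ML, Glass Bottle',
          'PARACETAMOL syrup')   # hypothetical entry, not in the original data

# Assumed pattern: name, whitespace, number, optional whitespace, MG or ML
pattern <- "^\\s*(.*?)\\s+([0-9]*\\.?[0-9]+)\\s*(M[GL])\\b.*$"
ok <- grepl(pattern, meds, perl = TRUE)

# Pull out one capture group, leaving NA where the pattern did not match
grab <- function(group) {
  ifelse(ok, sub(pattern, group, meds, perl = TRUE), NA_character_)
}

parsed <- data.frame(Name = grab("\\1"),
                     Strength = as.numeric(grab("\\2")),
                     Unit = grab("\\3"),
                     stringsAsFactors = FALSE)

# Strings to inspect by hand before tweaking the pattern
unmatched <- meds[!ok]
print(parsed)
print(unmatched)
```

The pattern is case-sensitive, so entries written as "mg" or "ml" would land in unmatched; whether to add ignore.case = TRUE depends on how the real data looks.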
Answer 2:
If the input L is as given reproducibly in the Note at the end, then use sub to replace MG or ML and everything after it with a space followed by MG or ML, and then read the result using read.table:
s <- sub("(M[GL]).*", " \\1", L)
read.table(text = s, as.is = TRUE, skip = 1, col.names = c("Name", "Strength", "Unit"))
giving:
Name Strength Unit
1 FALCAN 150.0 MG
2 AUGMENTIN 500.0 MG
3 PRE-13 0.5 ML
4 NS.9%w/v 250.0 ML
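Because read.table needs exactly three whitespace-separated fields per line here, it may help to check the field count before reading. The following is a minimal sketch, assuming some names could contain spaces; the ' VITAMIN C 500MG tab' entry and the names L2, n_fields, bad and good are hypothetical additions, not part of the original input.

```r
# Input L as defined in this answer's Note, plus one hypothetical entry
# whose name contains a space
L2 <- c("Medicine name", " FALCAN 150 MG tab", " AUGMENTIN 500MG tab",
        " PRE-13 0.5 ML PFS inj", " NS.9%w/v 250 ML, Glass Bottle",
        " VITAMIN C 500MG tab")   # hypothetical entry

# Same substitution as in the answer, then drop the header line
s <- sub("(M[GL]).*", " \\1", L2)[-1]

# Count whitespace-separated fields in each line
n_fields <- lengths(strsplit(trimws(s), "\\s+"))

# Lines that do not fit the three-column layout and need manual review
bad <- s[n_fields != 3]

# Read only the lines that do fit
good <- read.table(text = s[n_fields == 3], as.is = TRUE,
                   col.names = c("Name", "Strength", "Unit"))
print(bad)
print(good)
```

Lines collected in bad can then be handled separately instead of causing read.table to stop with an error.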
Note: The input L used is:
L <- c("Medicine name", " FALCAN 150 MG tab", " AUGMENTIN 500MG tab",
" PRE-13 0.5 ML PFS inj", " NS.9%w/v 250 ML, Glass Bottle")
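Answer 3:
A <- trimws(strsplit('FALCAN 150 MG tab
AUGMENTIN 500MG tab
PRE-13 0.5 ML PFS inj
NS.9%w/v 250 ML, Glass Bottle', "\n")[[1]])
# Split each entry on spaces and build one data-frame row per medicine
plyr::ldply(strsplit(A, " "), function(i) {
  # Drop trailing punctuation from each token (e.g. "ML," becomes "ML")
  new <- gsub("[[:punct:]]$", "", i)
  # The unit is the token of exactly two capital letters, optionally prefixed by digits (e.g. "500MG")
  Unit <- gsub("[0-9]", "", new[grep("^([0-9]{1,})?[A-Z]{2}$", new)])
  data.frame(
    Name = i[[1]], Strength = gsub("[A-Za-z]", "", i[[2]]), Unit = Unit,
    stringsAsFactors = FALSE
  )
})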
Name Strength Unit
1 FALCAN 150 MG
2 AUGMENTIN 500 MG
3 PRE-13 0.5 ML
4 NS.9%w/v 250 ML
Source: https://stackoverflow.com/questions/42675865/extracting-specific-data-from-text-column-in-r