Extracting specific data from text column in R

Submitted by 点点圈 on 2019-12-13 07:02:30

Question


I have a data set with medicine names in one column. I am trying to extract the name, strength, and unit of each medicine from it. The terms MG and ML mark the strength. For example, consider the following medicine names:

 Medicine name
----------------------
 FALCAN 150 MG tab
 AUGMENTIN 500MG tab
 PRE-13 0.5 ML PFS inj
 NS.9%w/v 250 ML, Glass Bottle

I want to extract the following columns from this data set:

Name     | Strength |Unit
---------| ---------|------
FALCAN   | 150      |MG
AUGMENTIN| 500      |MG
PRE-13   | 0.5      |ML
NS.9%w/v | 250      |ML

I have tried grepl and similar commands but could not find a good solution. I have more than 12,000 entries to process. The data does not follow a fixed pattern, and in a few places there is no space between the strength and MG, as in 300MG.


Answer 1:


You can achieve this with several regular expressions. Although I am not a regex champion, I use them for the same purpose as you describe here.

meds <- c('FALCAN 150 MG tab',
'AUGMENTIN 500MG tab',
'PRE-13 0.5 ML PFS inj',
'NS.9%w/v 250 ML, Glass Bottle')

library(stringr)

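# Name: everything before the last space that is followed by the strength
trimws(str_extract(str_extract(meds, '.* [0-9.]{3}'),'.* '))

# Strength: the digits/dots directly before MG or ML (with or without a space)
# Note: [0-9.]{3} assumes the strength is exactly three characters long, e.g. 150 or 0.5
str_extract(str_extract(meds, '[0-9.]{3}( M|M)[GL]'),'[0-9.]*')

# Unit: MG or ML preceded by a space or a digit
str_extract(str_extract(meds, '( M|[0-9]M)[GL]'), 'M[GL]')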

I know that these medicine notations can vary a lot, so I prefer to extract each item with its own regular expression, in contrast to the solution presented by G. Grothendieck, which expects a certain structure in the data (3 columns). That way I can tweak each expression by inspecting the strings that produce NA values.
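For strengths that are not exactly three characters long (say 50 or 1000), the same idea can be sketched with a single pattern and three capture groups in base R. This is only a rough sketch under my own assumptions: the pattern, the helper names (pattern, matches, parts, result, unparsed) and the choice of regexec/regmatches instead of stringr are not from the answer above. Rows that do not match end up as NA, so they can still be inspected and the pattern tweaked.

```r
meds <- c("FALCAN 150 MG tab",
          "AUGMENTIN 500MG tab",
          "PRE-13 0.5 ML PFS inj",
          "NS.9%w/v 250 ML, Glass Bottle")

# Assumed pattern (not from the original answer):
#   (.*?)                  name, as short as possible
#   ([0-9]+(?:\.[0-9]+)?)  strength of any length, optional decimal part
#   (M[GL])                unit, with or without a space before it
pattern <- "^\\s*(.*?)\\s+([0-9]+(?:\\.[0-9]+)?)\\s*(M[GL])\\b"

# regexec returns the full match plus one entry per capture group
matches <- regmatches(meds, regexec(pattern, meds, perl = TRUE))

# Rows that do not match get NA in every column so they can be inspected later
parts <- do.call(rbind, lapply(matches, function(x) {
  if (length(x) == 4) x[2:4] else rep(NA_character_, 3)
}))

result <- data.frame(Name     = parts[, 1],
                     Strength = as.numeric(parts[, 2]),
                     Unit     = parts[, 3],
                     stringsAsFactors = FALSE)

# Strings the pattern could not parse
unparsed <- meds[is.na(result$Name)]
```

With 12,000 entries, looking at unparsed after each change to the pattern should show fairly quickly which notations are still not covered.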




Answer 2:


If the input L is as shown reproducibly in the Note at the end, use sub to replace MG or ML and everything after it with a space followed by MG or ML, then read the result with read.table:

s <- sub("(M[GL]).*", " \\1", L)
read.table(text = s, as.is = TRUE, skip = 1, col.names = c("Name", "Strength", "Unit"))

giving:

       Name Strength Unit
1    FALCAN    150.0   MG
2 AUGMENTIN    500.0   MG
3    PRE-13      0.5   ML
4  NS.9%w/v    250.0   ML
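
One thing to watch with the sub call above: M[GL] matches the first MG or ML anywhere in the string, so a name that itself contains those letters would be cut short. Below is a minimal sketch of that case, plus a variant that only accepts the unit when it follows a digit. The AMLODIPINE entry is a made-up example and not part of the original data, and the variant pattern is my own assumption rather than part of the answer.

```r
L <- c("Medicine name", " FALCAN 150 MG tab", " AUGMENTIN 500MG tab",
       " PRE-13 0.5 ML PFS inj", " NS.9%w/v 250 ML, Glass Bottle")

# Hypothetical entry whose name contains "ML"
x <- " AMLODIPINE 5 MG tab"

# Original pattern: cuts at the "ML" inside the name
cut_short <- sub("(M[GL]).*", " \\1", x)   # " A ML"

# Assumed variant: the unit must follow a digit (optional whitespace between)
unit_after_digit <- "([0-9])\\s*(M[GL])\\b.*"
kept <- sub(unit_after_digit, "\\1 \\2", x, perl = TRUE)   # " AMLODIPINE 5 MG"

# The variant still parses the original input into three columns
s2 <- sub(unit_after_digit, "\\1 \\2", L, perl = TRUE)
df <- read.table(text = s2, as.is = TRUE, skip = 1,
                 col.names = c("Name", "Strength", "Unit"))
```

read.table still expects exactly three whitespace-separated fields per line, so names containing spaces would need a different approach.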

Note: The input L used is:

L <- c("Medicine name", " FALCAN 150 MG tab", " AUGMENTIN 500MG tab", 
" PRE-13 0.5 ML PFS inj", " NS.9%w/v 250 ML, Glass Bottle")



Answer 3:


A <- trimws(strsplit('FALCAN 150 MG tab
 AUGMENTIN 500MG tab
PRE-13 0.5 ML PFS inj
NS.9%w/v 250 ML, Glass Bottle',"\n")[[1]])

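# Split each entry on spaces and build one row per medicine
plyr::ldply(strsplit(A," "), function(i){
    # Strip trailing punctuation, e.g. "ML," becomes "ML"
    new <- gsub("[[:punct:]]$","",i)
    # The unit is the token made of optional digits plus two capital letters
    Unit <- gsub("[0-9]","",new[grep("^([0-9]{1,})?[A-Z]{2}$", new)])
    data.frame(
        Name = i[[1]], Strength = gsub("[A-Za-z]",'',i[[2]]),Unit= Unit,
        stringsAsFactors = FALSE
    )
})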

       Name Strength Unit
1    FALCAN      150   MG
2 AUGMENTIN      500   MG
3    PRE-13      0.5   ML
4  NS.9%w/v      250   ML


Source: https://stackoverflow.com/questions/42675865/extracting-specific-data-from-text-column-in-r
