Extract text using regex in R

后端未结

关注

 5  1948

I read the text file with below data and am trying to convert it to a dataframe

Id:   1
ASIN: 0827229534
  title: Patterns of Preaching: A Sermon Sampler
  group


                      
              相关标签:


      
      
        
          5条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  野性不改        
                
              
                            
                2021-01-25 02:45
              
            
            
                                                                       
Using the tidyverse package:

library(tidyverse)

text <- list(readLines("https://raw.githubusercontent.com/pranavn91/PhD/master/Expt/sample.txt"))

out <- tibble(text = text)

out <- out %>%
  rowwise() %>%
  mutate(ids = str_extract(text,"Id: .+") %>% na.omit() %>% str_remove("Id: ") %>% str_c(collapse = ", "),
         ASIN = str_extract(text,"ASIN: .+") %>% na.omit() %>% str_remove("ASIN: ") %>% str_c(collapse = ", "),
         title = str_extract(text,"title: .+") %>% na.omit() %>% str_remove("title: ") %>% str_c(collapse = ", "),
         group = str_extract(text,"group: .+") %>% na.omit() %>% str_remove("group: ") %>% str_c(collapse = ", "),
         similar = str_extract(text,"similar: .+") %>% na.omit() %>% str_remove("similar: ") %>% str_c(collapse = ", "),
         rating = str_extract(text,"avg rating: .+") %>% na.omit() %>% str_remove("avg rating: ") %>% str_c(collapse = ", ")
         ) %>%
  ungroup()


I put the text in a list because I assume that you will want to create a dataframe with more than one item being looked up. If you do just add a new list item for each readLines that you do. 

Notice that mutate looks at each item in the list as an object which is equivalent to using text[[1]]...

If you have and item occur more than once you'll need to add %>% str_c(collapse = ", ") like I have done, otherwise you can remove it. 

UPDATE based on new sample data:

The new sample dataset creates some different challenges that weren't addressed in my original answer. 

First, the data is all in a single file and I had assumed it would be in multiple files. It is possible to either separate everything into a list of lists, or to separate everything into a vector of characters. I chose the second option. 

Because I chose the second option I now have to update my code to extract data until a \r is reached (Need to \\r in R because of how R handles escapes). 

Next, some of the fields are empty! Have to add a check to see if the result is empty and fix the output if it is. I'm using %>% ifelse(length(.)==0,NA,.) to accomplish this.

Note: if you add other fields such as categories: to this search the code will only capture the first line of text. It will need to be modified to capture more than one line. 

library(tidyverse)

# Read text into a single long file.
text <- read_file("https://raw.githubusercontent.com/pranavn91/PhD/master/Expt/sample.txt")

# Separate each Id: into a character string in a vector
# Use negative lookahead to capture groups that don't have Id: in them.
# Use an or to also capture any non-words that don't have Id: in them.
text <- str_extract_all(text,"Id: (((?!Id:).)|[^(Id:)])+") %>% 
  flatten()

out <- tibble(text = text)

out <- out %>%
  rowwise() %>%
  mutate(ids = str_extract(text,"Id: ((?!\\\\r).)+") %>% na.omit() %>% str_remove("Id: ") %>% str_c(collapse = ", ") %>% ifelse(length(.)==0,NA,.),
         ASIN = str_extract(text,"ASIN: ((?!\\\\r).)+") %>% na.omit() %>% str_remove("ASIN: ") %>% str_c(collapse = ", ") %>% ifelse(length(.)==0,NA,.),
         title = str_extract(text,"title: ((?!\\\\r).)+") %>% na.omit() %>% str_remove("title: ") %>% str_c(collapse = ", ") %>% ifelse(length(.)==0,NA,.),
         group = str_extract(text,"group: ((?!\\\\r).)+") %>% na.omit() %>% str_remove("group: ") %>% str_c(collapse = ", ") %>% ifelse(length(.)==0,NA,.),
         similar = str_extract(text,"similar: ((?!\\\\r).)+") %>% na.omit() %>% str_remove("similar: \\d") %>% str_c(collapse = ", ") %>% ifelse(length(.)==0,NA,.),
         rating = str_extract(text,"avg rating: ((?!\\\\r).)+") %>% na.omit() %>% str_remove("avg rating: ") %>% str_c(collapse = ", ") %>% ifelse(length(.)==0,NA,.)
  ) %>%
  ungroup()

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  走了就别回头了        
                
              
                            
                2021-01-25 02:46
              
            
            
                                                                       
EDIT: Correcting my answer.

Using stringr:

library(stringr)

ids <- str_extract(text, 'Id:[ ]*\\S+')
ASIN <- str_extract(text, 'ASIN:[ ]*\\S+')
title <- str_extract(text, 'title:[ ]*\\S+')
group <- str_extract(text, 'group:[ ]*\\S+')
similar <- str_extract(text, 'similar:[ ]*\\S+')
rating <- str_extract(text, 'avg rating:[ ]*\\S+')

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  南笙        
                
              
                            
                2021-01-25 02:47
              
            
            
                                                                       
Here is a different approach using separate_rows and spread to reformat the text file into a dataframe:

text = readLines(path_to_textfile)

library(dplyr)
library(tidyr)

data.frame(text = text) %>%
  separate_rows(text, sep = "(?<=\\d)\\s+(?=[a-z])") %>%
  extract(text, c("title", "value"), regex = "(?i)([a-z]+):(.+)") %>%
  filter(!title %in% c("reviews", "downloaded")) %>%
  group_by(title) %>%
  mutate(id = 1:n()) %>%
  spread(title, value) %>%
  select(-id)


Result:

         ASIN group   Id rating salesrank
1  0827229534  Book    1      5    396585
2    12412441  Book    2     10   4225352
                                                         similar
1  5  0804215715  156101074X  0687023955  0687074231  082721619X
2                                         1241242 1412414 124124
                                     title
1  Patterns of Preaching: A Sermon Sampler
2                                Patterns2


Data:

Id:   1
ASIN: 0827229534
  title: Patterns of Preaching: A Sermon Sampler
  group: Book
  salesrank: 396585
  similar: 5  0804215715  156101074X  0687023955  0687074231  082721619X
  reviews: total: 2  downloaded: 2  avg rating: 5
Id:   2
ASIN: 12412441
  title: Patterns2
  group: Book
  salesrank: 4225352
  similar: 1241242 1412414 124124
  reviews: total: 2  downloaded: 2  avg rating: 10


Note:

Leave an extra blank row at the end of the text file. Otherwise readLines would return an error when attempting to read in the file.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  佛祖请我去吃肉        
                
              
                            
                2021-01-25 02:47
              
            
            
                                                                       
This is just a start. Since im not a pro in regExp I will let others do the magic. :)

Either you define the rules for every object and do something like this.

ids <- do.call(rbind, regmatches(regexec(pattern = 'Id:\\s+', text = text), x = text))
ASIN <- do.call(rbind, regmatches(regexec(pattern = 'ASIN:\\s+', text = text), x = text))
title <- do.call(rbind, regmatches(regexec(pattern = 'title:\\s+', text = text), x = text))


Or you define a general rule, which should work for every line. Something like this:

sapply(text,  FUN = function(x) {
  regmatches(x, regexec(text = x, pattern = "([^:]+)"))
  })

sapply(text,  FUN = function(x) {
  regmatches(x, regexec(text = x, pattern = "(:.*)"))
})

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  一生所求        
                
              
                            
                2021-01-25 03:02
              
            
            
                                                                       
I am mostly using baseR here (apart from zoo and tiydr), may be little long code, but it can get the desired results.

options(stringsAsFactors = F)
text <- readLines("https://raw.githubusercontent.com/pranavn91/PhD/master/Expt/sample.txt") #Input file

textdf <- data.frame(text, stringsAsFactors = F) #Reading it
search_words <- c("Id","ASIN","title","group","salesrank","similar","avg rating") #search words as per OP
textdf <- data.frame(text = textdf[grepl(paste0(search_words,collapse = "|"), textdf$text),]) #finding the words and filtering it
textdf$key <- as.numeric(gsub("Id:\\s+(\\d+)","\\1",textdf$text))
View(textdf) # Making a key for each Id

textdf$key <- zoo::na.locf(textdf$key) #Propagating the key for same set of Ids
textdf$text <- gsub( "(.*)(?=avg rating:\\s*\\d+)","", textdf$text, perl=T) #Removing text from before "avg rating" 
textdf$text <- gsub("(similar:\\s*\\d+)(.*)","\\1", textdf$text, perl=T) #Removing text after "similar"
textdf$text <- trimws(textdf$text) ##removing leading and trailing blanks
textdf$text <- sub(":","+",textdf$text) #Replacing the first instance of : so that we can split with plus sign, since plus sign is very uncommon hence took it
splits <- strsplit(textdf$text, "\\+")  #Splitting 
max_len <- max(lengths(splits)) #checking for max length of items in the list
all_lyst_eq_len <- lapply(splits, `length<-`, max_len) #equaling the list
df_final <- data.frame(cbind(do.call('rbind', all_lyst_eq_len), textdf$key))# binding the data frame

df_final <- df_final[!duplicated(df_final),] #Removing the duplicates, there is some dups in data
df_f <- tidyr::spread(df_final, X1,X2) # Reshaping it(transposing)

df_f[,c("Id","ASIN", "title", "group","similar",
            "avg rating")] #Final dataset 


Output:

The text file is very wrapped up hence adding a screenshot , my apologies to community.

The output is ditto as per OP.


                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复