text-mining

Matching a list of phrases to a corpus of documents and returning phrase frequency

Submitted by 我与影子孤独终老i on 2020-02-27 12:04:24

Question: I have a list of phrases and a corpus of documents. There are 100k+ phrases and 60k+ documents in the corpus. The phrases may or may not be present in the corpus. I want to find the term frequency of each phrase that is present in the corpus. An example dataset: Phrases <- c("just starting", "several kilometers", "brief stroll", "gradually boost", "5 miles", "dark night", "cold morning") Doc1 <- "If you're just starting with workout, begin slow." Doc2 <- "Don't jump in brain initial and …
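A minimal sketch of the brute-force approach, not taken from the question: treat each phrase as a fixed string and sum its matches across the document vector with stringr::str_count. The second document below is invented, since the snippet above is cut off, and at 100k phrases against 60k documents this naive loop would be slow, so it only shows the counting logic.

library(stringr)

Phrases <- c("just starting", "several kilometers", "brief stroll",
             "gradually boost", "5 miles", "dark night", "cold morning")
Docs <- c(Doc1 = "If you're just starting with workout, begin slow.",
          Doc2 = "A brief stroll on a cold morning can gradually boost you to 5 miles.")  # invented stand-in

# For each phrase, count fixed-string occurrences in every document and sum them
phrase_freq <- vapply(Phrases,
                      function(p) sum(str_count(Docs, fixed(p))),
                      numeric(1))
phrase_freq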

finding key phrases using tm package in r

Submitted by 南楼画角 on 2020-02-25 03:06:48

Question: I have a project requiring me to search the annual reports of various companies and find key phrases in them. I have converted the reports to text files, created and cleaned a corpus, and then created a document-term matrix. The tm_term_score function only seems to work for single words, not phrases. Is it possible to search the corpus for key phrases (not necessarily the most frequent ones)? For example, I want to see how many times the phrase “supply chain finance” appears in each document in the corpus.
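One way around tm_term_score's single-word limitation, sketched here rather than taken from an answer, is to count the phrase directly in each document's text; the report texts below are invented placeholders. Another common route is to pass an n-gram tokenizer (e.g. RWeka::NGramTokenizer) to DocumentTermMatrix via its control list so that multi-word phrases become DTM terms.

library(stringr)

# Invented placeholder texts standing in for the cleaned report documents
reports <- c(report_2018 = "supply chain finance expanded; supply chain finance margins rose",
             report_2019 = "no discussion of that programme this year")

# Per-document count of the exact phrase
str_count(reports, fixed("supply chain finance"))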

How to find matching words in a DF from a list of words and return the matched words in a new column [duplicate]

Submitted by 雨燕双飞 on 2020-01-30 13:14:27

Question: This question already has an answer here: Find matches of a vector of strings in another vector of strings (1 answer). Closed 2 years ago. I have a DF with 2 columns and a list of words. list_of_words <- c("tiger","elephant","rabbit", "hen", "dog", "Lion", "camel", "horse") df <- tibble::tibble(page=c(12,6,9,18,2,15,81,65), text=c("I have two pets: a dog and a hen", "lion and Tiger are dangerous animals", "I have tried to ride a horse", "Why elephants are so big in size", "dogs are …
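A minimal sketch, not from the linked duplicate: build one case-insensitive, word-bounded regex from list_of_words, pull every match out of each row of text with stringr::str_extract_all, and paste the matches into a new column. Object and column names follow the snippet above.

library(dplyr)
library(stringr)

# One alternation pattern with word boundaries, matched case-insensitively
pattern <- regex(paste0("\\b(", paste(list_of_words, collapse = "|"), ")\\b"),
                 ignore_case = TRUE)

df <- df %>%
  mutate(matched_words = sapply(
    str_extract_all(text, pattern),
    function(m) paste(unique(tolower(m)), collapse = ", ")
  ))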

tm: read in data frame, keep text IDs, construct DTM and join to other dataset

Submitted by [亡魂溺海] on 2020-01-29 02:30:24

Question: I'm using the tm package. Say I have a data frame of 2 columns and 500 rows. The first column is an ID, which is randomly generated and contains both characters and numbers, e.g. "txF87uyK". The second column is the actual text, e.g. "Today's weather is good. John went jogging. blah, blah,..." Now I want to create a document-term matrix from this data frame. My problem is that I want to keep the ID information so that, after I get the document-term matrix, I can join this matrix with another matrix that has each row being …
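A minimal sketch of one way to keep the IDs: recent tm versions let DataframeSource pick up document IDs when the data frame has columns named exactly doc_id and text, and those IDs then show up as the document names of the DTM. The second row below is invented for illustration.

library(tm)

docs_df <- data.frame(
  doc_id = c("txF87uyK", "qZ3mB71c"),          # second ID invented for the example
  text   = c("Today's weather is good. John went jogging.",
             "Another short document for illustration."),
  stringsAsFactors = FALSE
)

corp <- VCorpus(DataframeSource(docs_df))       # document names become the doc_id values
dtm  <- DocumentTermMatrix(corp)
Docs(dtm)                                       # the original IDs, usable as a join key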

Uploading many files in Shiny

Submitted by 夙愿已清 on 2020-01-22 15:33:06

Question: I am developing an app that helps to organize and visualize many PDF documents by topic/theme. I can upload and read a single PDF, but I have difficulty reading multiple PDF documents. For a single PDF document: ui.R --- fileInput('file1', 'Choose PDF File', accept=c('.pdf')) --- server.R --- library(pdftools) --- mypdf <- reactive({ inFile <- input$file1 if (is.null(inFile)){ return(NULL) }else{ pdf_text(inFile$datapath) } }) To upload multiple PDF files, I have to use multiple = TRUE …
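A minimal sketch of the multi-file case, not taken from an answer: with multiple = TRUE, input$file1 becomes a data frame with one row per uploaded file, so the server can loop over its datapath column. The output name here is invented for the example.

library(shiny)
library(pdftools)

ui <- fluidPage(
  fileInput("file1", "Choose PDF files", multiple = TRUE, accept = ".pdf"),
  verbatimTextOutput("pages_per_pdf")          # invented output, just to show the result
)

server <- function(input, output) {
  all_pdfs <- reactive({
    req(input$file1)                           # wait until at least one file is uploaded
    lapply(input$file1$datapath, pdf_text)     # one character vector (pages) per PDF
  })
  output$pages_per_pdf <- renderPrint(lengths(all_pdfs()))
}

shinyApp(ui, server)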

How to determine the (natural) language of a document?

Submitted by 北慕城南 on 2020-01-20 14:21:05

Question: I have a set of documents in two languages: English and German. There is no usable meta information about these documents; a program can look at the content only. Based on that, the program has to decide which of the two languages the document is written in. Is there any "standard" algorithm for this problem that can be implemented in a few hours' time? Or, alternatively, a free .NET library or toolkit that can do this? I know about LingPipe, but it is Java and not free for "semi-commercial" usage. …
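Not a .NET answer, but to stay with the R used elsewhere on this page, here is a sketch of the classic few-hours heuristic: count hits against a small stopword list per candidate language and pick the language with more hits. The word lists below are tiny illustrative samples, not complete ones.

# Distinguish English from German by counting stopword hits (toy word lists)
detect_en_de <- function(txt) {
  en <- c("the", "and", "of", "to", "is", "in", "that", "with")
  de <- c("der", "die", "das", "und", "ist", "nicht", "mit", "von")
  words <- tolower(unlist(strsplit(txt, "[^[:alpha:]]+")))
  if (sum(words %in% en) >= sum(words %in% de)) "English" else "German"
}

detect_en_de("Das ist ein kurzer deutscher Satz und nicht mehr.")
# "German"

For an off-the-shelf alternative in R, the textcat package implements the character n-gram method of Cavnar and Trenkle.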

extract text from google scholar

Submitted by 强颜欢笑 on 2020-01-17 08:35:49

Question: I am trying to extract the text snippet that Google Scholar gives for a particular query. By text snippet I mean the text below the title (in black letters). Currently I am trying to extract it from the HTML file using Python, but it contains a lot of extra text such as /div><div class="gs_fl" ... etc. Is there an easy way, or some code, that can help me get the text without this redundant markup? Answer 1: You need an HTML parser: import lxml.html doc = lxml.html.fromstring(html) text = …
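The lxml answer above is cut off; to keep this page in R, here is a separate sketch using rvest on a locally saved results page. The ".gs_rs" selector for the snippet text is an assumption about Google Scholar's markup (the question only shows the "gs_fl" class) and may need adjusting, and the file name is illustrative.

library(rvest)

# Parse a results page saved to disk
page <- read_html("scholar_results.html")

# ".gs_rs" is assumed to be the class of the snippet text under each title
snippets <- html_text(html_elements(page, ".gs_rs"), trim = TRUE)
snippets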

How do I check if a text column in my dataframe contains any of a list of possible patterns, allowing for mistyping?

Submitted by 好久不见. on 2020-01-14 14:25:47

Question: I have a column called 'text' in my dataframe, where a lot of things are written. I am trying to verify whether this column contains any of the strings from a list of patterns (e.g. pattern1, pattern2, pattern3). I hope to create another boolean column stating whether any of those patterns were found. Importantly, I want the pattern to match even when there are small mistyping issues. For example, if my list of patterns includes 'mickey' and 'mouse', I want it to match 'm0use' …
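A minimal sketch, not from an answer: base R's agrepl() does approximate (edit-distance) matching, so 'm0use' can still hit the pattern 'mouse' when one edit is allowed; combining one agrepl() call per pattern with | gives the boolean column. The data frame below is a toy stand-in for the real one.

patterns <- c("mickey", "mouse")

df <- data.frame(text = c("I saw a m0use in the kitchen",
                          "nothing relevant here"))          # toy stand-in

# TRUE if any pattern matches the row's text within one edit, ignoring case
df$has_pattern <- Reduce(
  `|`,
  lapply(patterns, function(p)
    agrepl(p, df$text, max.distance = 1, ignore.case = TRUE))
)
df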