text-mining

How to improve performance when working with Wikipedia data and a huge number of webpages?

拟墨画扇 submitted on 2019-12-13 12:50:21
Question: I am supposed to extract representative terms from an organisation's website using Wikipedia's article-link data dump. To achieve this I've: crawled & downloaded the organisation's webpages (~110,000); created a dictionary of Wikipedia IDs and terms/titles (~40 million records). Now I'm supposed to process each of the webpages using the dictionary to recognise terms and track their term IDs & frequencies. For the dictionary to fit in memory, I've split the dictionary into smaller files. Based
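
One common way to avoid re-reading split dictionary files per page is a single in-memory hash map with one lookup per token. A minimal Python sketch of that idea (the term dictionary and IDs here are toy values, not the asker's data):

```python
from collections import Counter
import re

def count_term_hits(page_text, term_to_id):
    """Count occurrences of known dictionary terms in one page.

    term_to_id maps a lowercased term to its Wikipedia ID; each token
    costs one O(1) dict lookup, so a single pass per page suffices.
    """
    tokens = re.findall(r"[a-z0-9]+", page_text.lower())
    freqs = Counter(t for t in tokens if t in term_to_id)
    return {term_to_id[t]: n for t, n in freqs.items()}

term_to_id = {"glove": 1001, "regulator": 1002}  # toy dictionary
hits = count_term_hits("A glove and another glove near a regulator.", term_to_id)
print(hits)  # {1001: 2, 1002: 1}
```

For multi-word titles, the same single-pass principle applies but needs a multi-pattern matcher (e.g. Aho-Corasick) instead of per-token lookups.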

R: Need content_transformer() called by tm_map() to change non-letters to spaces

生来就可爱ヽ(ⅴ&lt;●) submitted on 2019-12-13 10:20:23
Question: In the following code, any characters matching "/|@| \\|" will be changed to a space. > library(tm) > toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x)) > docs <- tm_map(docs, toSpace, "/|@| \\|") What code would transform all non-letters to a space? (What goes where the xxxxx's are below.) It is very difficult to put all non-letters in a string... (Very long list, some non-printable, plus the character-escaping issues.) So I'm doing the opposite of the above. >
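
Rather than enumerating every non-letter, a negated character class matches "anything that is not a letter"; in R the analogous pattern would be `gsub("[^a-zA-Z]", " ", x)`. A Python sketch of the same negated-class idea:

```python
import re

def to_space_non_letters(text):
    # Replace every character that is NOT a letter with a space;
    # the negated class [^A-Za-z] avoids listing all non-letters.
    return re.sub(r"[^A-Za-z]", " ", text)

print(to_space_non_letters("user@example.com/path|x"))  # "user example com path x"
```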

How to replace one part of a URL using R

别等时光非礼了梦想. submitted on 2019-12-13 09:39:30
Question: Currently I have the website http://www.amazon.com/Apple-generation-Tablet-processor-White/product-reviews/B0047DVWLW/ref=cm_cr_pr_btm_link_2?ie=UTF8&pageNumber=1&showViewpoints=0&sortBy=bySubmissionDateDescending I want the part pageNumber=1 to be replaced with a sequence of numbers such as 1,2,3,.....n I know I need to use the paste function. But how do I locate this number and replace it? Answer 1: You can use the parseQueryString function from the shiny package or parse_url and
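
The substitution itself is a regex replace on the digits after `pageNumber=` (in R, `gsub` or `sprintf` would do the same). A Python sketch of that one idea, using the URL from the question:

```python
import re

base = ("http://www.amazon.com/Apple-generation-Tablet-processor-White/"
        "product-reviews/B0047DVWLW/ref=cm_cr_pr_btm_link_2"
        "?ie=UTF8&pageNumber=1&showViewpoints=0&sortBy=bySubmissionDateDescending")

# Target only the digits after "pageNumber=" so the rest of the
# query string is left untouched.
urls = [re.sub(r"pageNumber=\d+", f"pageNumber={n}", base) for n in range(1, 4)]
print(urls[2])
```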

Identify an English word as a thing or product?

对着背影说爱祢 submitted on 2019-12-13 07:42:41
Question: Write a program with the following objective: be able to identify whether a word/phrase represents a thing/product. For example: 1) "A glove comprising at least an index finger receptacle, a middle finger receptacle.." <- Be able to identify glove as a thing/product. 2) "In a window regulator , especially for automobiles, in which the window is connected to a drive..." <- Be able to identify regulator as a thing. Doing this tells me that the text is talking about a thing/product. as a
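
A full solution needs NLP (noun-phrase chunking or entity typing), but for patent-style text like the examples, a toy cue-word heuristic already recovers the noun in both sentences. This sketch is only an illustration of that heuristic, not a general product detector:

```python
import re

def find_product(sentence):
    # Toy heuristic: in claim-style sentences the claimed product often
    # sits right before cue words such as "comprising" or ", especially".
    m = re.search(r"(\w+)\s+comprising|(\w+)\s*,\s*especially", sentence)
    if m:
        return m.group(1) or m.group(2)
    return None

print(find_product("A glove comprising at least an index finger receptacle"))
# glove
```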

R: extract and paste keyword matches

会有一股神秘感。 submitted on 2019-12-13 07:38:06
Question: I am new to R and have been struggling with this one. I want to create a new column that checks whether any of a set of words ("foo", "x", "y") exists in column 'text', then writes that value in the new column. I have a data frame that looks like this: a-> id text time username 1 "hello x" 10 "me" 2 "foo and y" 5 "you" 3 "nothing" 15 "everyone" 4 "x,y,foo" 0 "know" The correct output should be: a2 -> id text time username keywordtag 1 "hello x" 10 "me" x 2 "foo and y" 5 "you" foo,y 3 "nothing" 15
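
The per-row logic is: test each keyword as a whole word, then join the hits with commas (in R this would typically be `sapply` over `grepl` results plus `paste`). A Python sketch of that logic, with the keywords from the question:

```python
import re

KEYWORDS = ["foo", "x", "y"]

def tag_keywords(text):
    # Whole-word matching, so "x" does not match inside e.g. "box";
    # hits keep the order of the keyword list and are comma-joined.
    found = [k for k in KEYWORDS if re.search(rf"\b{re.escape(k)}\b", text)]
    return ",".join(found)

print(tag_keywords("foo and y"))  # foo,y
print(tag_keywords("nothing"))    # (empty string)
```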

Extracting specific data from text column in R

点点圈 submitted on 2019-12-13 07:02:30
Question: I have a data set of medicine names in a column. I am trying to extract the name, strength, and unit of each medicine from this data. The terms MG and ML are the qualifiers of strength in this setup. For example, consider the following data set of medicine names: Medicine name ---------------------- FALCAN 150 MG tab AUGMENTIN 500MG tab PRE-13 0.5 ML PFS inj NS.9%w/v 250 ML, Glass Bottle I want to extract the following information columns from this data set: Name |
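
Because names can themselves contain digits ("PRE-13", "NS.9%w/v"), anchoring on the number immediately followed by MG/ML is more robust than splitting on the first digit. A Python sketch of that regex approach, tried against the rows from the question:

```python
import re

def parse_medicine(row):
    # Strength is a number directly followed (with optional space)
    # by MG or ML; the name is everything before that number.
    m = re.search(r"(\d+(?:\.\d+)?)\s*(MG|ML)\b", row, re.IGNORECASE)
    if not m:
        return None
    name = row[:m.start()].strip()
    return {"name": name, "strength": float(m.group(1)), "unit": m.group(2).upper()}

print(parse_medicine("PRE-13 0.5 ML PFS inj"))
# {'name': 'PRE-13', 'strength': 0.5, 'unit': 'ML'}
```

Note that "13" in "PRE-13" and "9" in "NS.9%w/v" are skipped because they are not followed by MG/ML.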

Extract emotion calculations for every row of a dataframe

↘锁芯ラ submitted on 2019-12-13 05:09:51
Question: I have a dataframe with rows of text. For each row of text I would like to extract a vector of specific emotions, binary: 0 if the emotion is absent, 1 if it is present. There are 5 emotions in total, but I would like the 1 only for the emotion that scores highest. Example of what I have tried: library(tidytext) text = data.frame(id = c(11,12,13), text=c("bad movie","good movie","I think it would benefit religious people to see things like this, not just to learn about our
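
The "one-hot on the top emotion" step is independent of the lexicon used (tidytext would supply the NRC lexicon in R). A Python sketch with a tiny invented toy lexicon, purely to show the score-then-one-hot logic:

```python
import re

# Toy lexicon for illustration only; a real solution would use a
# published emotion lexicon such as NRC.
LEXICON = {
    "anger": {"bad", "hate"},
    "joy": {"good", "benefit", "happy"},
    "sadness": {"sad", "cry"},
    "fear": {"afraid", "scary"},
    "disgust": {"gross"},
}

def emotion_onehot(text):
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    scores = {e: len(tokens & words) for e, words in LEXICON.items()}
    top = max(scores, key=scores.get)
    if scores[top] == 0:          # no emotion word found: all zeros
        return {e: 0 for e in LEXICON}
    return {e: int(e == top) for e in LEXICON}  # 1 only for the top emotion

print(emotion_onehot("bad movie"))
```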

Combine corpora in tm 0.7.3

蓝咒 submitted on 2019-12-13 03:54:11
Question: Using the text mining package tm for R, the following works in version 0.6.2, R version 3.4.3: library(tm) a = "This is the first document." b = "This is the second document." c = "This is the third document." d = "This is the fourth document." docs1 = VectorSource(c(a,b)) docs2 = VectorSource(c(c,d)) corpus1 = Corpus(docs1) corpus2 = Corpus(docs2) corpus3 = c(corpus1,corpus2) inspect(corpus3) <<VCorpus>> Metadata: corpus specific: 0, document level (indexed): 0 Content: documents: 4 However,

Is there any way to extract header and footer and title page of a PDF document?

梦想的初衷 submitted on 2019-12-13 03:01:20
Question: I want to know if there is any package to detect and extract the header and footer or title page from a PDF document. I am new to text mining using Python, and I want to know, for example, whether pdfminer.layout could help find any text block in PDFs. Answer 1: Apache Tika also does metadata extraction. You can also extract names, titles (including multiple titles), dates, number of pages, modified dates, and more. import tika from tika import parser filename = "your file name here" parsedPDF = parser.from_file
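
Neither pdfminer nor Tika labels headers/footers directly; a common heuristic once you have per-page text (from either tool) is to drop lines that repeat on most pages. A Python sketch of that heuristic, independent of the extraction library:

```python
from collections import Counter

def strip_repeated_lines(pages, min_ratio=0.6):
    """Drop lines that recur on most pages (likely headers/footers).

    pages: list of page texts already extracted, e.g. by pdfminer or Tika.
    min_ratio: fraction of pages a line must appear on to count as boilerplate.
    """
    line_pages = Counter()
    for page in pages:
        for line in set(page.splitlines()):  # count each line once per page
            line_pages[line] += 1
    threshold = max(2, int(len(pages) * min_ratio))
    boiler = {l for l, n in line_pages.items() if n >= threshold}
    return ["\n".join(l for l in p.splitlines() if l not in boiler)
            for p in pages]
```

Page numbers vary per page, so catching them needs an extra pattern (e.g. a `Page \d+` regex) on top of this.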

Extract n Words Around Defined Term (Multicase)

╄→尐↘猪︶ㄣ submitted on 2019-12-13 02:04:09
Question: I have a vector of text strings, such as: Sentences <- c("I would have gotten the promotion, but TEST my attendance wasn’t good enough. Let me help you with your baggage.", "Everyone was busy, so I went to the movie alone. Two seats were vacant.", "TEST Rock music approaches at high velocity.", "I am happy to take your TEST donation; any amount will be greatly TEST appreciated.", "A purple pig and a green donkey TEST flew a TEST kite in the middle of the night and ended up sunburnt.", "Rock
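
The core of the windowing logic is: split into words, find every index of the term, and slice ±n around it (handling multiple occurrences per sentence and windows clipped at sentence edges). A Python sketch of that logic:

```python
def words_around(sentence, term="TEST", n=2):
    # Return the n words on each side of every occurrence of term;
    # max(0, i - n) clips windows that would start before the sentence.
    tokens = sentence.split()
    spans = []
    for i, tok in enumerate(tokens):
        if tok == term:
            spans.append(" ".join(tokens[max(0, i - n):i + n + 1]))
    return spans

print(words_around("TEST Rock music approaches at high velocity."))
# ['TEST Rock music']
```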