R DocumentTermMatrix loses results less than 100

Posted by 此生再无相见时 on 2019-12-08 05:58:09

Question


I'm trying to feed a corpus into DocumentTermMatrix (DTM for short) to get term frequencies, but I noticed that the DTM doesn't keep all of the terms, and I don't know why. Check it out:

library(tm)  # provides Corpus, VectorSource, inspect, DocumentTermMatrix

A <- c(" 95 94 89 91 90 102 103 100 101 98 99 97 110 108 109 106 107")
B <- c(" 95 94 89 91 90 102 103 100 101 98 99 97 110 108 109 106 107")
C <- Corpus(VectorSource(c(A, B)))
inspect(C)

>A corpus with 2 text documents
>
>The metadata consists of 2 tag-value pairs and a data frame
>Available tags are:
>  create_date creator 
>Available variables in the data frame are:
>  MetaID 
>
>[[1]]
> 95 94 89 91 90 102 103 100 101 98 99 97 110 108 109 106 107
>
>[[2]]
> 95 94 89 91 90 102 103 100 101 98 99 97 110 108 109 106 107

So far so good.

But now, I try to feed C into the DTM and it doesn't come out the other end! See:

> dtm<-DocumentTermMatrix(C)
> colnames(dtm)
>[1] "100" "101" "102" "103" "106" "107" "108" "109" "110"

Where are all the results less than 100? Or is it somehow a 2 character thing? I also tried:

dtm<-DocumentTermMatrix(C,control=list(c(1,Inf)))

and

dtm<-TermDocumentMatrix(C,control=list(c(1,Inf)))

to no avail. What gives?


Answer 1:


If you read the ?TermDocumentMatrix help page, you can see that the additional control= options are listed on the ?termFreq help page.

There is a wordLengths parameter which filters the words used in the matrix by length. It defaults to c(3,Inf), so it excludes two-character words. Try setting control=list(wordLengths=c(2,Inf)) to include those short words. (Note that when passing control parameters, you should name the parameters in the list; an unnamed c(1,Inf) is not recognized as wordLengths.)
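
As a rough illustration of the point above, the following sketch builds the matrix twice, once with the default control and once with a named wordLengths entry. It assumes the tm package is installed; the variable names (docs, corpus, dtm_default, dtm_fixed) and the use of VCorpus() instead of the question's Corpus() are choices made for this example, not something taken from the original post.

```r
library(tm)

# Same two documents as in the question.
docs <- c(" 95 94 89 91 90 102 103 100 101 98 99 97 110 108 109 106 107",
          " 95 94 89 91 90 102 103 100 101 98 99 97 110 108 109 106 107")

# Assumption for this sketch: VCorpus() rather than the question's Corpus().
corpus <- VCorpus(VectorSource(docs))

# Default control: wordLengths is c(3, Inf), so two-character tokens are dropped.
dtm_default <- DocumentTermMatrix(corpus)

# Named control entry that lowers the minimum token length to 2.
dtm_fixed <- DocumentTermMatrix(corpus,
                                control = list(wordLengths = c(2, Inf)))

# Compare the retained terms.
Terms(dtm_default)
Terms(dtm_fixed)
```

Comparing the two Terms() results should show the two-digit numbers appearing only in the second matrix. If even shorter tokens matter, the lower bound can be set to 1 in the same way.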



Source: https://stackoverflow.com/questions/24388384/r-documenttermmatrix-loses-results-less-than-100
