Twitter Data Analysis - Error in Term Document Matrix

滥情空心 2020-12-03 18:30

I am trying to do some analysis of Twitter data. I downloaded the tweets and created a corpus from the text of the tweets using the code below:

# Creating a Corpus
wim_co         


        
6 Answers
  • 2020-12-03 19:08

    I had the same problem and it turns out it is an issue with package compatibility. Try installing

    install.packages("SnowballC")
    

    and load with

    library(SnowballC)
    

    before calling DocumentTermMatrix.

    It solved my problem.
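
    A minimal sketch of that fix, using made-up tweet texts (the corpus name and sample data below are illustrative, not from the original question):

    install.packages("SnowballC")   # word-stemming backend used by tm's stemDocument()
    library(tm)
    library(SnowballC)

    # Placeholder tweets standing in for the downloaded data
    tweets <- c("Analysing some twitter data in #rstats",
                "Building a term document matrix from tweets")
    wim_corpus <- Corpus(VectorSource(tweets))

    # With SnowballC installed and loaded, this should no longer error
    dtm <- DocumentTermMatrix(wim_corpus)
    inspect(dtm)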

  • 2020-12-03 19:12

    I think the error is due to some "exotic" characters within the tweet messages, which the tm function cannot handle. I got the same error when using tweets as a corpus source. Maybe the following workaround helps:

    # Reading some tweet messages (here from a text file) into a vector

    rawTweets <- readLines(con = "target_7_sample.txt", ok = TRUE, warn = FALSE, encoding = "utf-8") 
    

    # Convert the tweet text explicitly into utf-8

    convTweets <- iconv(rawTweets, to = "utf-8")
    

    # The above conversion leaves you with vector entries "NA", i.e. those tweets that can't be handled. Remove the "NA" entries with the following command:

    tweets <- (convTweets[!is.na(convTweets)])
    

    If the deletion of some tweets is not an issue for your solution (e.g. building a word cloud), then this approach may work, and you can proceed by calling the Corpus function of the tm package.
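
    For example, continuing from the cleaned tweets vector above (the corpus and matrix names here are just illustrative):

    library(tm)

    # Build the corpus from the NA-free character vector and create the matrix
    tweetCorpus <- Corpus(VectorSource(tweets))
    tdm <- TermDocumentMatrix(tweetCorpus)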

    Regards--Albert

  • 2020-12-03 19:12

    As Albert suggested, converting the text encoding to "utf-8" solved the problem for me. But instead of removing the whole tweet with problematic characters, you can use the sub option in iconv to only remove the "bad" characters in a tweet and keep the rest:

    tweets <- iconv(rawTweets, to = "utf-8", sub="")
    

    This no longer produces NAs, so no further filtering step is necessary.
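
    A small sketch of the difference, using a made-up string with an invalid byte:

    x <- "caf\xe9 tweet"                 # a string carrying a byte that is invalid in UTF-8
    iconv(x, to = "utf-8")               # returns NA if the byte cannot be converted: whole tweet lost
    iconv(x, to = "utf-8", sub = "")     # bad bytes are replaced by "", the rest of the tweet survives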

  • 2020-12-03 19:12

    There were some German umlaut letters and some special fonts that were causing the errors. I could not remove them in R, even by converting to UTF-8 (I am a new R user), so I used Excel to remove the German letters, and after that there were no errors.
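
    For reference, this can usually also be done in R: iconv() can transliterate or drop such characters (a sketch, assuming the offending characters are umlauts in UTF-8 text):

    x <- c("Müller", "Straße")
    iconv(x, from = "UTF-8", to = "ASCII//TRANSLIT")   # transliterate (e.g. ü -> u/ue, ß -> ss); platform-dependent
    iconv(x, from = "UTF-8", to = "ASCII", sub = "")   # or simply drop anything non-ASCII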

  • 2020-12-03 19:22

    I think this problem happens because some weird characters appear in the text. Here is my solution:

    # str_replace_all() comes from the stringr package; wrapping it in
    # content_transformer() keeps the documents' class so later tm steps still work
    library(stringr)
    wim_corpus <- tm_map(wim_corpus,
                         content_transformer(function(x) str_replace_all(x, "[^[:alnum:]]", " ")))
    
    
    tdm = TermDocumentMatrix(wim_corpus, 
                           control = list(removePunctuation = TRUE, 
                                          stopwords =  TRUE, 
                                          removeNumbers = TRUE, tolower = TRUE))
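
    As a quick sanity check (not part of the original answer), you can look at the resulting matrix with tm's own helpers:

    inspect(tdm)                         # dimensions, sparsity and a sample of the matrix
    findFreqTerms(tdm, lowfreq = 10)     # terms appearing at least 10 times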
    
  • 2020-12-03 19:23

    I have found a way to solve this problem in an article about the tm package.

    An example that reproduces the error follows:

    getwd()
    require(tm)
    
    # Importing files
    files <- DirSource(directory = "texts/",encoding ="latin1" )
    
    # loading files and creating a Corpus
    corpus <- VCorpus(x=files)
    
    # Summary
    
    summary(corpus)
    corpus <- tm_map(corpus,removePunctuation)
    corpus <- tm_map(corpus,stripWhitespace)
    corpus <- tm_map(corpus,removePunctuation)
    matrix_terms <- DocumentTermMatrix(corpus)
    
    Warning messages:
    In TermDocumentMatrix.VCorpus(x, control) : invalid document identifiers
    

    This error occurs because TermDocumentMatrix expects a corpus built from a vector source, but the previous transformations turn the documents of your corpus into plain character strings, i.e. into a class that the function does not accept.

    However, if you add one more command before calling DocumentTermMatrix, you can keep going.

    Below follows the code with the new command:

    getwd()
    require(tm)  
    
    files <- DirSource(directory = "texts/",encoding ="latin1" )
    
    # loading files and creating a Corpus
    corpus <- VCorpus(x=files)
    
    # Summary 
    summary(corpus)
    corpus <- tm_map(corpus,removePunctuation)
    corpus <- tm_map(corpus,stripWhitespace)
    corpus <- tm_map(corpus,removePunctuation)
    
    # COMMAND TO CHANGE THE CLASS AND AVOID THIS ERROR
    corpus <- Corpus(VectorSource(corpus))
    matrix_terms <- DocumentTermMatrix(corpus)
    

    Therefore, you won't have any more problems with this.
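
    An alternative that avoids the class change in the first place (in newer versions of tm) is to stick to tm's built-in transformations and wrap any custom function in content_transformer(), so the documents never lose their class and the rebuild step becomes unnecessary; a sketch under that assumption:

    library(tm)

    files  <- DirSource(directory = "texts/", encoding = "latin1")
    corpus <- VCorpus(x = files)

    corpus <- tm_map(corpus, removePunctuation)               # built-in transformation
    corpus <- tm_map(corpus, stripWhitespace)                 # built-in transformation
    corpus <- tm_map(corpus, content_transformer(tolower))    # custom function, wrapped

    matrix_terms <- DocumentTermMatrix(corpus)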
