word2vec for text mining categories

I have a list like this:

.NET
ABAP
Access
Account Management
Accounting
Active Directory
Agile Methodologies
Agile Project Management
AJAX
Algorithms
Analysis
An


                      
1 answer

Happy的楠姐, 2021-01-26 02:54

This is not word2vec, but here is an alternative approach based on string similarity. The code below reads the list directly from the original post:

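library(XML)            # htmlParse() for reading the question page
library(dplyr)          # pipes, joins, mutate/select
library(RecordLinkage)  # compare.dedup(), epiWeights(), epiClassify(), getPairs()

# Read the list from the code block of the original question
df <- data.frame(words=capture.output(htmlParse("https://stackoverflow.com/questions/35904182/word2vec-for-text-mining-categories")[["//div/pre/code/text()"]]))

# Compare all pairs of entries with a string comparator (strcmp = TRUE),
# compute weights, and classify pairs as links using 0.8 as the upper threshold
df %>% compare.dedup(strcmp = TRUE) %>%
             epiWeights() %>%
             epiClassify(0.8) %>%
             getPairs(show = "links", single.rows = TRUE) -> matches

# For each linked entry (id2), keep its lowest-numbered partner (id1)
# and use that partner's row number as the group ID
left_join(mutate(df,ID = 1:nrow(df)), 
          select(matches,id1,id2) %>% arrange(id1) %>% filter(!duplicated(id2)), 
          by=c("ID"="id2")) %>%
    mutate(ID = ifelse(is.na(id1), ID, id1) ) %>%
    select(-id1) -> dfnew
head(dfnew, 30)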
#                       words ID
# 1                      .NET  1
# 2                      ABAP  2
# 3                    Access  3
# 4        Account Management  4 # <--
# 5                Accounting  4 # <--
# 6          Active Directory  6
# 7       Agile Methodologies  7 # <--
# 8  Agile Project Management  7 # <--
# 9                      AJAX  9
# 10               Algorithms 10
# 11                 Analysis 11
# 12                  Android 12 # <--
# 13      Android Development 12 # <--
# 14                AngularJS 14
# 15                      Ant 15
# 16                   Apache 16
# 17                      ASP 17 # <--
# 18                  ASP.NET 17 # <--
# 19                      B2B 19
# 20                  Banking 20
# 21                     BPMN 21
# 22                  Budgets 22
# 23        Business Analysis 23 # <--
# 24     Business Development 23 # <--
# 25    Business Intelligence 23 # <--
# 26        Business Planning 23 # <--
# 27         Business Process 23 # <--
# 28  Business Process Design 23 # <--
# 29      Business Process... 23 # <--
# 30        Business Strategy 23 # <--


dfnew$ID can serve as your abstract category; it is based on Jaro-Winkler string distances. It may need some fine-tuning for your real data, though.
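
To get a feel for the fine-tuning, here is a minimal sketch that groups a few entries at different cut-offs. It thresholds the raw Jaro-Winkler similarity from `jarowinkler()` in RecordLinkage directly, which is a simplification and not the same as the `epiWeights()` weights used above. The helper `group_ids` and the cut-off values are hypothetical, chosen only for illustration.

```r
library(RecordLinkage)  # provides jarowinkler()

# Hypothetical helper: give each word the ID of the first earlier word whose
# Jaro-Winkler similarity reaches `threshold`; otherwise it keeps its own index.
group_ids <- function(words, threshold) {
  ids <- seq_along(words)
  for (i in seq_along(words)[-1]) {
    earlier <- words[seq_len(i - 1)]
    sims <- jarowinkler(rep(words[i], length(earlier)), earlier)
    hit <- which(sims >= threshold)
    if (length(hit) > 0) ids[i] <- ids[hit[1]]
  }
  ids
}

# A few entries taken from the list in the question
words <- c("Account Management", "Accounting", "Active Directory",
           "Agile Methodologies", "Agile Project Management")

# Print the grouping at several cut-offs to see how sensitive it is
for (th in c(0.7, 0.8, 0.9)) {
  cat("threshold", th, ":", group_ids(words, th), "\n")
}
```

A lower cut-off merges more entries into the same ID and a higher one keeps them apart, so it is worth checking a sample of your real data at a few values before settling on one.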
                                                                        
                                                        
            
            
              
                