word2vec for text mining categories

囚心锁ツ 2021-01-26 02:32

I have a list like this:

.NET
ABAP
Access
Account Management
Accounting
Active Directory
Agile Methodologies
Agile Project Management
AJAX
Algorithms
Analysis
An         


        
1 Answer
  • 2021-01-26 02:54

    This is not word2vec, but here is an alternative that groups the entries by string similarity, reading the list straight from the original post:

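    library(XML)            # htmlParse() for reading the page
    library(dplyr)          # pipes, joins, mutate()/select()
    library(RecordLinkage)  # compare.dedup(), epiWeights(), epiClassify(), getPairs()
    # Read the list from the first code block of the original question.
    df <- data.frame(words=capture.output(htmlParse("https://stackoverflow.com/questions/35904182/word2vec-for-text-mining-categories")[["//div/pre/code/text()"]]))
    # Compare all pairs of entries with a string comparator (Jaro-Winkler by default),
    # compute weights, and classify pairs with a weight of 0.8 or more as links.
    df %>% compare.dedup(strcmp = TRUE) %>%
                 epiWeights() %>%
                 epiClassify(0.8) %>%
                 getPairs(show = "links", single.rows = TRUE) -> matches
    # Give each entry the ID of the lowest-numbered entry it is linked to;
    # unlinked entries keep their own row number as ID.
    left_join(mutate(df,ID = 1:nrow(df)), 
              select(matches,id1,id2) %>% arrange(id1) %>% filter(!duplicated(id2)), 
              by=c("ID"="id2")) %>%
        mutate(ID = ifelse(is.na(id1), ID, id1) ) %>%
        select(-id1) -> dfnew
    head(dfnew, 30)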
    #                       words ID
    # 1                      .NET  1
    # 2                      ABAP  2
    # 3                    Access  3
    # 4        Account Management  4 # <--
    # 5                Accounting  4 # <--
    # 6          Active Directory  6
    # 7       Agile Methodologies  7 # <--
    # 8  Agile Project Management  7 # <--
    # 9                      AJAX  9
    # 10               Algorithms 10
    # 11                 Analysis 11
    # 12                  Android 12 # <--
    # 13      Android Development 12 # <--
    # 14                AngularJS 14
    # 15                      Ant 15
    # 16                   Apache 16
    # 17                      ASP 17 # <--
    # 18                  ASP.NET 17 # <--
    # 19                      B2B 19
    # 20                  Banking 20
    # 21                     BPMN 21
    # 22                  Budgets 22
    # 23        Business Analysis 23 # <--
    # 24     Business Development 23 # <--
    # 25    Business Intelligence 23 # <--
    # 26        Business Planning 23 # <--
    # 27         Business Process 23 # <--
    # 28  Business Process Design 23 # <--
    # 29      Business Process... 23 # <--
    # 30        Business Strategy 23 # <--
    

    dfnew$ID can serve as your abstract category, based on Jaro-Winkler string distances. The 0.8 threshold passed to epiClassify() will probably need some fine-tuning for your real data.
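
    To get a feel for how the threshold affects the grouping, here is a minimal sketch that does not scrape the page. It is not the pipeline above: it swaps in single-linkage hierarchical clustering on pairwise Jaro-Winkler similarities, and the sample vector and the three thresholds are assumptions chosen only for illustration.

```r
library(RecordLinkage)  # provides jarowinkler()

# Hypothetical sample; substitute your own vector of terms.
words <- c(".NET", "ABAP", "Android", "Android Development",
           "Business Analysis", "Business Development")

# Pairwise Jaro-Winkler similarity (1 = identical strings).
sim <- outer(words, words, jarowinkler)
dimnames(sim) <- list(words, words)

# Single-linkage clustering on the dissimilarity 1 - sim.
tree <- hclust(as.dist(1 - sim), method = "single")

# Group labels for several similarity thresholds: cutting the tree at
# height 1 - t merges terms that are chained together by links whose
# similarity is at least t.
thresholds <- c(0.7, 0.8, 0.9)
groups <- sapply(thresholds, function(t) cutree(tree, h = 1 - t))
colnames(groups) <- paste0("t_", thresholds)
print(groups)
```

    Each column holds the group label per term at one threshold. Lower thresholds merge more terms into the same category, higher ones split them apart, so comparing the columns on a sample of your real data is one way to pick a value.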
