I have a list like this:
.NET
ABAP
Access
Account Management
Accounting
Active Directory
Agile Methodologies
Agile Project Management
AJAX
Algorithms
Analysis
An
This isn't word2vec, but here is an alternative that reads the list straight from this post and groups similar strings:
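library(XML)
library(dplyr)
library(RecordLinkage)

# Read the skill list from the question's code block, one entry per row.
# Note: XML::htmlParse() may not be able to fetch https:// URLs directly; if this
# fails, download the page first and pass the text with asText = TRUE.
df <- data.frame(words=capture.output(htmlParse("https://stackoverflow.com/questions/35904182/word2vec-for-text-mining-categories")[["//div/pre/code/text()"]]))

# Compare all pairs with a fuzzy string comparison (Jaro-Winkler by default),
# weight them, and keep the pairs classified as links at a threshold of 0.8.
df %>% compare.dedup(strcmp = TRUE) %>%
  epiWeights() %>%
  epiClassify(0.8) %>%
  getPairs(show = "links", single.rows = TRUE) -> matches

# Give every word the ID of the lowest-numbered word it is linked to.
left_join(mutate(df, ID = 1:nrow(df)),
          select(matches, id1, id2) %>% arrange(id1) %>% filter(!duplicated(id2)),
          by = c("ID" = "id2")) %>%
  mutate(ID = ifelse(is.na(id1), ID, id1)) %>%
  select(-id1) -> dfnew

head(dfnew, 30)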
#     words                     ID
# 1   .NET                      1
# 2   ABAP                      2
# 3   Access                    3
# 4   Account Management        4   # <--
# 5   Accounting                4   # <--
# 6   Active Directory          6
# 7   Agile Methodologies       7   # <--
# 8   Agile Project Management  7   # <--
# 9   AJAX                      9
# 10  Algorithms                10
# 11  Analysis                  11
# 12  Android                   12  # <--
# 13  Android Development       12  # <--
# 14  AngularJS                 14
# 15  Ant                       15
# 16  Apache                    16
# 17  ASP                       17  # <--
# 18  ASP.NET                   17  # <--
# 19  B2B                       19
# 20  Banking                   20
# 21  BPMN                      21
# 22  Budgets                   22
# 23  Business Analysis         23  # <--
# 24  Business Development      23  # <--
# 25  Business Intelligence     23  # <--
# 26  Business Planning         23  # <--
# 27  Business Process          23  # <--
# 28  Business Process Design   23  # <--
# 29  Business Process...       23  # <--
# 30  Business Strategy         23  # <--
dfnew$ID can then serve as your abstract category, based on Jaro-Winkler string similarity. It will probably need some fine-tuning for your real data, though.
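To get a feel for the fine-tuning, it can help to look at the raw Jaro-Winkler similarities for a few pairs and see which ones survive different cutoffs. The sketch below is only an illustration: it uses jarowinkler() from RecordLinkage on six entries from the list, and pairs_above() is a hypothetical helper written for this example, not part of the package. As far as I can tell, with a single comparison column the weight that epiClassify() thresholds tracks this similarity closely, but treat that as an assumption and check it against your own data.

```r
library(RecordLinkage)

# A few skill names taken from the list in the question
skills <- c("Account Management", "Accounting",
            "Agile Methodologies", "Agile Project Management",
            "ASP", "ASP.NET")

# Pairwise Jaro-Winkler similarity (1 = identical strings, 0 = no similarity).
# Vectorize() makes the element-wise call explicit for outer().
sims <- outer(skills, skills, Vectorize(jarowinkler))
dimnames(sims) <- list(skills, skills)
round(sims, 2)

# Hypothetical helper: list the distinct pairs whose similarity
# reaches a given cutoff (upper triangle only, so no self-pairs).
pairs_above <- function(sim, cutoff) {
  idx <- which(sim >= cutoff & upper.tri(sim), arr.ind = TRUE)
  data.frame(a = rownames(sim)[idx[, "row"]],
             b = colnames(sim)[idx[, "col"]],
             similarity = sim[idx])
}

# Compare how many pairs are grouped at a looser and a stricter cutoff
pairs_above(sims, 0.8)
pairs_above(sims, 0.9)
```

Raising the cutoff passed to epiClassify() should split broad groups such as the "Business ..." block into smaller ones, while lowering it merges more entries; inspecting a handful of pairs like this is a cheap way to pick a starting value.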