Programmatically extract keywords from domain names

前端未结

关注

 7  1232

余生分开走 2021-02-01 11:32

Let\'s say I have a list of domain names that I would like to analyze. Unless the domain name is hyphenated, I don\'t see a particularly easy way to \"extract\" the keywords use

7条回答

孤街浪徒 (楼主)

2021-02-01 11:38

Ok, I ran the script I wrote for this SO question, with a couple of minor changes -- using log probabilities to avoid underflow, and modifying it to read multiple files as the corpus.

For my corpus I downloaded a bunch of files from project Gutenberg -- no real method to this, just grabbed all english-language files from etext00, etext01, and etext02.

Below are the results, I saved the top three for each combination.

expertsexchange: 97 possibilities
 -  experts exchange -23.71
 -  expert sex change -31.46
 -  experts ex change -33.86

penisland: 11 possibilities
 -  pen island -20.54
 -  penis land -22.64
 -  pen is land -25.06

choosespain: 28 possibilities
 -  choose spain -21.17
 -  chooses pain -23.06
 -  choose spa in -29.41

kidsexpress: 15 possibilities
 -  kids express -23.56
 -  kid sex press -32.65
 -  kids ex press -34.98

childrenswear: 34 possibilities
 -  children swear -19.85
 -  childrens wear -25.26
 -  child ren swear -32.70

dicksonweb: 8 possibilities
 -  dickson web -27.09
 -  dick son web -30.51
 -  dicks on web -33.63

0 讨论(0)

查看其它7个回答