Programmatically extract keywords from domain names

余生分开走 2021-02-01 11:32

Let's say I have a list of domain names that I would like to analyze. Unless the domain name is hyphenated, I don't see a particularly easy way to "extract" the keywords use

7 answers
  •  栀梦 (original poster)
    2021-02-01 11:48

    You need to develop a heuristic that will get likely matches out of the domain. The way I would do it is first find a large corpus of text. For example, you could download Wikipedia.

    Next take your corpus, and combine every two adjacent words. For example, if your sentence is:

    quick brown fox jumps over the lazy dog
    

    You'll create a list:

    quickbrown
    brownfox
    foxjumps
    jumpsover
    overthe
    thelazy
    lazydog
    

    Each of these would have a count of one. As you parse your corpus, you'll keep track of how often every pair of adjacent words occurs. Additionally, for each pair, you'll need to store what the original two words were, so that a match can be split back into its keywords.
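
As a rough illustration of the counting step, here is a minimal Python sketch. The function name `build_pair_index` and the lowercase-and-split tokenization are my own assumptions, not something the approach prescribes; a real corpus would need proper tokenization and far more text.

```python
from collections import Counter

def build_pair_index(text):
    """Count concatenated adjacent word pairs and remember the original words.

    Hypothetical helper: tokenization is a plain lowercase whitespace split.
    """
    words = text.lower().split()
    counts = Counter()
    originals = {}
    for first, second in zip(words, words[1:]):
        key = first + second
        counts[key] += 1
        # Simplification: keep only the first split seen for a given key,
        # even though two different word pairs can concatenate identically.
        originals.setdefault(key, (first, second))
    return counts, originals

counts, originals = build_pair_index("quick brown fox jumps over the lazy dog")
print(counts["quickbrown"], originals["lazydog"])
```

For the example sentence this yields seven keys, each with a count of one. Storing the split alongside the key is what makes it possible to recover the two keywords later.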

    Sort this list by frequency, and then attempt to find matches in your domain names, trying the most frequent pairs first.
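
The matching step could look something like the sketch below. It assumes the pair counts and original words are already available as a `Counter` and a dict (toy values are used here in place of real corpus statistics), and that the keywords sit in the first label of the domain. `find_pairs` is a hypothetical name.

```python
from collections import Counter

def find_pairs(domain, counts, originals):
    """Return (word1, word2, frequency) for known pairs found in the domain,
    most frequent pair first."""
    # Assumption: the keywords are in the first label, e.g. "lazydogtraining"
    label = domain.lower().split(".")[0]
    matches = []
    for key, freq in counts.most_common():  # sorted by descending frequency
        if key in label:
            first, second = originals[key]
            matches.append((first, second, freq))
    return matches

# Toy data standing in for counts gathered from a real corpus
counts = Counter({"lazydog": 12, "brownfox": 5, "overthe": 40})
originals = {
    "lazydog": ("lazy", "dog"),
    "brownfox": ("brown", "fox"),
    "overthe": ("over", "the"),
}
print(find_pairs("lazydogtraining.com", counts, originals))
```

A linear scan over every pair is fine for a sketch, but with a corpus-sized list you would probably want to look up substrings of the domain in the dictionary instead.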

    Lastly, do a domain check for the top two-word phrases which aren't registered!

    I think sites like DomainTool take a list of the highest-ranking words and try to parse those words out first. Depending on the purpose, you may want to consider using MTurk to do the job. Different people will parse the same words differently, and might not do so in proportion to how common the words are.
