Programmatically extract keywords from domain names

余生分开走 2021-02-01 11:32

Let's say I have a list of domain names that I would like to analyze. Unless the domain name is hyphenated, I don't see a particularly easy way to "extract" the keywords used in it.

7 Answers
  • 2021-02-01 11:33

    If you have a list of valid words, you can loop through your domain string and try to cut off a valid word each time with a backtracking algorithm. If you manage to consume the whole string this way, you are finished. Be aware that the time complexity of this is not optimal :)
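
    A minimal sketch of that idea in PHP, assuming a hypothetical $dictionary word list (swap in any real word list). It tries the shortest matching prefix first and backtracks whenever the remainder cannot be segmented:

    <?php
    // Backtracking segmentation sketch; $dictionary is a plain array of
    // valid words (a hypothetical stand-in for a real word list).
    function segment(string $s, array $dictionary, array $found = []) {
        if ($s === '') {
            return $found; // consumed the whole string: a valid segmentation
        }
        for ($len = 1; $len <= strlen($s); $len++) {
            $prefix = substr($s, 0, $len);
            if (in_array($prefix, $dictionary, true)) {
                // Tentatively accept this word; backtrack if the rest fails.
                $rest = segment((string) substr($s, $len), $dictionary,
                                array_merge($found, [$prefix]));
                if ($rest !== false) {
                    return $rest;
                }
            }
        }
        return false; // no valid word starts here: backtrack
    }

    print_r(segment('choosespain', ['choose', 'chooses', 'spain', 'pain']));
    // Array ( [0] => choose [1] => spain )

    In the worst case this explores every possible split, which is the non-optimal time complexity mentioned above.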

  • 2021-02-01 11:34

    choosespain.com
    kidsexpress.com
    childrenswear.com
    dicksonweb.com

    Have fun (and a good lawyer) if you are going to try to parse the URL with a dictionary.

    You might do better if you can find the same characters, but separated by white space, somewhere on the site itself.

    Other possibilities: extract data from the SSL certificate; query the top-level domain (TLD) name server; or use one of the "whois" tools or services (just google "whois").
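
    For the whois route, a rough sketch of a raw query over the WHOIS protocol (TCP port 43); whois.verisign-grs.com serves .com and .net, other TLDs use other servers:

    <?php
    // Raw WHOIS query sketch: connect to port 43, send the domain name
    // terminated by CRLF, and read the full response.
    function whoisLookup(string $domain, string $server = 'whois.verisign-grs.com') {
        $fp = fsockopen($server, 43, $errno, $errstr, 10);
        if (!$fp) {
            return false; // connection failed ($errstr holds the reason)
        }
        fwrite($fp, $domain . "\r\n");
        $response = '';
        while (!feof($fp)) {
            $response .= fgets($fp, 1024);
        }
        fclose($fp);
        return $response;
    }

    echo whoisLookup('choosespain.com');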

  • 2021-02-01 11:38

    OK, I ran the script I wrote for this SO question, with a couple of minor changes: using log probabilities to avoid underflow, and modifying it to read multiple files as the corpus.

    For my corpus I downloaded a bunch of files from Project Gutenberg; there was no real method to this, I just grabbed all the English-language files from etext00, etext01, and etext02.

    Below are the results; I kept the top three splits for each domain. A sketch of the scoring step follows the results.

    expertsexchange: 97 possibilities
     -  experts exchange -23.71
     -  expert sex change -31.46
     -  experts ex change -33.86
    
    penisland: 11 possibilities
     -  pen island -20.54
     -  penis land -22.64
     -  pen is land -25.06
    
    choosespain: 28 possibilities
     -  choose spain -21.17
     -  chooses pain -23.06
     -  choose spa in -29.41
    
    kidsexpress: 15 possibilities
     -  kids express -23.56
     -  kid sex press -32.65
     -  kids ex press -34.98
    
    childrenswear: 34 possibilities
     -  children swear -19.85
     -  childrens wear -25.26
     -  child ren swear -32.70
    
    dicksonweb: 8 possibilities
     -  dickson web -27.09
     -  dick son web -30.51
     -  dicks on web -33.63
    
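    The script itself isn't reproduced here, but the scoring idea is roughly this: enumerate every split whose parts all appear in the corpus, score each split as the sum of its words' log probabilities, and keep the best. The $logProb values below are made up for illustration, not my real corpus counts:

    <?php
    // Rank all candidate splits of $s by summed log probability.
    function rankedSplits(string $s, array $logProb): array {
        $results = [];
        $recurse = function (string $rest, array $words, float $score)
                use (&$recurse, &$results, $logProb) {
            if ($rest === '') {
                $results[] = [implode(' ', $words), $score];
                return;
            }
            for ($len = 1; $len <= strlen($rest); $len++) {
                $prefix = substr($rest, 0, $len);
                if (isset($logProb[$prefix])) {
                    $recurse((string) substr($rest, $len),
                             array_merge($words, [$prefix]),
                             $score + $logProb[$prefix]);
                }
            }
        };
        $recurse($s, [], 0.0);
        // Sort best (least negative log probability) first.
        usort($results, function ($a, $b) { return $b[1] <=> $a[1]; });
        return $results;
    }

    // Hypothetical log probabilities, just for illustration.
    $logProb = ['choose' => -9.2, 'chooses' => -12.1, 'spain' => -10.5, 'pain' => -9.8];
    foreach (array_slice(rankedSplits('choosespain', $logProb), 0, 3) as [$split, $score]) {
        printf("%-15s %.2f\n", $split, $score);
    }
    // choose spain    -19.70
    // chooses pain    -21.90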
  • 2021-02-01 11:42
    function getwords( $string ) {
        // Skip punycode (IDN) domains; pspell can't help with those.
        if( strpos( $string, "xn--" ) !== false ) {
            return false;
        }
        // Drop hyphens so hyphenated domains are checked as one string.
        $string = trim( str_replace( '-', '', $string ) );
        $pspell = pspell_new( 'en' );
        $check = array();
        $words = array();
        // Slide a window over the string and spell-check every substring
        // of length 4 or more starting at each position.
        for( $j = 0; $j < ( strlen( $string ) - 5 ); $j++ ) {
            for( $i = 4; $i < strlen( $string ); $i++ ) {
                if( pspell_check( $pspell, substr( $string, $j, $i ) ) ) {
                    $check[$j] = ( $check[$j] ?? 0 ) + 1; // avoid undefined-index notice
                    $words[] = substr( $string, $j, $i );
                }
            }
        }
        $words = array_unique( $words );
        if( count( $check ) > 0 ) {
            return $words;
        }
        return false;
    }

    print_r( getwords( 'ilikecheesehotels' ) );
    
    Array
    (
        [0] => like
        [1] => cheese
        [2] => hotel
        [3] => hotels
    )
    

    As a simple start with pspell. You might want to compare the results and see whether you got the stem of a word without the "s" at the end, and merge those entries.
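
    For instance, a naive merge that drops a word whenever its plural is also in the list (a hypothetical helper, just to illustrate):

    <?php
    $words = ['like', 'cheese', 'hotel', 'hotels'];
    // Keep a word only if the list does not also contain word + 's'.
    $merged = array_values(array_filter($words, function ($w) use ($words) {
        return !in_array($w . 's', $words, true);
    }));
    print_r($merged);
    // Array ( [0] => like [1] => cheese [2] => hotels )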

  • 2021-02-01 11:46

    You would have to run a dictionary engine against the domain entry to find valid words, and then run that dictionary engine against the result to ensure the result consists only of valid words.
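
    A sketch of that second pass, checking whether some combination of the candidate words found in the first pass tiles the whole string (the candidate list here is hypothetical; in practice it would come from something like getwords() above):

    <?php
    // Does some sequence of $candidates exactly cover $s from left to right?
    function tiles(string $s, array $candidates): bool {
        if ($s === '') {
            return true;
        }
        foreach ($candidates as $word) {
            if (strpos($s, $word) === 0
                    && tiles((string) substr($s, strlen($word)), $candidates)) {
                return true;
            }
        }
        return false;
    }

    var_dump(tiles('kidsexpress', ['kids', 'express', 'kid', 'sex', 'press']));
    // bool(true), e.g. "kids" + "express"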

  • 2021-02-01 11:48

    You need to develop a heuristic that will pull likely matches out of the domain. The way I would do it is to first find a large corpus of text. For example, you could download Wikipedia.

    Next, take your corpus and combine every two adjacent words. For example, if your sentence is:

    quick brown fox jumps over the lazy dog
    

    You'll create a list:

    quickbrown
    brownfox
    foxjumps
    jumpsover
    overthe
    thelazy
    lazydog
    

    Each of these would have a count of one. As you parse your corpus, you'll keep track of the frequency of every such two-word pair. Additionally, for each pair, you'll need to store what the original two words were.

    Sort this list by frequency, and then attempt to find matches in your domain based on these words.

    Lastly, do a domain check for the top two-word phrases which aren't registered!
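
    A sketch of the pair-building step, assuming $corpus is plain text (say, a Wikipedia dump already stripped of markup):

    <?php
    // Count every concatenated adjacent-word pair and remember its split.
    function buildPairCounts(string $corpus): array {
        $tokens = preg_split('/[^a-z]+/', strtolower($corpus), -1, PREG_SPLIT_NO_EMPTY);
        $pairs = [];
        for ($i = 0; $i < count($tokens) - 1; $i++) {
            $key = $tokens[$i] . $tokens[$i + 1]; // e.g. "quickbrown"
            if (!isset($pairs[$key])) {
                $pairs[$key] = ['split' => $tokens[$i] . ' ' . $tokens[$i + 1], 'count' => 0];
            }
            $pairs[$key]['count']++;
        }
        return $pairs;
    }

    $pairs = buildPairCounts('quick brown fox jumps over the lazy dog');
    $domain = 'quickbrown'; // domain name with the TLD stripped
    if (isset($pairs[$domain])) {
        echo $pairs[$domain]['split']; // "quick brown"
    }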

    I think sites like DomainTool take a list of the highest-ranking words and then try to parse those words out first. Depending on the purpose, you may want to consider using MTurk to do the job. Different people will parse the same string differently, and might not do so in proportion to how common the words are.
