Programmatically extract keywords from domain names

余生分开走 2021-02-01 11:32

Let's say I have a list of domain names that I would like to analyze. Unless the domain name is hyphenated, I don't see a particularly easy way to "extract" the keywords use

7 answers
  • 2021-02-01 11:33

    If you have a list of valid words, you can loop through your domain string and try to cut off a valid word each time, using a backtracking algorithm. If you manage to use up the whole string, you are finished. Be aware that the time complexity of this is not optimal: in the worst case the number of candidate splits can grow exponentially with the length of the name :)
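
    The backtracking idea can be sketched in Python as follows. This is only an illustration: the `WORDS` set is a hypothetical toy dictionary (a real run would load a full word list), and `segmentations` is a name made up for this example. Memoizing on the start position avoids re-solving the same suffix repeatedly.

```python
from functools import lru_cache

# Hypothetical toy dictionary; a real run would load a full word list.
WORDS = {"choose", "chooses", "spain", "pain", "spa", "in",
         "kids", "kid", "express", "ex", "press"}


def segmentations(name, words=WORDS):
    """Return every way to split `name` into dictionary words, as lists of words."""

    @lru_cache(maxsize=None)
    def split_from(start):
        # Base case: the whole string is consumed, so one (empty) split exists.
        if start == len(name):
            return ((),)
        results = []
        # Cut off each valid word that begins at `start`, then recurse on the rest.
        for end in range(start + 1, len(name) + 1):
            word = name[start:end]
            if word in words:
                for rest in split_from(end):
                    results.append((word,) + rest)
        return tuple(results)

    return [list(split) for split in split_from(0)]


for split in segmentations("choosespain"):
    print(" ".join(split))
```

    An empty result means the name cannot be covered by dictionary words at all. The memoization only saves repeated work; the list of valid splits itself can still be very long for some inputs.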

  • 2021-02-01 11:34

    choosespain.com kidsexpress.com childrenswear.com dicksonweb.com

    Have fun (and a good lawyer) if you are going to try to parse the url with a dictionary.

    You might do better if you can find the same characters but separated by white space on their web site.
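
    A rough Python sketch of that idea, assuming you already have the visible text of the site as a string. The function name `find_spaced_form` and the sample text are made up for illustration; fetching the page and stripping its HTML are left out.

```python
import re


def find_spaced_form(label, page_text):
    """Return the first run of two or more consecutive words in `page_text`
    whose letters, joined without spaces, spell `label`; None if there is none."""
    label = label.lower()
    tokens = re.findall(r"[a-z0-9]+", page_text.lower())
    for start in range(len(tokens)):
        joined = ""
        for end in range(start, len(tokens)):
            joined += tokens[end]
            # A single token equal to the label says nothing about word boundaries.
            if joined == label and end > start:
                return tokens[start:end + 1]
            # Stop extending once the run no longer matches the start of the label.
            if not label.startswith(joined):
                break
    return None


# Hypothetical page text; a real run would fetch and strip the site's HTML first.
sample = "Welcome to Kids Express, the home of children's travel."
print(find_spaced_form("kidsexpress", sample))
```

    This only helps when the site actually writes its own name out with spaces, so it is best treated as one signal among several.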

    Other possibilities: extract data from the site's SSL certificate, query the name server for the top-level domain (TLD), or use one of the "whois" tools or services (just google "whois").

  • 2021-02-01 11:38

    Ok, I ran the script I wrote for this SO question, with a couple of minor changes -- using log probabilities to avoid underflow, and modifying it to read multiple files as the corpus.

    For my corpus I downloaded a bunch of files from Project Gutenberg. There was no real method to this; I just grabbed all the English-language files from etext00, etext01, and etext02.

    Below are the results. I saved the top three segmentations for each domain, each followed by its log probability (closer to zero means more likely).

    expertsexchange: 97 possibilities
     -  experts exchange -23.71
     -  expert sex change -31.46
     -  experts ex change -33.86
    
    penisland: 11 possibilities
     -  pen island -20.54
     -  penis land -22.64
     -  pen is land -25.06
    
    choosespain: 28 possibilities
     -  choose spain -21.17
     -  chooses pain -23.06
     -  choose spa in -29.41
    
    kidsexpress: 15 possibilities
     -  kids express -23.56
     -  kid sex press -32.65
     -  kids ex press -34.98
    
    childrenswear: 34 possibilities
     -  children swear -19.85
     -  childrens wear -25.26
     -  child ren swear -32.70
    
    dicksonweb: 8 possibilities
     -  dickson web -27.09
     -  dick son web -30.51
     -  dicks on web -33.63
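
    The script itself is not shown in this answer, so the following Python sketch is only a guess at the general technique: it assumes a simple unigram model (each word scored independently by its corpus frequency) and sums log probabilities. `CORPUS` is a hypothetical toy text standing in for the Gutenberg files, and all function names are made up for this example.

```python
import math
from collections import Counter

# Hypothetical toy corpus; the answer above used Project Gutenberg texts instead.
CORPUS = (
    "we choose to visit spain in spring because people who choose spain "
    "love the sun and nobody chooses pain in a spa"
)
COUNTS = Counter(CORPUS.split())
TOTAL = sum(COUNTS.values())


def log_prob(word):
    """Log of the word's relative frequency in the corpus (unigram assumption)."""
    return math.log(COUNTS[word] / TOTAL)


def splits(name):
    """Yield every split of `name` into words that occur in the corpus."""
    if not name:
        yield []
        return
    for end in range(1, len(name) + 1):
        head = name[:end]
        if head in COUNTS:
            for tail in splits(name[end:]):
                yield [head] + tail


def top_splits(name, k=3):
    """Return the k most probable splits as (words, log probability) pairs."""
    scored = [(words, sum(log_prob(w) for w in words)) for words in splits(name)]
    # Adding logs replaces multiplying tiny probabilities, which could underflow.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


for words, score in top_splits("choosespain"):
    print(" ".join(words), round(score, 2))
```

    With a corpus this small the scores will not match the numbers listed above; only the ranking principle carries over. Splits with more words tend to score lower, because every extra word adds another negative term.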
    
  • 2021-02-01 11:42
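    // Return every dictionary word of at least $minLen letters found in $string,
    // or false if none is found (or if the name is punycode-encoded).
    function getwords( $string, $minLen = 4 ) {
        // Skip internationalized (punycode) names; an English dictionary cannot check them.
        if( strpos( $string, 'xn--' ) !== false ) {
            return false;
        }
        $string = trim( str_replace( '-', '', $string ) );
        $pspell = pspell_new( 'en' );
        $len    = strlen( $string );
        $words  = array();
        // Try every start offset $j and every substring length $i >= $minLen.
        // The bounds include words that end at the last character of the string.
        for( $j = 0; $j <= $len - $minLen; $j++ ) {
            for( $i = $minLen; $i <= $len - $j; $i++ ) {
                $candidate = substr( $string, $j, $i );
                if( pspell_check( $pspell, $candidate ) ) {
                    $words[] = $candidate;
                }
            }
        }
        // array_values() renumbers the keys after duplicates are dropped.
        $words = array_values( array_unique( $words ) );
        return count( $words ) > 0 ? $words : false;
    }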
    
    print_r( getwords( 'ilikecheesehotels' ) );
    
    Array
    (
        [0] => like
        [1] => cheese
        [2] => hotel
        [3] => hotels
    )
    

    This is a simple start using pspell. You might want to compare the results, check whether you got both a word and its stem (the same word without the "s" at the end), and merge them.
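
    The merging step could look like this in Python. This is a minimal sketch that only handles a plain trailing "s", and it assumes you want to keep the longer form because that is what appears in the domain; `merge_plurals` is a made-up name.

```python
def merge_plurals(words):
    """Drop a word when the same word plus a trailing "s" is also in the list."""
    present = set(words)
    return [w for w in words if w + "s" not in present]


print(merge_plurals(["like", "cheese", "hotel", "hotels"]))
```

    Irregular plurals and other suffixes would need a real stemmer rather than this string check.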

  • 2021-02-01 11:46

    You would have to run a dictionary engine against the domain entry to find valid words, and then run that dictionary engine against the result to ensure that it consists of valid words.

  • 2021-02-01 11:48

    You need to develop a heuristic that will get likely matches out of the domain. The way I would do it is first find a large corpus of text. For example, you could download Wikipedia.

    Next take your corpus, and combine every two adjacent words. For example, if your sentence is:

    quick brown fox jumps over the lazy dog
    

    You'll create a list:

    quickbrown
    brownfox
    foxjumps
    jumpsover
    overthe
    thelazy
    lazydog
    

    Each of these would have a count of one. As you parse your corpus, you'll keep track of the frequency of every pair of adjacent words. Additionally, for each pair, you'll need to store what the original two words were.

    Sort this list by frequency, and then attempt to find matches in your domain based on these words.
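
    The pair table described above might be built like this in Python. It is a sketch under simple assumptions: whitespace tokenization, lowercasing, and exact two-word matches only. `build_pair_table` and `lookup` are names invented for the example.

```python
from collections import Counter


def build_pair_table(corpus_text):
    """Map each concatenated adjacent word pair to a Counter of the
    original (first, second) word pairs that produced it."""
    words = corpus_text.lower().split()
    table = {}
    for first, second in zip(words, words[1:]):
        table.setdefault(first + second, Counter())[(first, second)] += 1
    return table


def lookup(table, label):
    """Return the (first, second) pairs that spell `label`, most frequent first."""
    pairs = table.get(label, Counter())
    return [pair for pair, _ in pairs.most_common()]


table = build_pair_table("quick brown fox jumps over the lazy dog")
print(sorted(table))
print(lookup(table, "lazydog"))
```

    Storing a Counter per key matters because two different word pairs can concatenate to the same string, in which case the frequencies decide which reading is more likely.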

    Lastly, do a domain check for the top two-word phrases which aren't registered!

    I think sites like DomainTool take a list of the highest-ranking words and try to parse those words out first. Depending on the purpose, you may want to consider using MTurk to do the job. Different people will parse the same words differently, and might not do so in proportion to how common the words are.
