Let's say I have a list of domain names that I would like to analyze. Unless the domain name is hyphenated, I don't see a particularly easy way to "extract" the keywords used.
Ok, I ran the script I wrote for this SO question, with a couple of minor changes -- using log probabilities to avoid underflow, and modifying it to read multiple files as the corpus.
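The scoring idea -- rank each candidate split by the sum of the log probabilities of its words, so multiplying tiny probabilities never underflows -- can be sketched roughly like this. The counts below are made-up stand-ins, not the actual Gutenberg tallies, and the smoothing penalty for unseen words is one simple choice among many:

```python
import math
from functools import lru_cache

# Hypothetical unigram counts standing in for a real corpus tally;
# in practice these come from counting words in the corpus files.
COUNTS = {"experts": 50, "expert": 120, "sex": 80, "change": 200,
          "exchange": 150, "ex": 30, "pen": 40, "island": 90,
          "penis": 10, "is": 5000, "land": 300}
TOTAL = sum(COUNTS.values())

def log_prob(word):
    # Log probability with a small smoothing floor for unseen words,
    # shrunk by word length so runs of junk letters score badly.
    count = COUNTS.get(word, 0.01 / len(word))
    return math.log(count / TOTAL)

@lru_cache(maxsize=None)
def segment(text):
    """Return (score, words) for the best-scoring split of `text`."""
    if not text:
        return 0.0, []
    candidates = []
    for i in range(1, len(text) + 1):
        head, tail = text[:i], text[i:]
        tail_score, tail_words = segment(tail)
        candidates.append((log_prob(head) + tail_score, [head] + tail_words))
    return max(candidates)
```

With these toy counts, `segment("expertsexchange")` picks `["experts", "exchange"]` over `["expert", "sex", "change"]`, since the frequent-word split accumulates a less negative log score. Memoizing on the remaining suffix keeps the recursion linear in distinct suffixes rather than exponential in splits.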
For my corpus I downloaded a bunch of files from Project Gutenberg -- no real method to this, I just grabbed all the English-language files from etext00, etext01, and etext02.
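Reading multiple files as the corpus amounts to tallying word frequencies across all of them. A minimal sketch, assuming the texts sit as `.txt` files in some directory (the path and the lowercase-word tokenizer are my assumptions, not the original script's):

```python
import collections
import pathlib
import re

def build_counts(corpus_dir):
    # Tally lowercase word frequencies across every .txt file in
    # corpus_dir (a hypothetical path to the downloaded texts).
    counts = collections.Counter()
    for path in pathlib.Path(corpus_dir).glob("*.txt"):
        text = path.read_text(encoding="utf-8", errors="ignore").lower()
        counts.update(re.findall(r"[a-z]+", text))
    return counts
```

The resulting `Counter` plays the role of the `COUNTS` table when scoring segmentations.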
Below are the results; I kept the top three segmentations for each domain.
expertsexchange: 97 possibilities
- experts exchange -23.71
- expert sex change -31.46
- experts ex change -33.86

penisland: 11 possibilities
- pen island -20.54
- penis land -22.64
- pen is land -25.06

choosespain: 28 possibilities
- choose spain -21.17
- chooses pain -23.06
- choose spa in -29.41

kidsexpress: 15 possibilities
- kids express -23.56
- kid sex press -32.65
- kids ex press -34.98

childrenswear: 34 possibilities
- children swear -19.85
- childrens wear -25.26
- child ren swear -32.70

dicksonweb: 8 possibilities
- dickson web -27.09
- dick son web -30.51
- dicks on web -33.63