I've already worked out this solution for myself with PHP, but I'm curious how it could be done differently, better even. The two languages I'm primarily interested in are PHP and JavaScript, but I'd be interested in seeing how quickly this could be done in any other major language as well (mostly C#, Java, etc.).
- Return only words with an occurrence greater than X
- Return only words with a length greater than Y
- Ignore common terms like "and, is, the, etc"
- Feel free to strip punctuation prior to processing (i.e., "John's" becomes "John")
- Return results in a collection/array
Extra Credit
- Keep quoted statements together (i.e., "They were 'too good to be true' apparently"), where 'too good to be true' would be the actual statement (see the sketch below)
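A minimal C# sketch of that extra-credit step (illustrative only; the regex and names are assumptions, not from the question): pull out the single-quoted runs before tokenizing, so each phrase survives as one token.

using System;
using System.Linq;
using System.Text.RegularExpressions;

class QuotedDemo
{
    static void Main()
    {
        var text = "They were 'too good to be true' apparently";

        // Capture single-quoted runs first so each phrase stays one token.
        var phrases = Regex.Matches(text, @"'([^']+)'")
                           .Cast<Match>()
                           .Select(m => m.Groups[1].Value);

        // Tokenize the remainder into plain words.
        var words = Regex.Split(Regex.Replace(text, @"'[^']+'", " "), @"\W+")
                         .Where(w => w.Length > 0);

        foreach (var token in phrases.Concat(words))
            Console.WriteLine(token);
    }
}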
Extra-Extra Credit
- Can your script determine words that should be kept together based upon their frequency of being found together, without knowing the words beforehand? Example:
*"The fruit fly is a great thing when it comes to medical research. Much study has been done on the fruit fly in the past, and has lead to many breakthroughs. In the future, the fruit fly will continue to be studied, but our methods may change."*
Clearly the phrase here is "fruit fly," which is easy for us to find. Can your search'n'scrape script determine this too? (A sketch of one naive approach follows.)
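None of the answers below attempt this, but one naive approach is to count adjacent word pairs and flag any pair that accounts for most occurrences of its rarer member. A hedged C# sketch (the threshold and all names here are illustrative assumptions):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

class CollocationDemo
{
    // Counts adjacent pairs; a pair that accounts for most occurrences of
    // both of its words (e.g. "fruit fly") is a collocation candidate.
    static IEnumerable<string> Collocations(string text, double threshold = 0.75)
    {
        var words = Regex.Split(text.ToLower(), @"\W+")
                         .Where(w => w.Length > 0)
                         .ToArray();
        var wordCount = words.GroupBy(w => w)
                             .ToDictionary(g => g.Key, g => g.Count());
        var pairCount = words.Zip(words.Skip(1), (a, b) => a + " " + b)
                             .GroupBy(p => p)
                             .ToDictionary(g => g.Key, g => g.Count());

        return pairCount
            .Where(p => p.Value > 1)
            .Where(p =>
            {
                var parts = p.Key.Split(' ');
                var rarer = Math.Min(wordCount[parts[0]], wordCount[parts[1]]);
                return (double)p.Value / rarer >= threshold;
            })
            .Select(p => p.Key);
    }
}

On the fruit-fly paragraph above, "fruit" and "fly" each occur three times and always together, so the pair's ratio is 1.0 and it gets flagged.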
Source text: http://sampsonresume.com/labs/c.txt
Answer Format
- It would be great to see the results of your code (its output), in addition to how long the operation took (see the Stopwatch fragment below).
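For the timing part, a small C# fragment using System.Diagnostics.Stopwatch works; GetKeywords is a hypothetical stand-in for whatever entry point your solution exposes.

using System;
using System.Diagnostics;

// Time a single run of the keyword extraction.
var sw = Stopwatch.StartNew();
var keywords = GetKeywords(sampleText, 3, 2);   // hypothetical entry point
sw.Stop();
Console.WriteLine("Elapsed: {0} ms", sw.ElapsedMilliseconds);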
GNU scripting
sed -e 's/ /\n/g' | grep -v '^ *$' | sort | uniq -c | sort -nr
Results:
7 be
6 to
[...]
1 2.
1 -
With occurrence greater than X:
sed -e 's/ /\n/g' | grep -v '^ *$' | sort | uniq -c | awk '$1>X'
Return only words with a length greater than Y (put Y+1 dots in the second grep):
sed -e 's/ /\n/g' | grep -v '^ *$' | grep .... | sort | uniq -c
Ignore common terms like "and, is, the, etc" (assuming the common terms are in the file 'ignored'):
sed -e 's/ /\n/g' | grep -v '^ *$' | grep -vf ignored | sort | uniq -c
Feel free to strip punctuation prior to processing (i.e., "John's" becomes "John"):
sed -e "s/[,.:\"']//g;s/ /\n/g" | grep -v '^ *$' | sort | uniq -c
Return results in a collection/array: it is already like an array for shell: first column is count, second is word.
Perl in only 43 characters.
perl -MYAML -anE'$_{$_}++for@F;say Dump\%_'
Here is an example of its use:
echo a a a b b c d e aa | perl -MYAML -anE'$_{$_}++for@F;say Dump \%_'
---
a: 3
aa: 1
b: 2
c: 1
d: 1
e: 1
If you need to list only the lowercase versions, it requires two more characters.
perl -MYAML -anE'$_{lc$_}++for@F;say Dump\%_'
Getting it to work on the specified text requires 58 characters.
curl http://sampsonresume.com/labs/c.txt | perl -MYAML -F'\W+' -anE'$_{lc$_}++for@F;END{say Dump\%_}'
real    0m0.679s
user    0m0.304s
sys     0m0.084s
Here is the last example expanded a bit.
#! perl
use 5.010;
use YAML;

while( my $line = <> ){
    for my $elem ( split '\W+', $line ){
        $_{ lc $elem }++;
    }
    END{
        say Dump \%_;
    }
}
F#: 304 chars
let f =
    let bad = Set.of_seq ["and";"is";"the";"of";"are";"by";"it"]
    fun length occurrence msg ->
        System.Text.RegularExpressions.Regex.Split(msg, @"[^\w-']+")
        |> Seq.countBy (fun a -> a)
        |> Seq.choose (fun (a, b) ->
            if a.Length > length && b > occurrence && (not <| bad.Contains a)
            then Some a
            else None)
Ruby
When "minified", this implementation becomes 165 characters long. It uses array#inject
to give a starting value (a Hash object with a default of 0) and then loop through the elements, which are then rolled into the hash; the result is then selected from the minimum frequency.
Note that I didn't count the size of the words to skip, that being an external constant. When the constant is counted too, the solution is 244 characters long.
Apostrophes and dashes aren't stripped, but kept as part of the word; they modify the word, so they can't simply be removed without also discarding the information beyond the symbol.
Implementation
CommonWords = %w(the a an but and is not or as of to in for by be may has can its it's)

def get_keywords(text, minFreq=0, minLen=2)
  text.scan(/(?:\b)[a-z'-]{#{minLen},}(?=\b)/i).
    inject(Hash.new(0)) do |result, w|
      w.downcase!
      result[w] += 1 unless CommonWords.include?(w)
      result
    end.select { |k, n| n >= minFreq }
end
Test Rig
require 'net/http'

keywords = get_keywords(Net::HTTP.get('www.sampsonresume.com', '/labs/c.txt'), 3)
keywords.sort.each { |name, count| puts "#{name} x #{count} times" }
Test Results
code x 4 times
declarations x 4 times
each x 3 times
execution x 3 times
expression x 4 times
function x 5 times
keywords x 3 times
language x 3 times
languages x 3 times
new x 3 times
operators x 4 times
programming x 3 times
statement x 7 times
statements x 4 times
such x 3 times
types x 3 times
variables x 3 times
which x 4 times
C# 3.0 (with LINQ)
Here's my solution. It makes use of some pretty nice features of LINQ/extension methods to keep the code short.
public static Dictionary<string, int> GetKeywords(string text, int minCount, int minLength)
{
    var commonWords = new string[] { "and", "is", "the", "as", "of", "to", "or", "in",
        "for", "by", "an", "be", "may", "has", "can", "its" };
    var words = Regex.Replace(text.ToLower(), @"[,.?\/;:\(\)]", string.Empty).Split(' ');
    var occurrences = words.Distinct().Except(commonWords)
        .Select(w => new { Word = w, Count = words.Count(s => s == w) });
    return occurrences.Where(wo => wo.Count >= minCount && wo.Word.Length >= minLength)
                      .ToDictionary(wo => wo.Word, wo => wo.Count);
}
This is, however, far from the most efficient method, being O(n^2) in the number of words rather than O(n), which I believe is optimal in this case. I'll see if I can create a slightly longer method that is more efficient; a sketch of one linear variant follows.
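Not the answerer's promised rewrite, but here is a minimal sketch of the linear variant, counting each word in a single pass with GroupBy instead of rescanning the whole array per distinct word (same usings as above assumed; GetKeywordsLinear is an illustrative name):

public static Dictionary<string, int> GetKeywordsLinear(string text, int minCount, int minLength)
{
    var commonWords = new HashSet<string> { "and", "is", "the", "as", "of", "to",
        "or", "in", "for", "by", "an", "be", "may", "has", "can", "its" };

    // GroupBy buckets every word in one pass, so counting is O(n) overall
    // instead of one full scan per distinct word.
    return Regex.Replace(text.ToLower(), @"[,.?\/;:\(\)]", string.Empty)
        .Split(' ')
        .Where(w => w.Length >= minLength && !commonWords.Contains(w))
        .GroupBy(w => w)
        .Where(g => g.Count() >= minCount)
        .ToDictionary(g => g.Key, g => g.Count());
}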
Here are the results of GetKeywords run on the sample text (min occurrences: 3, min length: 2).
3 x such
4 x code
4 x which
4 x declarations
5 x function
4 x statements
3 x new
3 x types
3 x keywords
7 x statement
3 x language
3 x expression
3 x execution
3 x programming
4 x operators
3 x variables
And my test program:
static void Main(string[] args)
{
    string sampleText;
    using (var client = new WebClient())
        sampleText = client.DownloadString("http://sampsonresume.com/labs/c.txt");

    var keywords = GetKeywords(sampleText, 3, 2);
    foreach (var entry in keywords)
        Console.WriteLine("{0} x {1}", entry.Value.ToString().PadLeft(3), entry.Key);
    Console.ReadKey(true);
}
#! perl
use strict;
use warnings;

my %words;
while (<>) {
    for my $word (split) {
        $words{$word}++;
    }
}

for my $word (keys %words) {
    print "$word occurred $words{$word} times.\n";
}
That's the simple form. If you want sorting, filtering, etc.:
my %words;
while (<>) {
    for my $word (split) {
        $words{$word}++;
    }
}

for my $word (keys %words) {
    if ((length($word) >= $MINLEN) && ($words{$word} >= $MIN_OCCURRENCE)) {
        print "$word occurred $words{$word} times.\n";
    }
}
You can also sort the output pretty easily:
...
my @output;
for my $word (keys %words) {
    if ((length($word) >= $MINLEN) && ($words{$word} >= $MIN_OCCURRENCE)) {
        push @output, "$word occurred $words{$word} times.\n";
    }
}

my $re = qr/occurred (\d+) /;
print sort {
    my ($na) = $a =~ $re;    # capture the count, not just the match result
    my ($nb) = $b =~ $re;
    $na <=> $nb;
} @output;
A true Perl hacker will easily get these on one or two lines each, but I went for readability.
Edit: this is how I would rewrite this last example:
...
for my $word ( sort { $words{$b} <=> $words{$a} } keys %words ){
    next unless length($word) >= $MINLEN;
    last unless $words{$word} >= $MIN_OCCURRENCE;  # counts are descending, so we can stop early
    print "$word occurred $words{$word} times.\n";
}
Or if I needed it to run faster I might even write it like this:
for my $word_data (
    sort {
        $a->[1] <=> $b->[1]   # numerical sort on count
    } grep {
        # remove values that are out of bounds
        length($_->[0]) >= $MINLEN &&   # word length
        $_->[1] >= $MIN_OCCURRENCE      # count
    } map {
        # [ word, count ]
        [ $_, $words{$_} ]
    } keys %words
){
    my( $word, $count ) = @$word_data;
    print "$word occurred $count times.\n";
}
It uses map for efficiency, grep to remove extra elements, and sort to do the sorting, of course (it does so in that order, reading the chain from bottom to top).
This is a slight variant of the Schwartzian transform.
Another Python solution, at 247 chars. The actual code is a single, highly dense line of Python (134 chars) that computes the whole thing in a single expression.
x=3;y=2;W="and is the as of to or in for by an be may has can its".split()
from itertools import groupby as gb
d=dict((w,l)for w,l in((w,len(list(g)))for w,g in gb(sorted(open("c.txt").read().lower().split()))) if l>x and len(w)>y and w not in W)
A much longer version with plenty of comments for your reading pleasure:
# Occurrence (x) and length (y) bounds.
x = 3
y = 2

# Common words string split into a list by spaces.
Words = "and is the as of to or in for by an be may has can its".split()

# A special function that groups similar strings in a list into
# (string, grouper) pairs. Grouper is a generator of occurrences (see below).
from itertools import groupby

# Reads the entire file, converts it to lower case and splits on whitespace
# to create a sorted list of words.
sortedWords = sorted(open("c.txt").read().lower().split())

# Using the groupby function, groups similar words together.
# Since grouper is a generator of occurrences we need len(list(grouper))
# to get the word count: first convert the generator to a list, then take
# the length of that list.
wordCounts = ((word, len(list(grouper))) for word, grouper in groupby(sortedWords))

# Filters the words by number of occurrences and common words using yet
# another generator expression.
filteredWordCounts = ((word, count) for word, count in wordCounts
                      if word not in Words and count > x and len(word) > y)

# Creates a dictionary from the filtered pairs.
result = dict(filteredWordCounts)
print(result)
The main trick here is using the itertools.groupby function to count the occurrences on a sorted list. I don't know if it really saves characters, but it does allow all the processing to happen in a single expression.
Results:
{'function': 4, 'operators': 4, 'declarations': 4, 'which': 4, 'statement': 5}
C# code:
IEnumerable<KeyValuePair<String, Int32>> ProcessText(String text, int X, int Y)
{
    // common words that will be ignored
    var exclude = new string[] { "and", "is", "the", "as", "of", "to", "or", "in",
        "for", "by", "an", "be", "may", "has", "can", "its" }.ToDictionary(word => word);

    // regular expression to find quoted text
    var regex = new Regex("\"[^\"]*\"", RegexOptions.Compiled);

    return
        // remove quoted text (it will be processed later)
        regex.Replace(text, "")
        // remove case dependency
        .ToLower()
        // split text by all these chars
        .Split(".,'\\/[]{}()`~@#$%^&*-=+?!;:<>| \n\r".ToCharArray())
        // add quoted text
        .Concat(regex.Matches(text).Cast<Match>().Select(match => match.Value))
        // group words by the word and count them
        .GroupBy(word => word, (word, words) => new KeyValuePair<String, Int32>(word, words.Count()))
        // apply filter (min word count and word length) and remove common words
        .Where(pair => pair.Value >= X && pair.Key.Length >= Y && !exclude.ContainsKey(pair.Key));
}
Output for ProcessText(text, 3, 2) call:
3 x languages
3 x such
4 x code
4 x which
3 x based
3 x each
4 x declarations
5 x function
4 x statements
3 x new
3 x types
3 x keywords
3 x variables
7 x statement
4 x expression
3 x execution
3 x programming
3 x operators
In C#:
- Use LINQ, specifically GroupBy, then filter by group count, and return a flattened (SelectMany) list.
- Use LINQ, filter by length.
- Use LINQ, filter with 'badwords'.Contains.
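A minimal sketch of those three steps (minCount, minLength, and badwords are illustrative names, not from the answer):

using System;
using System.Linq;

static class LinqSteps
{
    static void Demo(string[] words, int minCount, int minLength, string[] badwords)
    {
        var keywords = words
            .GroupBy(w => w)                      // group identical words
            .Where(g => g.Count() > minCount)     // filter by group count
            .SelectMany(g => g)                   // flatten back into a word list
            .Where(w => w.Length > minLength)     // filter by length
            .Where(w => !badwords.Contains(w));   // filter with badwords.Contains

        foreach (var word in keywords.Distinct())
            Console.WriteLine(word);
    }
}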
REBOL
Verbose, perhaps, so definitely not a winner, but gets the job done.
min-length: 0
min-count: 0

common-words: [
    "a" "an" "as" "and" "are" "by" "for" "from" "in" "is"
    "it" "its" "the" "of" "or" "to" "until"
]

add-word: func [
    word [string!]
    /local count letter non-letter temp rules match
][
    ; Strip out punctuation
    temp: copy {}
    letter: charset [ #"a" - #"z" #"A" - #"Z" #" " ]
    non-letter: complement letter
    rules: [
        some [
            copy match letter (append temp match)
            | non-letter
        ]
    ]
    parse/all word rules
    word: temp

    ; If we end up with nothing, bail
    if 0 == length? word [ exit ]

    ; Check length
    if min-length > length? word [ exit ]

    ; Ignore common words
    ignore: if find common-words word [ exit ]

    ; OK, it's good. Add it.
    either found? count: select words word [
        words/(word): count + 1
    ][
        repend words [word 1]
    ]
]

rules: [
    some [
        {"} copy word to {"} (add-word word) {"}
        | copy word to { } (add-word word) { }
    ]
    end
]

words: copy []
parse/all read %c.txt rules

result: copy []
foreach word words [
    if string? word [
        count: words/:word
        if count >= min-count [
            append result word
        ]
    ]
]

sort result
foreach word result [ print word ]
The output is:
act actions all allows also any appear arbitrary arguments assign assigned based be because been before below between braces branches break builtin but C C like any other language has its blemishes Some of the operators have the wrong precedence some parts of the syntax could be better call called calls can care case char code columnbased comma Comments common compiler conditional consisting contain contains continue control controlflow criticized Cs curly brackets declarations define definitions degree delimiters designated directly dowhile each effect effects either enclosed enclosing end entry enum evaluated evaluation evaluations even example executed execution exert expression expressionExpressions expressions familiarity file followed following format FORTRAN freeform function functions goto has high However identified ifelse imperative include including initialization innermost int integer interleaved Introduction iterative Kernighan keywords label language languages languagesAlthough leave limit lineEach loop looping many may mimicked modify more most name needed new next nonstructured normal object obtain occur often omitted on operands operator operators optimization order other perhaps permits points programmers programming provides rather reinitialization reliable requires reserve reserved restrictions results return Ritchie say scope Sections see selects semicolon separate sequence sequence point sequential several side single skip sometimes source specify statement statements storage struct Structured structuresAs such supported switch syntax testing textlinebased than There This turn type types union Unlike unspecified use used uses using usually value values variable variables variety which while whitespace widespread will within writing
Python (258 chars as-is, including 66 chars for the first line and 30 chars for punctuation removal):
W="and is the as of to or in for by an be may has can its".split() x=3;y=2;d={} for l in open('c.txt') : for w in l.lower().translate(None,',.;\'"!()[]{}').split() : if w not in W: d[w]=d.get(w,0)+1 for w,n in d.items() : if n>y and len(w)>x : print n,w
Output:
4 code
3 keywords
3 languages
3 execution
3 each
3 language
4 expression
4 statements
3 variables
7 statement
5 function
4 operators
4 declarations
3 programming
4 which
3 such
3 types
Here is my variant, in PHP:
$str = implode(file('c.txt'));

$splitters = '\s.,\(\)\{\};?:'; // string splitters
$array = preg_split(
    "/[" . $splitters . "]*\\\"([^\\\"]+)\\\"[" . $splitters . "]*|[" . $splitters . "]+/",
    $str, 0, PREG_SPLIT_DELIM_CAPTURE
);

foreach ($array as $key) {
    $res[$key] = isset($res[$key]) ? $res[$key] + 1 : 1;
}

unset($res['the']); unset($res['and']); unset($res['to']);
unset($res['of']);  unset($res['by']);  unset($res['a']);
unset($res['as']);  unset($res['is']);  unset($res['in']);
unset($res['']);

arsort($res);
//var_dump($res);

// concordance
foreach ($res as $word => $rarity) {
    echo $word . ' <b>x</b> ' . $rarity . '<br/>';
}

foreach ($array as $word) {
    // words longer than n (=5)
    // if (strlen($word) > 5) echo $word . '<br/>';
}
And output:
statement x 7 be x 7
C x 5 may x 5 for x 5 or x 5 The x 5 as x 5
expression x 4 statements x 4 code x 4 function x 4 which x 4 an x 4
declarations x 3 new x 3 execution x 3 types x 3 such x 3 variables x 3 can x 3 languages x 3 operators x 3
end x 2 programming x 2 evaluated x 2 functions x 2 definitions x 2 keywords x 2 followed x 2 contain x 2 several x 2 side x 2 most x 2 has x 2 its x 2 called x 2 specify x 2 reinitialization x 2 use x 2 either x 2 each x 2 all x 2 built-in x 2 source x 2 are x 2 storage x 2 than x 2
effects x 1 including x 1 arguments x 1 order x 1 even x 1 unspecified x 1 evaluations x 1 operands x 1 interleaved x 1 However x 1 value x 1 branches x 1 goto x 1 directly x 1 designated x 1 label x 1 non-structured x 1 also x 1 enclosing x 1 innermost x 1 loop x 1 skip x 1 There x 1 within x 1 switch x 1 Expressions x 1 integer x 1 variety x 1 see x 1 below x 1 will x 1 on x 1 selects x 1 case x 1 executed x 1 based x 1 calls x 1 from x 1 because x 1 many x 1 widespread x 1 familiarity x 1 C's x 1 mimicked x 1 Although x 1 reliable x 1 obtain x 1 results x 1 needed x 1 other x 1 syntax x 1 often x 1 Introduction x 1 say x 1 Programming x 1 Language x 1 C, like any other language, has its blemishes. Some of the operators have the wrong precedence; some parts of the syntax could be better. x 1 Ritchie x 1 Kernighan x 1 been x 1 criticized x 1 For x 1 example x 1 care x 1 more x 1 leave x 1 return x 1 call x 1 && x 1 || x 1 entry x 1 include x 1 next x 1 before x 1 sequence point x 1 sequence x 1 points x 1 comma x 1 operator x 1 but x 1 compiler x 1 requires x 1 programmers x 1 exert x 1 optimization x 1 object x 1 This x 1 permits x 1 high x 1 degree x 1 occur x 1 Structured x 1 using x 1 struct x 1 union x 1 enum x 1 define x 1 Declarations x 1 file x 1 contains x 1 Function x 1 turn x 1 assign x 1 perhaps x 1 Keywords x 1 char x 1 int x 1 Sections x 1 name x 1 variable x 1 reserve x 1 usually x 1 writing x 1 type x 1 Each x 1 line x 1 format x 1 rather x 1 column-based x 1 text-line-based x 1 whitespace x 1 arbitrary x 1 FORTRAN x 1 77 x 1 free-form x 1 allows x 1 restrictions x 1 Comments x 1 C99 x 1 following x 1 // x 1 until x 1 */ x 1 /* x 1 appear x 1 between x 1 delimiters x 1 enclosed x 1 braces x 1 supported x 1 if x 1 -else x 1 conditional x 1 Unlike x 1 reserved x 1 sequential x 1 provides x 1 control-flow x 1 identified x 1 do-while x 1 while x 1 any x 1 omitted x 1 break x 1 continue x 1 expressions x 1 testing x 1 iterative x 1 looping x 1 separate x 1 initialization x 1 normal x 1 modify x 1 control x 1 structures x 1 As x 1 imperative x 1 single x 1 act x 1 sometimes x 1 curly brackets x 1 limit x 1 scope x 1 language x 1 uses x 1 evaluation x 1 assigned x 1 values x 1 To x 1 effect x 1 semicolon x 1 actions x 1 common x 1 consisting x 1 used x 1
The var_dump statement simply displays the concordance. This variant preserves double-quoted expressions.
For the supplied file this code finishes in 0.047 seconds, though a larger file would consume lots of memory (because of the file() function, which reads the whole file into an array).
This is not going to win any golfing awards, but it does keep quoted phrases together and take stop words into account (leveraging the CPAN modules Lingua::StopWords and Text::ParseWords).
In addition, I use to_S from Lingua::EN::Inflect::Number to count only the singular forms of words.
You might also want to look at Lingua::CollinsParser.
#!/usr/bin/perl

use strict;
use warnings;

use Lingua::EN::Inflect::Number qw( to_S );
use Lingua::StopWords qw( getStopWords );
use Text::ParseWords;

my $stop = getStopWords('en');
my %words;

while ( my $line = <> ) {
    chomp $line;
    next unless $line =~ /\S/;
    next unless my @words = parse_line(' ', 1, $line);

    ++$words{to_S $_} for
        grep { length and not $stop->{$_} }
        map  { s!^[[:punct:]]+!!; s![[:punct:]]+\z!!; lc }
        @words;
}

print "=== only words appearing 4 or more times ===\n";
print "$_ : $words{$_}\n" for
    sort { $words{$b} <=> $words{$a} }
    grep { $words{$_} > 3 }
    keys %words;

print "=== only words that are 12 characters or longer ===\n";
print "$_ : $words{$_}\n" for
    sort { $words{$b} <=> $words{$a} }
    grep { 11 < length }
    keys %words;
Output:
=== only words appearing 4 or more times ===
statement : 11
function : 7
expression : 6
may : 5
code : 4
variable : 4
operator : 4
declaration : 4
c : 4
type : 4
=== only words that are 12 characters or longer ===
reinitialization : 2
control-flow : 1
sequence point : 1
optimization : 1
curly brackets : 1
text-line-based : 1
non-structured : 1
column-based : 1
initialization : 1
Source: https://stackoverflow.com/questions/1038252/code-golf-quickly-build-list-of-keywords-from-text-including-of-instances