I recently typed an essay for my lit class, and my teacher specifically stated a word limit that does not include quotations from the piece. And I thought, why not make a sc
Depending on the requirements, you could use The Greatest Regex Trick Ever:
"[^"]*"|(\w+)
And count the matches of the first capture group.
\w+ matches one or more word characters.
See test at regex101.com
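As a rough sketch of how the capture-group trick could be applied from a script (Python here, using only the standard re module; the function name count_unquoted_words is made up for illustration): quoted runs are matched by the first alternative and leave group 1 empty, so only matches that fill group 1 are counted.

```python
import re

# First alternative swallows a double-quoted run; second captures a word.
PATTERN = re.compile(r'"[^"]*"|(\w+)')

def count_unquoted_words(text: str) -> int:
    """Count matches where capture group 1 took part (words outside quotes)."""
    return sum(1 for m in PATTERN.finditer(text) if m.group(1) is not None)

if __name__ == "__main__":
    print(count_unquoted_words('He said "hello there world" and left.'))
```

Because \w+ does not include apostrophes, a contraction such as don't would count as two words with this pattern.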
Also skip single quoted strings:
"[^"]*"|'[^']*'|(\w+)
test at regex101
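A minimal sketch of the single-quote variant, again in Python with the standard re module (count_outside_quotes is a hypothetical name). It also shows a limitation worth checking against your own essay: an apostrophe inside a word is treated as an opening quote.

```python
import re

# Skip "..." and '...' runs; capture any other word in group 1.
PATTERN = re.compile(r'''"[^"]*"|'[^']*'|(\w+)''')

def count_outside_quotes(text: str) -> int:
    """Count words that are not inside double- or single-quoted runs."""
    return sum(1 for m in PATTERN.finditer(text) if m.group(1) is not None)

if __name__ == "__main__":
    print(count_outside_quotes('''She said 'not now' and "go away" twice'''))
    # Apostrophes in contractions are read as quote marks by this pattern.
    print(count_outside_quotes("It isn't what it's about"))
```

In the second call the text between the two apostrophes is skipped as if it were a quotation, so the result undercounts. If the essay contains contractions, the double-quote-only pattern may be the safer choice.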
A general solution would be pretty tough, since some works will have multi-paragraph quotes, where the first paragraph doesn't close the quote, but the second paragraph opens with a quotation mark. So matching quote marks document-wide would be hard.
On the other hand, you could maybe go paragraph-by-paragraph, and accumulate a non-quote word count for each paragraph. There would still be pathological cases that could break this (like a paragraph which includes a list of punctuation symbols, including a quotation mark), of course.
In Perl, assuming a getWordCount sub exists somewhere, and assuming you've somehow split your document into an array of paragraphs called @paragraphs, this might look like:
my $wordCount = 0;
foreach my $paragraph (@paragraphs) {
    $paragraph =~ s/".*?"//gs; # remove every quotation that has a matching closing quotation mark
    $paragraph =~ s/".*$//s;   # remove a quotation that runs to the end of the paragraph
    $wordCount += getWordCount($paragraph);
}
print "There are $wordCount words outside of quotations, maybe!";
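For comparison, the same paragraph-by-paragraph idea might look like this in Python. This is only a sketch under assumptions that are not in the original: paragraphs are taken to be separated by blank lines, and a simple [\w']+ match stands in for getWordCount.

```python
import re

def count_words_outside_quotes(document: str) -> int:
    """Accumulate a non-quote word count paragraph by paragraph."""
    total = 0
    # Assumption: a blank line separates paragraphs.
    for paragraph in re.split(r"\n\s*\n", document):
        # Remove quotations that are closed within the paragraph.
        paragraph = re.sub(r'".*?"', " ", paragraph, flags=re.DOTALL)
        # Remove a quotation left open until the end of the paragraph.
        paragraph = re.sub(r'".*\Z', " ", paragraph, flags=re.DOTALL)
        # Stand-in for getWordCount: runs of word characters and apostrophes.
        total += len(re.findall(r"[\w']+", paragraph))
    return total

if __name__ == "__main__":
    sample = (
        'She wrote "first paragraph of quote\n\n'
        '"second paragraph ends here" and then stopped.\n\n'
        "Plain text here."
    )
    print(count_words_outside_quotes(sample))
```

The sample has a quotation that opens in the first paragraph without closing and reopens in the second, which is the multi-paragraph case described above.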
This is easy enough using PCRE (or Perl of course):
".*?"(*SKIP)(?!)|(?<!\w)'.*?'(?!\w)(*SKIP)(?!)|[\w']+
Use the g modifier, and s if you want to handle multiline quotes.
Here's the x version for readability:
".*?" (*SKIP)(?!)
| (?<!\w)'.*?'(?!\w) (*SKIP)(?!)
| [\w']+
The first part will match everything inside " or ' quotes and will discard it ((*SKIP)(?!)). The second part will match all words (I've included ' as being part of a word in this example). The ' character will be counted as a quote boundary only at the start or end of a word, so contractions such as isn't still work.
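If PCRE is not available, the same matching rules can be approximated elsewhere. Python's built-in re module has no (*SKIP) verb, so the sketch below swaps in the capture-group approach instead: the quote alternatives are left uncaptured and only the word alternative is counted. The function name count_words is made up for illustration.

```python
import re

# Same three alternatives as the PCRE pattern, minus the (*SKIP)(?!) verbs.
# Only the last alternative captures, so quoted runs never fill group 1.
PATTERN = re.compile(
    r'''".*?"|(?<!\w)'.*?'(?!\w)|([\w']+)''',
    re.DOTALL,  # like the s modifier: lets quotes span line breaks
)

def count_words(text: str) -> int:
    """Count words outside quotes; ' inside a word is part of that word."""
    return sum(1 for m in PATTERN.finditer(text) if m.group(1) is not None)

if __name__ == "__main__":
    print(count_words('''I said "go away" but she isn't 'very happy' today'''))
```

Here isn't is kept as a single word because its apostrophe has word characters on both sides, while 'very happy' is skipped as a quotation.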
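Possible modifications: replace [\w']+ with \w+ if you don't want ' to count as part of a word, or replace [\w']+ with [-\w']+ if hyphenated words should count as a single word. You get the point ;)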
And here's a full Perl script that uses this regex:
#!/usr/bin/env perl
use strict;
use warnings;
$_ = do { local $/; <> };
print scalar (() = /".*?"(*SKIP)(?!)|(?<!\w)'.*?'(?!\w)(*SKIP)(?!)|[\w']+/gs), "\n";
Execute it passing in a file or STDIN containing the text you want to count the words in, and it will output the word count on STDOUT.
It would work better this way:
Total Number of characters - Sum(characters inside quotes)
You can use this regex to find all "Quoted" strings: \"[^"]*\"
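A small sketch of this subtraction idea in Python, using the regex above. The function names are hypothetical, and the word-based variant is an assumption added here because the original question is about a word limit, not a character limit.

```python
import re

QUOTED = re.compile(r'"[^"]*"')

def unquoted_char_count(text: str) -> int:
    """Total characters minus the characters inside quoted strings (quote marks included)."""
    return len(text) - sum(len(q) for q in QUOTED.findall(text))

def unquoted_word_count(text: str) -> int:
    """Same subtraction applied to words instead of characters."""
    total = len(re.findall(r"\w+", text))
    quoted = sum(len(re.findall(r"\w+", q)) for q in QUOTED.findall(text))
    return total - quoted

if __name__ == "__main__":
    sample = 'abc "de" f'
    print(unquoted_char_count(sample), unquoted_word_count(sample))
```

Note that the character count still includes the spaces around each removed quotation.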