Can regex match all the words outside quotation marks?


I recently typed an essay for my lit class, and my teacher specifically stated a word limit that does not include quotations from the piece. And I thought, why not make a sc

4 Answers

庸人自扰

2021-01-06 21:41
Depending on your requirements, you could use "The Greatest Regex Trick Ever":
```
"[^"]*"|(\w+)
```
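Then count only the matches in which the first capture group participated. Quoted strings are consumed by the first alternative and leave the group empty.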

\w+ matches one or more word characters.
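
As a minimal sketch of how the counting step might look in Python (the function name `count_unquoted_words` and the sample sentence are made up for illustration), the idea is to iterate over all matches and keep only those where group 1 is set:

```python
import re

# Quoted spans are consumed by the first alternative and leave group 1 empty;
# bare words are matched by the second alternative and fill group 1.
PATTERN = re.compile(r'"[^"]*"|(\w+)')


def count_unquoted_words(text: str) -> int:
    """Count runs of word characters that are not inside double quotes."""
    return sum(1 for m in PATTERN.finditer(text) if m.group(1) is not None)


if __name__ == "__main__":
    sample = 'He said "to be or not to be" and then left.'
    print(count_unquoted_words(sample))  # prints 5
```

Note that a quotation mark with no closing partner is simply skipped here, so the words after it are still counted.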

See test at regex101.com

To also skip single-quoted strings:
```
"[^"]*"|'[^']*'|(\w+)
```
test at regex101
名媛妹妹

2021-01-06 21:49
A general solution would be pretty tough, since some works will have multi-paragraph quotes, where the first paragraph doesn't close the quote, but the second paragraph opens with a quotation mark. So matching quote marks document-wide would be hard.

On the other hand, you could go paragraph by paragraph and accumulate a non-quote word count for each paragraph. There would still be pathological cases that could break this, of course (like a paragraph that includes a list of punctuation symbols, one of which is a quotation mark).

In Perl, assuming a getWordCount sub exists somewhere, and assuming you've somehow split your document into an array of paragraphs called @paragraphs, this might look like:
```
my $wordCount = 0;
foreach my $paragraph (@paragraphs) {
    $paragraph =~ s/".*?"//gs; # remove every quotation that is closed within the paragraph
    $paragraph =~ s/".*$//s;   # remove a quotation left open to the end of the paragraph
    $wordCount += getWordCount($paragraph);
}
print "There are $wordCount words outside of quotations, maybe!";
```
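
For comparison, here is a rough Python sketch of the same paragraph-by-paragraph idea. It assumes paragraphs are separated by blank lines, and `count_words` is a hypothetical stand-in for `getWordCount` that simply counts runs of word characters:

```python
import re


def count_words(text: str) -> int:
    # Hypothetical stand-in for getWordCount: counts runs of word characters.
    return len(re.findall(r"\w+", text))


def count_outside_quotes(document: str) -> int:
    total = 0
    # Assumption: paragraphs are separated by one or more blank lines.
    for paragraph in re.split(r"\n\s*\n", document):
        # Drop quotations that are closed within the paragraph.
        paragraph = re.sub(r'".*?"', "", paragraph, flags=re.DOTALL)
        # Drop a quotation left open until the end of the paragraph.
        paragraph = re.sub(r'".*\Z', "", paragraph, flags=re.DOTALL)
        total += count_words(paragraph)
    return total


if __name__ == "__main__":
    doc = 'She wrote "one two\n\n"three four" and stopped.\n\nFinal words here.'
    print(count_outside_quotes(doc))  # prints 7
```

The pathological cases mentioned above (a stray quotation mark used as a symbol, for example) would still throw the count off.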
离开以前

2021-01-06 22:00
This is easy enough using PCRE (or Perl of course):
```
".*?"(*SKIP)(?!)|(?<!\w)'.*?'(?!\w)(*SKIP)(?!)|[\w']+
```
Use the g modifier, and s if you want to handle multiline quotes.

Demo

Here's the x version for readability:
```
  ".*?"              (*SKIP)(?!)
| (?<!\w)'.*?'(?!\w) (*SKIP)(?!)
| [\w']+
```
The first two alternatives match everything inside " or ' quotes and discard it (that is what (*SKIP)(?!) does). The last alternative matches all words (I've included ' as part of a word in this example). The ' character is treated as a quote boundary only at the start or end of a word, so that words like isn't still work.

Possible modifications:
- To count the text isn't as two words, replace [\w']+ with \w+.
- To count text like mother-in-law as one word instead of 3, replace [\w']+ with [-\w']+.
You get the point ;)
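
If PCRE is not available: Python's standard `re` module does not support `(*SKIP)`, so the sketch below swaps in a different technique, a capture group around the word alternative, to approximate the same behaviour. The function name `count_words` and the sample text are only illustrative.

```python
import re

PATTERN = re.compile(
    r"""
      ".*?"                  # double-quoted span: matched, group 1 stays empty
    | (?<!\w)'.*?'(?!\w)     # single-quoted span, but not an apostrophe inside a word
    | ([\w']+)               # a word, apostrophes allowed: captured in group 1
    """,
    re.VERBOSE | re.DOTALL,  # DOTALL plays the role of the s modifier
)


def count_words(text: str) -> int:
    """Count words outside double- and single-quoted spans."""
    return sum(1 for m in PATTERN.finditer(text) if m.group(1) is not None)


if __name__ == "__main__":
    sample = "It isn't 'quoted text' or \"more text\" here."
    print(count_words(sample))  # prints 4
```

The same modifications apply: swap `[\w']+` for `\w+` or `[-\w']+` inside the capture group.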

And here's a full Perl script that uses this regex:
```
#!/usr/bin/env perl
use strict;
use warnings;

$_ = do { local $/; <> };
print scalar (() = /".*?"(*SKIP)(?!)|(?<!\w)'.*?'(?!\w)(*SKIP)(?!)|[\w']+/gs), "\n";
```
Execute it passing in a file or STDIN containing the text you want to count the words in, and it will output the word count on STDOUT.
醉酒成梦

2021-01-06 22:04

It would work better this way:

Total Number of characters - Sum(characters inside quotes)

You can use this regex to find all "Quoted" strings: \"[^"]*\"
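
A small Python sketch of this subtraction idea follows. The answer talks about characters; since the question is about a word limit, the sketch applies the same subtraction to word counts, which is an assumption on my part. The helper names are hypothetical.

```python
import re

# Same pattern as above: a double-quoted string with no quote inside it.
QUOTED = re.compile(r'"[^"]*"')


def count_words(text: str) -> int:
    # Assumption: a "word" is a run of word characters.
    return len(re.findall(r"\w+", text))


def count_by_subtraction(text: str) -> int:
    """Total word count minus the words found inside quoted strings."""
    total = count_words(text)
    quoted = sum(count_words(q) for q in QUOTED.findall(text))
    return total - quoted


if __name__ == "__main__":
    sample = 'He said "to be or not to be" and then left.'
    print(count_by_subtraction(sample))  # prints 5 (11 words total, 6 quoted)
```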
