Can regex match all the words outside quotation marks?

执念已碎 2021-01-06 21:25

I recently typed an essay for my lit class, and my teacher specifically stated a word limit that does not include quotations from the piece. And I thought, why not make a sc

4 Answers
  • 2021-01-06 21:41

    Depending on the requirements, you could use "The Greatest Regex Trick Ever":

    "[^"]*"|(\w+)
    

    Then count only the matches in which the first capture group participated. Quoted spans match the first alternative and leave the group empty.

    \w+ matches one or more word characters.

    See test at regex101.com
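As a rough illustration of counting the first capture group, here is a minimal Python sketch using the standard `re` module. The function name `count_unquoted_words` and the sample sentence are made up for this example.

```python
import re

# Quoted spans are matched by the first alternative and left uncaptured;
# only words outside quotes end up in capture group 1.
PATTERN = re.compile(r'"[^"]*"|(\w+)')

def count_unquoted_words(text: str) -> int:
    """Count the matches in which capture group 1 took part."""
    return sum(1 for m in PATTERN.finditer(text) if m.group(1) is not None)

sample = 'The narrator calls it "a tale of two cities" and moves on.'
print(count_unquoted_words(sample))  # prints 7
```

Matches of the quoted alternative still consume the text between the quotation marks, which is what keeps `\w+` from ever seeing those words.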


    Also skip single quoted strings:

    "[^"]*"|'[^']*'|(\w+)
    

    test at regex101

  • 2021-01-06 21:49

    A general solution would be pretty tough, since some works will have multi-paragraph quotes, where the first paragraph doesn't close the quote, but the second paragraph opens with a quotation mark. So matching quote marks document-wide would be hard.

    On the other hand, you could go paragraph by paragraph and accumulate a non-quote word count for each paragraph. There would still be pathological cases that could break this (like a paragraph which includes a list of punctuation symbols, including a quotation mark), of course.

    In Perl, assuming a getWordCount sub exists somewhere, and assuming you've somehow split your document into an array of paragraphs called @paragraphs, this might look like:

    my $wordCount = 0;
    foreach my $paragraph (@paragraphs) {
        $paragraph =~ s/".*?"//gs; # remove quoted passages that have a matching closing quotation mark
        $paragraph =~ s/".*$//s;   # remove an unclosed quotation that runs to the end of the paragraph
        $wordCount += getWordCount($paragraph);
    }
    print "There are $wordCount words outside of quotations, maybe!";
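The same paragraph-by-paragraph idea can be sketched in Python. This is only an illustration under two assumptions that are not in the answer: paragraphs are separated by blank lines, and a "word" is any run of `\w` characters (standing in for the unspecified getWordCount).

```python
import re

def count_words_outside_quotes(document: str) -> int:
    total = 0
    # Assumption: paragraphs are separated by one or more blank lines.
    for paragraph in re.split(r'\n\s*\n', document):
        # Drop quoted passages that are closed within the paragraph.
        paragraph = re.sub(r'".*?"', '', paragraph, flags=re.S)
        # Drop an unclosed quotation that runs to the end of the paragraph.
        paragraph = re.sub(r'".*', '', paragraph, flags=re.S)
        # Assumption: a word is a run of \w characters.
        total += len(re.findall(r'\w+', paragraph))
    return total

doc = ('She wrote: "First paragraph of the quote.\n\n'
       '"Second paragraph, now closed." Then I replied.')
print(count_words_outside_quotes(doc))  # prints 5
```

The first paragraph contributes "She wrote" and the second contributes "Then I replied", so the multi-paragraph quote is excluded even though its first paragraph never closes.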
    
  • 2021-01-06 22:00

    This is easy enough using PCRE (or Perl of course):

    ".*?"(*SKIP)(?!)|(?<!\w)'.*?'(?!\w)(*SKIP)(?!)|[\w']+
    

    Use the g modifier, and s if you want to handle multiline quotes.

    Demo

    Here's the x version for readability:

      ".*?"              (*SKIP)(?!)
    | (?<!\w)'.*?'(?!\w) (*SKIP)(?!)
    | [\w']+
    

    The first two alternatives match everything inside " or ' quotes and discard it with (*SKIP)(?!). The last alternative matches words (I've included ' as part of a word in this example). The ' character counts as a quote boundary only at the start or end of a word, so contractions such as isn't stay intact.

    Possible modifications:

    • To count the text isn't as two words, replace [\w']+ with \w+.
    • To count text like mother-in-law as one word instead of 3, replace [\w']+ with [-\w']+.

    You get the point ;)

    And here's a full Perl script that uses this regex:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    
    $_ = do { local $/; <> };
    print scalar (() = /".*?"(*SKIP)(?!)|(?<!\w)'.*?'(?!\w)(*SKIP)(?!)|[\w']+/gs), "\n";
    

    Execute it passing in a file or STDIN containing the text you want to count the words in, and it will output the word count on STDOUT.
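Python's built-in `re` module does not support backtracking verbs such as (*SKIP), so the script above cannot be ported literally. A possible workaround, swapped in here rather than taken from this answer, is to keep the same three alternatives and capture only the word branch. The names below are hypothetical.

```python
import re

# Same alternatives as the PCRE pattern, but instead of (*SKIP)(?!) the
# quoted branches are simply left uncaptured; only words land in group 1.
WORDS_OUTSIDE_QUOTES = re.compile(
    r'''
      ".*?"                  # double-quoted span: matched, not captured
    | (?<!\w)'.*?'(?!\w)     # single-quoted span bounded by non-word chars
    | ([\w']+)               # a word, apostrophes allowed (group 1)
    ''',
    re.S | re.X,
)

def count_words(text: str) -> int:
    return sum(1 for m in WORDS_OUTSIDE_QUOTES.finditer(text)
               if m.group(1) is not None)

sample = "He said 'it isn't fair' and \"left quickly\" today."
print(count_words(sample))  # prints 4
```

The apostrophe in isn't is followed by a word character, so it is not taken as the closing quote, and the single-quoted span runs to the apostrophe after fair.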

  • 2021-01-06 22:04

    It would work better this way:

    Total Number of characters - Sum(characters inside quotes)

    You can use this regex to find all "Quoted" strings: \"[^"]*\"
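A minimal Python sketch of this subtraction idea follows. The answer speaks of characters; since the question is about a word limit, the sketch also applies the same subtraction to words, which is an adaptation and not something the answer states. All function names are hypothetical.

```python
import re

QUOTED = re.compile(r'"[^"]*"')  # the regex from the answer, unescaped
WORD = re.compile(r'\w+')        # assumption: a word is a run of \w characters

def chars_outside_quotes(text: str) -> int:
    # Total characters minus the characters of every quoted string
    # (the quotation marks themselves are counted as quoted).
    return len(text) - sum(len(q) for q in QUOTED.findall(text))

def words_outside_quotes(text: str) -> int:
    # Same subtraction, applied to words instead of characters.
    total = len(WORD.findall(text))
    quoted = sum(len(WORD.findall(q)) for q in QUOTED.findall(text))
    return total - quoted

sample = 'Dickens opens with "It was the best of times" and I agree.'
print(words_outside_quotes(sample))  # prints 6
```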
