Regular Expression for accurate word-count using JavaScript

I'm trying to put together a regular expression for a JavaScript command that accurately counts the number of words in a textarea.

One solution I had found is as follows:

document.querySelector("#wordcount").innerHTML = document.querySelector("#editor").value.split(/\b\w+\b/).length -1;

But this doesn't count any non-Latin characters (e.g. Cyrillic or Hangul); it skips over them completely.

Another one I put together:

document.querySelector("#wordcount").innerHTML = document.querySelector("#editor").value.split(/\s+/g).length -1;

But this doesn't count accurately unless the document ends in a space character. If a space character is appended to the value being counted it counts 1 word even with an empty document. Furthermore, if the document begins with a space character an extraneous word is counted.
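To make these edge cases concrete, here is a minimal sketch that reproduces them. `naiveCount` is a hypothetical helper name used only for illustration; it mirrors the second snippet, with a plain string in place of the textarea value. The last two lines repeat the first snippet's pattern on Latin and Cyrillic input.

```javascript
// Hypothetical helper mirroring the split(/\s+/g).length - 1 snippet above.
function naiveCount(text) {
  return text.split(/\s+/g).length - 1;
}

console.log(naiveCount(''));          // 0
console.log(naiveCount(' '));         // 1 (splits into ['', ''])
console.log(naiveCount('one two'));   // 1 (no trailing space, one word short)
console.log(naiveCount('one two '));  // 2 (right only because of the trailing space)
console.log(naiveCount(' one two ')); // 3 (the leading space adds an empty piece)

// The first snippet's pattern: \w only covers ASCII letters, digits and underscore.
console.log('hello world'.split(/\b\w+\b/).length - 1); // 2
console.log('привет мир'.split(/\b\w+\b/).length - 1);  // 0
```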

Is there a regular expression I can put into this command that counts the words accurately, regardless of input method?

This should do what you're after:

(value.match(/\S+/g) || []).length; // match() returns null when there are no words

Rather than splitting the string, you're matching on any sequence of non-whitespace characters.

There's the added bonus of being easily able to extract each word if needed ;)
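As a rough sketch of that idea, the match can be wrapped in a small function that also copes with an empty textarea. `getWords` is a hypothetical name, not part of any library.

```javascript
// Hypothetical helper: returns every run of non-whitespace characters.
function getWords(text) {
  // String.prototype.match returns null when nothing matches,
  // so fall back to an empty array to keep .length safe.
  return text.match(/\S+/g) || [];
}

console.log(getWords('  Lorem ipsum  ').length); // 2
console.log(getWords('привет мир').length);      // 2
console.log(getWords('').length);                // 0
```

Because the function returns the words themselves, the count is just the array length, and the same array can be reused if the individual words are needed.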

Try to count anything that is not whitespace and with a word boundary:

value.split(/\b\S+\b/g).length

You could also try Unicode ranges, but I am not sure whether the following one is complete:

value.split(/[\u0080-\uFFFF\w]+/g).length
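An alternative to a hand-written range, assuming the engine supports ES2018 Unicode property escapes (the `u` flag), is to match letters, combining marks and digits by category. This is a different technique from the range above, and `countUnicodeWords` is a hypothetical name used for illustration.

```javascript
// Assumption: the runtime supports \p{...} property escapes (ES2018, "u" flag).
function countUnicodeWords(text) {
  // \p{L} = letters, \p{M} = combining marks, \p{N} = digits, in any script.
  const matches = text.match(/[\p{L}\p{M}\p{N}_]+/gu);
  return matches ? matches.length : 0;
}

console.log(countUnicodeWords('Lorem ipsum dolor , sit amet')); // 5
console.log(countUnicodeWords('привет мир'));                   // 2
console.log(countUnicodeWords('안녕하세요 세계'));                 // 2
console.log(countUnicodeWords(''));                             // 0
```

Note that this still relies on something separating the words, so text in scripts that do not put spaces between words would need a different approach.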

For me this gave the best results:

value.split(/\b\W+\b/).length

with

var words = value.split(/\b\W+\b/)

you get all words.

Explanation:

\b is a word boundary
\W is a non-word character; a capital letter usually means the negation
+ means one or more of the preceding character or character class
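Here is a small sketch of what that split produces on a few inputs. The variable names are made up for the example. Keep in mind that in JavaScript both \b and \W are defined in terms of the ASCII-only \w, which is why the Cyrillic sample is not split at all.

```javascript
// A stand-alone comma is swallowed together with the surrounding spaces.
const latinPieces = 'Lorem ipsum dolor , sit amet'.split(/\b\W+\b/);
console.log(latinPieces); // ['Lorem', 'ipsum', 'dolor', 'sit', 'amet']

// No ASCII word characters means no word boundaries, so nothing is split.
const cyrillicPieces = 'привет мир'.split(/\b\W+\b/);
console.log(cyrillicPieces.length); // 1

// An empty string still yields one (empty) piece.
const emptyPieces = ''.split(/\b\W+\b/);
console.log(emptyPieces.length); // 1
```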

I recommend learning regular expressions. It's a great skill to have because they are so powerful. ;-)

The correct regexp would be /\s+/, in order to discard non-words:

'Lorem ipsum dolor , sit amet'.split(/\S+/g).length
7
'Lorem ipsum dolor , sit amet'.split(/\s+/g).length
6

You could extend or change your methods like this, if you want to match things like email addresses as well:

document.querySelector("#wordcount").innerHTML = document.querySelector("#editor").value.split(/\b\(.*?)\b/).length -1;

and

document.querySelector("#wordcount").innerHTML = document.querySelector("#editor").value.trim().split(/\s+/g).filter(Boolean).length;

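Also try using \s: in JavaScript \w only covers ASCII letters, digits and the underscore, whereas \s matches Unicode whitespace as well, so splitting on \s works for non-Latin text.

Source: http://www.regular-expressions.info/charclass.html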
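The trim-then-split variant can be sketched as a stand-alone function. `countWordsTrimmed` is a hypothetical name, and the explicit empty-string check is an addition of this sketch: once the text is trimmed, the number of pieces already equals the number of words, except that an empty string still splits into one empty piece.

```javascript
// Hypothetical helper: trim first, then split on runs of whitespace.
function countWordsTrimmed(text) {
  const trimmed = text.trim();
  // ''.split(/\s+/) gives [''], so the empty case is handled separately.
  return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
}

console.log(countWordsTrimmed(' one two '));                    // 2
console.log(countWordsTrimmed('привет мир'));                   // 2
console.log(countWordsTrimmed('   '));                          // 0
console.log(countWordsTrimmed('Lorem ipsum dolor , sit amet')); // 6 (the lone comma counts)
```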

Try

    (value.match(/\w+/g) || []).length;

This will match a string of characters that can be in a word. Whereas something like:

    (value.match(/\S+/g) || []).length;

will result in an incorrect count if the user adds commas or other punctuation that is not followed by a space, or adds a comma with a space on either side of it.
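A quick side-by-side sketch of the two patterns on such input follows. `countBy` is a hypothetical helper for the comparison. The last two lines show the flip side of \w+ in JavaScript: it only knows ASCII letters, digits and the underscore, so apostrophes split words and non-Latin text is not counted.

```javascript
// Hypothetical helper: count matches of a global pattern, 0 if there are none.
function countBy(pattern, text) {
  return (text.match(pattern) || []).length;
}

console.log(countBy(/\S+/g, 'red,green'));   // 1 (no space after the comma)
console.log(countBy(/\w+/g, 'red,green'));   // 2
console.log(countBy(/\S+/g, 'red , green')); // 3 (the lone comma counts as a word)
console.log(countBy(/\w+/g, 'red , green')); // 2

console.log(countBy(/\w+/g, "it's"));        // 2 ("it" and "s")
console.log(countBy(/\w+/g, 'привет мир'));  // 0
```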

My simple JavaScript library, FuncJS, has a function called "count()" which does exactly what its name says: it counts words.

For example, say that you have a string full of words, you can simply place it in between the function brackets, like this:

count("How many words are in this string?");

and then call the function, which will then return the number of words. Also, this function is designed to ignore any amount of whitespace, thus giving an accurate result.

To learn more about this function, please read the documentation at http://docs.funcjs.webege.com/count().html; the download link for FuncJS is also on that page.

Hope this helps anyone wanting to do this! :)

If JavaScript understands the punctuation class [[:punct:]] and the lookahead assertion (?=),
then this should get all the words:

/[\s[:punct:]]*(\w(?:\w|[[:punct:]](?=[\w[:punct:]]))*)/

or, if you don't have the (?:) construct ...

/[\s[:punct:]]*(\w(\w|[[:punct:]](?=[\w[:punct:]]))*)/
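For what it's worth, JavaScript does support lookahead and the (?:) construct, but it has no POSIX classes such as [[:punct:]]. Below is a sketch of a JavaScript adaptation under the assumption that spelling out the ASCII punctuation ranges is an acceptable stand-in for [[:punct:]]. The names `PUNCT`, `wordPattern` and `extractWords` are made up for this example, and `matchAll` requires an ES2020 runtime.

```javascript
// Assumption: ASCII punctuation only (! to /, : to @, [ to `, { to ~).
const PUNCT = "!-\\/:-@\\[-\\x60{-~";

// Same shape as the pattern above: skip separators, then capture a word that
// may contain punctuation only when more word or punctuation characters follow.
const wordPattern = new RegExp(
  "[\\s" + PUNCT + "]*(\\w(?:\\w|[" + PUNCT + "](?=[\\w" + PUNCT + "]))*)",
  "g"
);

function extractWords(text) {
  // Each match's first capture group holds the word itself.
  return Array.from(text.matchAll(wordPattern), (m) => m[1]);
}

console.log(extractWords('report, that it\'s "scientifically" sound,'));
// ['report', 'that', "it's", 'scientifically', 'sound']
```

Like \w itself, this only recognises ASCII word characters, so it does not address the non-Latin part of the question.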

Using this in Perl would go like this:

# Extract and count the number of words
#
use strict;
use warnings;

my $text = q(
  I confirm that sufficient information and detail have been
  reported in this technical report, that it's "scientifically" sound,
  and that appropriate conclusion's have been included
);

my $regex = qr/ [\s[:punct:]]* (\w (?: \w | [[:punct:]](?=[\w[:punct:]]) )* ) /x;
my $wordcount = 0;

while ( $text =~ /$regex/g )
{
    print "$1\n";
    $wordcount++;
}

print "\n", '-'x20, "\nFound $wordcount words\n\n";

Output:

I
confirm
that
sufficient
information
and
detail
have
been
reported
in
this
technical
report
that
it's
scientifically
sound
and
that
appropriate
conclusion's
have
been
included

--------------------
Found 25 words

Source: https://stackoverflow.com/questions/4593565/regular-expression-for-accurate-word-count-using-javascript

Tags: javascript, regex, word-count