Fastest way to test for a minimum number of lines or tokens

Posted by 纵然是瞬间 on 2019-12-25 09:13:34

Question


I have twice now found myself wanting to know whether a JavaScript string has a minimum number of lines without wastefully splitting the string to find out. Both times it took excessive experimentation with regular expressions to realize that the solution was simple. This self-answered post is here to save me (and hopefully others) from having to figure this out again.

More generally, I want to efficiently determine whether any given string has at least a specified number of tokens. I don't need to know exactly how many tokens the string has. The token can be a character, a substring, a substring matching a regular expression, or a delimited unit such as a word or a line.

Another SO question explored whether it was faster to split a string or do a global regex match, in order to count the lines in the string. Splitting was reported to be faster, at least given ample memory. Our question here is, if we only need to know whether the number of tokens equals or exceeds a minimum, can we make testing against regular expressions faster than string splitting in the general case?

Here are some of the mistakes I made trying to match a minimum number of lines -- at least 42 lines in this case:

/(^[\n]*){42}/m.test(stringToTest)
/(\n[^\n]*|[^\n]*){42}/.test(stringToTest)
/(\n[^\n]*|[^\n]*(?!\n)){42}/.test(stringToTest)

These expressions are apparently happy to match nothing 42 times: each group can match the empty string, so all three return true for stringToTest = ''.
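
A quick check confirms this. The sketch below runs the three expressions against an empty string; the variable names `attempts` and `results` are only for illustration.

```javascript
// The three mistaken patterns from above, applied to an empty string.
var attempts = [
    /(^[\n]*){42}/m,
    /(\n[^\n]*|[^\n]*){42}/,
    /(\n[^\n]*|[^\n]*(?!\n)){42}/
];

var results = attempts.map(function (re) {
    return re.test('');
});

console.log(results); // every entry is true, although '' does not have 42 lines
```

A counted quantifier such as {42} accepts iterations that consume nothing, which is why a group that can match the empty string satisfies it trivially.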


Answer 1:


The solution is to test for a series of non-token/token units (a run of non-token characters followed by a token) rather than trying to test for the right count of delimited units or the right count of tokens. If the token is a delimiter and you want a minimum number of delimited units, require one fewer of these units than the required number of delimited units; at least 42 lines, for example, means at least 41 newlines. As we'll see, this solution also performs surprisingly well.

Minimum line count

This function checks for a minimum number of lines, where \n delimits lines rather than strictly ending them, allowing for an empty last line:

function hasMinLineCount(text, minLineCount) {
    if (minLineCount <= 1)
        return true; // always 1+ lines, though perhaps empty
    var r = new RegExp('([^\n]*\n){' + (minLineCount-1) + '}');
    return r.test(text);
}

Alternatively, \n can be assumed to end lines, rather than purely delimit them, making an exception for a non-empty last line. For example, "apple\npear\n" would be two lines, while "apple\npear\ngrape" would be three. The following function counts lines in this manner:

function hasMinLineCount(text, minLineCount) {
    var r = new RegExp('([^\n]*\n|[^\n]+$){' + minLineCount + '}');
    return r.test(text);
}
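
To make the difference between the two conventions concrete, here is a small side-by-side sketch. The names `hasMinLinesDelimited` and `hasMinLinesTerminated` are hypothetical and exist only to keep the two variants apart in one snippet; the regular expressions are the ones shown above.

```javascript
// "\n" delimits lines: N lines need N-1 newlines.
function hasMinLinesDelimited(text, minLineCount) {
    if (minLineCount <= 1)
        return true;
    return new RegExp('([^\n]*\n){' + (minLineCount - 1) + '}').test(text);
}

// "\n" ends lines, except that a non-empty last line still counts.
function hasMinLinesTerminated(text, minLineCount) {
    return new RegExp('([^\n]*\n|[^\n]+$){' + minLineCount + '}').test(text);
}

var trailing = "apple\npear\n";        // ends with a newline
var noTrailing = "apple\npear\ngrape"; // non-empty last line

console.log(hasMinLinesDelimited(trailing, 3));    // true: the empty last line counts
console.log(hasMinLinesTerminated(trailing, 3));   // false: only two lines are ended
console.log(hasMinLinesTerminated(noTrailing, 3)); // true: "grape" counts as a line
console.log(hasMinLinesTerminated(noTrailing, 4)); // false
```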

String delimiters and tokens

More generally, for any unit delimited by a string delimiter:

var _ = require('lodash');

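function hasMinUnitCount(text, minUnitCount, unitDelim) {
    if (minUnitCount <= 1)
        return true; // always 1+ units, though perhaps empty
    // escape the delimiter so it is matched literally
    var escDelim = _.escapeRegExp(unitDelim);
    // Note: '.' does not match line terminators, so the counted delimiters
    // must all fall on one line unless unitDelim itself contains the newline.
    // Use '[\\s\\S]*?' in place of '.*?' to count across lines.
    var r = new RegExp('(.*?'+ escDelim +'){' + (minUnitCount-1) + '}');
    return r.test(text);
}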

We can also test for the presence of a minimum number of string tokens:

var _ = require('lodash');

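function hasMinTokenCount(text, minTokenCount, token) {
    // escape the token so it is matched literally
    var escToken = _.escapeRegExp(token);
    // as above, '.*?' will not skip over a newline to reach the next token
    var r = new RegExp('(.*?'+ escToken +'){' + minTokenCount + '}');
    return r.test(text);
}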
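
One caveat: `.` does not match line terminators, so the `.*?` in these two functions cannot skip over a newline. A sketch of a variant that counts tokens across lines, assuming that is what you want, replaces `.*?` with `[\s\S]*?`. The helper `escapeRegExp` below is a hypothetical stand-in for `_.escapeRegExp` so that the snippet runs without lodash, and `hasMinTokenCountAcrossLines` is an illustrative name.

```javascript
// Hypothetical stand-in for _.escapeRegExp: backslash-escapes regex metacharacters.
function escapeRegExp(str) {
    return str.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Same idea as hasMinTokenCount, but [\s\S] matches any character, newlines included.
function hasMinTokenCountAcrossLines(text, minTokenCount, token) {
    var r = new RegExp(
        '(?:[\\s\\S]*?' + escapeRegExp(token) + '){' + minTokenCount + '}');
    return r.test(text);
}

var sample = "a,b\nc,d"; // two commas, one on each line

console.log(new RegExp('(.*?,){2}').test(sample));           // false: '.' stops at the newline
console.log(hasMinTokenCountAcrossLines(sample, 2, ','));    // true
console.log(hasMinTokenCountAcrossLines(sample, 3, ','));    // false
console.log(hasMinTokenCountAcrossLines("1+1+1", 2, '+'));   // true: '+' is escaped
```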

Regular expression delimiters and tokens

We can generalize further by allowing unit delimiters and tokens to include regular expression characters. Just make sure the delimiters or tokens can unambiguously occur back-to-back. Example regex delimiters include "<br */>" and "[|,]". These are strings, not RegExp objects.

function hasMinUnitCount(text, minUnitCount, unitDelimRegexStr) {
    if (minUnitCount <= 1)
        return true; // always 1+ units, though perhaps empty
    // (?:...) keeps a top-level | in the delimiter from splitting the group
    var r = new RegExp(
              '(.*?(?:'+ unitDelimRegexStr +')){' + (minUnitCount-1) + '}');
    return r.test(text);
}

function hasMinTokenCount(text, minTokenCount, tokenRegexStr) {
    // (?:...) keeps a top-level | in the token from splitting the group
    var r = new RegExp(
              '(.*?(?:'+ tokenRegexStr +')){' + minTokenCount + '}');
    return r.test(text);
}
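
Here is a brief usage sketch with the two example delimiters. The function is repeated so the snippet runs on its own, and it wraps the delimiter in `(?:...)` so that a delimiter containing a top-level `|` is still treated as one unit. The input strings are made up for illustration.

```javascript
// Repeated here so this snippet is self-contained.
function hasMinUnitCount(text, minUnitCount, unitDelimRegexStr) {
    if (minUnitCount <= 1)
        return true; // always 1+ units, though perhaps empty
    var r = new RegExp(
        '(.*?(?:' + unitDelimRegexStr + ')){' + (minUnitCount - 1) + '}');
    return r.test(text);
}

console.log(hasMinUnitCount("a|b,c", 3, "[|,]"));                     // true: a, b, c
console.log(hasMinUnitCount("a|b,c", 4, "[|,]"));                     // false
console.log(hasMinUnitCount("one<br/>two<br />three", 3, "<br */>")); // true
console.log(hasMinUnitCount("x;y|z", 3, ";|\\|"));                    // true: delimiter is ; or |

// Without the (?:...) wrapper, the alternation splits the group apart:
console.log(new RegExp('(.*?;|\\|){2}').test("x;y|z"));               // false
```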

Computational cost

The generic functions work because their regular expressions do non-greedy matching of characters (notice the .*?) up until the next delimiter or token. A lazy quantifier advances one character at a time and retries the delimiter at each position, backtracking whenever it fails, so these functions take a performance hit relative to more hardcoded expressions such as the [^\n]*\n found in hasMinLineCount() above, where the negated character class can consume a whole line in one run.
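
One related cost is worth a sketch. These expressions are unanchored, so when the text has too few lines a backtracking engine may retry the match from later start positions before giving up. For the newline-delimited case, anchoring at the start of the string should give the same answers, since a match found at a later position implies there are enough newlines counting from the beginning. This is an untested refinement rather than something measured in the benchmark below, and `hasMinLineCountAnchored` is a hypothetical name.

```javascript
// Variant of hasMinLineCount anchored with ^ so that a failed match need not
// be retried from other start positions. (?:...) skips the unused capture.
function hasMinLineCountAnchored(text, minLineCount) {
    if (minLineCount <= 1)
        return true;
    var r = new RegExp('^(?:[^\n]*\n){' + (minLineCount - 1) + '}');
    return r.test(text);
}

// Unanchored original, repeated for comparison.
function hasMinLineCount(text, minLineCount) {
    if (minLineCount <= 1)
        return true;
    var r = new RegExp('([^\n]*\n){' + (minLineCount - 1) + '}');
    return r.test(text);
}

// Check that both versions agree on a handful of small inputs.
var samples = ["", "a", "a\nb", "a\nb\n", "\n\n\n", "a\nb\nc\nd"];
var agree = samples.every(function (s) {
    return [1, 2, 3, 4, 5].every(function (n) {
        return hasMinLineCountAnchored(s, n) === hasMinLineCount(s, n);
    });
});

console.log(agree); // true for these samples
```

The same shortcut does not carry over to the `.*?` versions as written, because `.` cannot cross a newline and an anchored match would then be confined to the first line.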

Let's revisit the initial question of whether regex testing can outperform splitting a string. Recall that our only objective is to test for a minimum number of lines. I used benchmark.js to test, and I assumed that more than one line is required, so the minLineCount <= 1 shortcut never applies. Here is the code:

var Benchmark = require('benchmark');
var suite = new Benchmark.Suite;

var line = "Go faster faster faster!\n";
var text = line.repeat(100);
var MIN_LINE_COUNT = 50;
var preBuiltBackingRegex = new RegExp('(.*?\n){'+ MIN_LINE_COUNT +'}');
var preBuiltNoBackRegex = new RegExp('([^\n]*\n){'+ MIN_LINE_COUNT +'}');

suite.add('split string', function() {
    if (text.split("\n").length >= MIN_LINE_COUNT)
        'has minimum lines';
})
.add('backtracking on-the-fly regex', function() {
    if (new RegExp('(.*?\n){'+ MIN_LINE_COUNT +'}').test(text))
        'has minimum lines';
})
.add('backtracking pre-built regex', function() {
    if (preBuiltBackingRegex.test(text))
        'has minimum lines';
})
.add('no-backtrack on-the-fly regex', function() {
    if (new RegExp('([^\n]*\n){'+ MIN_LINE_COUNT +'}').test(text))
        'has minimum lines';
})
.add('no-backtrack pre-built regex', function() {
    if (preBuiltNoBackRegex.test(text))
        'has minimum lines';
})
.on('cycle', function(event) {
    console.log(String(event.target));
})
.on('complete', function() {
    console.log('Fastest is ' + this.filter('fastest').map('name'));
})
.run({ 'async': true });
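
Before reading the timings, it may help to confirm that the three kinds of test agree on this input. The sketch below reuses the benchmark's values; the `splitLength` and `via...` variable names are illustrative.

```javascript
var line = "Go faster faster faster!\n";
var text = line.repeat(100);
var MIN_LINE_COUNT = 50;

// split yields one more element than there are newlines (the empty tail).
var splitLength = text.split("\n").length;

var viaSplit = splitLength >= MIN_LINE_COUNT;
var viaLazy = new RegExp('(.*?\n){' + MIN_LINE_COUNT + '}').test(text);
var viaNegated = new RegExp('([^\n]*\n){' + MIN_LINE_COUNT + '}').test(text);

console.log(splitLength);                   // 101
console.log(viaSplit, viaLazy, viaNegated); // true true true
```

Note that the benchmark regexes require 50 newlines while the split test compares 101 elements against 50, so the two are one apart at the boundary. This text is well above the minimum, so the difference does not change the outcome. The pre-built regexes carry no g or y flag, so reusing them with test() does not depend on lastIndex.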

Here are the results of three runs:

split string x 263,260 ops/sec ±0.68% (85 runs sampled)
backtracking on-the-fly regex x 492,671 ops/sec ±1.01% (82 runs sampled)
backtracking pre-built regex x 607,033 ops/sec ±0.72% (87 runs sampled)
no-backtrack on-the-fly regex x 581,681 ops/sec ±0.77% (84 runs sampled)
no-backtrack pre-built regex x 723,075 ops/sec ±0.72% (89 runs sampled)
Fastest is no-backtrack pre-built regex

split string x 260,962 ops/sec ±0.82% (85 runs sampled)
backtracking on-the-fly regex x 502,410 ops/sec ±0.79% (84 runs sampled)
backtracking pre-built regex x 606,220 ops/sec ±0.67% (88 runs sampled)
no-backtrack on-the-fly regex x 578,193 ops/sec ±0.83% (86 runs sampled)
no-backtrack pre-built regex x 741,864 ops/sec ±0.68% (84 runs sampled)
Fastest is no-backtrack pre-built regex

split string x 262,266 ops/sec ±0.76% (87 runs sampled)
backtracking on-the-fly regex x 495,697 ops/sec ±0.82% (87 runs sampled)
backtracking pre-built regex x 608,178 ops/sec ±0.72% (88 runs sampled)
no-backtrack on-the-fly regex x 574,640 ops/sec ±0.92% (87 runs sampled)
no-backtrack pre-built regex x 739,629 ops/sec ±0.72% (86 runs sampled)
Fastest is no-backtrack pre-built regex

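In this benchmark, all of the regex tests are clearly faster than splitting the string to check the line count, even the backtracking ones: they run roughly 1.9 to 2.8 times as many operations per second as the split. Pre-building the regex and avoiding the lazy quantifier each help further. I think I'll do the regex testing.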



Source: https://stackoverflow.com/questions/39554154/fastest-way-to-test-for-a-minimum-number-of-lines-or-tokens
