Get index of each capture in a JavaScript regex

I want to match a regex like /(a).(b)(c.)d/ with "aabccde", and get the following information back:

"a" at index = 0
"b" at index = 2
"cc" at index = 3

How can I do this? String.match returns list of matches and index of the start of the complete match, not index of every capture.

Edit: A test case which wouldn't work with plain indexOf

regex: /(a).(.)/
string: "aaa"
expected result: "a" at 0, "a" at 2

Note: The question is similar to Javascript Regex: How to find index of each subexpression?, but I cannot modify the regex to make every subexpression a capturing group.

So, you have a text and a regular expression:

txt = "aabccde";
re = /(a).(b)(c.)d/;

The first step is to get the list of all substrings that match the regular expression:

subs = re.exec(txt);

Then, you can do a simple search on the text for each substring. You will have to keep in a variable the position of the last substring. I've named this variable cursor.

var cursor = subs.index;
for (var i = 1; i < subs.length; i++){
    sub = subs[i];
    index = txt.indexOf(sub, cursor);
    cursor = index + sub.length;


    console.log(sub + ' at index ' + index);
}

EDIT: Thanks to @nhahtdh, I've improved the mecanism and made a complete function:

String.prototype.matchIndex = function(re){
    var res  = [];
    var subs = this.match(re);

    for (var cursor = subs.index, l = subs.length, i = 1; i < l; i++){
        var index = cursor;

        if (i+1 !== l && subs[i] !== subs[i+1]) {
            nextIndex = this.indexOf(subs[i+1], cursor);
            while (true) {
                currentIndex = this.indexOf(subs[i], index);
                if (currentIndex !== -1 && currentIndex <= nextIndex)
                    index = currentIndex + 1;
                else
                    break;
            }
            index--;
        } else {
            index = this.indexOf(subs[i], cursor);
        }
        cursor = index + subs[i].length;

        res.push([subs[i], index]);
    }
    return res;
}


console.log("aabccde".matchIndex(/(a).(b)(c.)d/));
// [ [ 'a', 1 ], [ 'b', 2 ], [ 'cc', 3 ] ]

console.log("aaa".matchIndex(/(a).(.)/));
// [ [ 'a', 0 ], [ 'a', 1 ] ] <-- problem here

console.log("bababaaaaa".matchIndex(/(ba)+.(a*)/));
// [ [ 'ba', 4 ], [ 'aaa', 6 ] ]

Delus

I wrote MultiRegExp for this a while ago. As long as you don't have nested capture groups, it should do the trick. It works by inserting capture groups between those in your RegExp and using all the intermediate groups to calculate the requested group positions.

var exp = new MultiRegExp(/(a).(b)(c.)d/);
exp.exec("aabccde");

should return

{0: {index:0, text:'a'}, 1: {index:2, text:'b'}, 2: {index:3, text:'cc'}}

Live Version

I created a little regexp Parser which is also able to parse nested groups like a charm. It's small but huge. No really. Like Donalds hands. I would be really happy if someone could test it, so it will be battle tested. It can be found at: https://github.com/valorize/MultiRegExp2

Usage:

let regex = /a(?: )bc(def(ghi)xyz)/g;
let regex2 = new MultiRegExp2(regex);

let matches = regex2.execForAllGroups('ababa bcdefghixyzXXXX'));

Will output:
[ { match: 'defghixyz', start: 8, end: 17 },
  { match: 'ghi', start: 11, end: 14 } ]

Based on the ecma regular expression syntax I've written a parser respective an extension of the RegExp class which solves besides this problem (full indexed exec method) as well other limitations of the JavaScript RegExp implementation for example: Group based search & replace. You can test and download the implementation here (is as well available as NPM module).

The implementation works as follows (small example):

//Retrieve content and position of: opening-, closing tags and body content for: non-nested html-tags.
var pattern = '(<([^ >]+)[^>]*>)([^<]*)(<\\/\\2>)';
var str = '<html><code class="html plain">first</code><div class="content">second</div></html>';
var regex = new Regex(pattern, 'g');
var result = regex.exec(str);

console.log(5 === result.length);
console.log('<code class="html plain">first</code>'=== result[0]);
console.log('<code class="html plain">'=== result[1]);
console.log('first'=== result[3]);
console.log('</code>'=== result[4]);
console.log(5=== result.index.length);
console.log(6=== result.index[0]);
console.log(6=== result.index[1]);
console.log(31=== result.index[3]);
console.log(36=== result.index[4]);

I tried as well the implementation from @velop but the implementation seems buggy for example it does not handle backreferences correctly e.g. "/a(?: )bc(def(\1ghi)xyz)/g" - when adding paranthesis in front then the backreference \1 needs to be incremented accordingly (which is not the case in his implementation).

With RegExp.prototype.exec() and searching the properly indexes of the result:

let regex1 = /([a-z]+):([0-9]+)/g;
let str1 = 'hello:123';
let array1;
let resultArray = []

while ((array1 = regex1.exec(str1)) !== null) {
  const quantityFound = (Object.keys(array1).length - 3); // 3 default keys
  for (var i = 1; i<quantityFound; i++) { // start in 1 to avoid the complete found result 'hello:123'
    const found = array1[i];
    arraySingleResult = [found, str1.indexOf(found)];
    resultArray.push(arraySingleResult);
  }
}
console.log('result:', JSON.stringify(resultArray));

I'm not exactly sure exactly what your requirements are for your search, but here's how you could get the desired output in your first example using Regex.exec() and a while-loop.

JavaScript

var myRe = /^a|b|c./g;
var str = "aabccde";
var myArray;
while ((myArray = myRe.exec(str)) !== null)
{
  var msg = '"' + myArray[0] + '" ';
  msg += "at index = " + (myRe.lastIndex - myArray[0].length);
  console.log(msg);
}

Output

"a" at index = 0
"b" at index = 2
"cc" at index = 3

Using the lastIndex property, you can subtract the length of the currently matched string to obtain the starting index.

来源：https://stackoverflow.com/questions/15934353/get-index-of-each-capture-in-a-javascript-regex

标签

javascript

regex

capturing-group