Get index of each capture in a JavaScript regex

一个人的身影 2020-11-29 04:43

I want to match a regex like /(a).(b)(c.)d/ with "aabccde", and get the following information back:

"a" at index = 0
"b" at i


        
7 Answers
  • 2020-11-29 05:00

    I'm not sure exactly what your requirements are for your search, but here's how you could get the desired output in your first example using RegExp.prototype.exec() and a while loop.

    JavaScript

    var myRe = /^a|b|c./g;
    var str = "aabccde";
    var myArray;
    while ((myArray = myRe.exec(str)) !== null)
    {
      var msg = '"' + myArray[0] + '" ';
      msg += "at index = " + (myRe.lastIndex - myArray[0].length);
      console.log(msg);
    }
    

    Output

    "a" at index = 0
    "b" at index = 2
    "cc" at index = 3
    

    Using the lastIndex property, you can subtract the length of the currently matched string to obtain the starting index.
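If only the start of each whole match is needed, the subtraction can be skipped, because every match object also carries an `index` property. A minimal sketch, assuming an environment that supports String.prototype.matchAll (ES2020); the `found` array is just an illustrative name:

```javascript
const str = "aabccde";
const found = [];

// matchAll requires the g flag and yields one match object per hit;
// each object exposes the start of the whole match as .index
for (const m of str.matchAll(/^a|b|c./g)) {
  found.push('"' + m[0] + '" at index = ' + m.index);
}

console.log(found.join("\n"));
```

Note that this reports whole matches of the alternation, not the positions of capture groups inside a single match.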

  • 2020-11-29 05:04

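    There is a proposal to implement this in native JavaScript. It was at stage 3 when this answer was written and has since been standardized in ES2022, where it is enabled by the regex d flag: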

    RegExp Match Indices for ECMAScript

    ECMAScript RegExp Match Indices provide additional information about the start and end indices of captured substrings relative to the start of the input string.

    ...We propose the adoption of an additional indices property on the array result (the substrings array) of RegExp.prototype.exec(). This property would itself be an indices array containing a pair of start and end indices for each captured substring. Any unmatched capture groups would be undefined, similar to their corresponding element in the substrings array. In addition, the indices array would itself have a groups property containing the start and end indices for each named capture group.

    Here's an example of how things would work:

    // the d flag (hasIndices) is what makes exec() populate the indices property
    const re1 = /a+(?<Z>z)?/d;
    
    // indices are relative to start of the input string:
    const s1 = "xaaaz";
    const m1 = re1.exec(s1);
    m1.indices[0][0] === 1;
    m1.indices[0][1] === 5;
    s1.slice(...m1.indices[0]) === "aaaz";
    
    m1.indices[1][0] === 4;
    m1.indices[1][1] === 5;
    s1.slice(...m1.indices[1]) === "z";
    
    m1.indices.groups["Z"][0] === 4;
    m1.indices.groups["Z"][1] === 5;
    s1.slice(...m1.indices.groups["Z"]) === "z";
    
    // capture groups that are not matched return `undefined`:
    const m2 = re1.exec("xaaay");
    m2.indices[1] === undefined;
    m2.indices.groups["Z"] === undefined;
    

    So, for the code in the question, we could do:

    const re = /(a).(b)(c.)d/d; // the d flag populates result.indices
    const str = 'aabccde';
    const result = re.exec(str);
    // indices[0], like result[0], describes the indices of the full match
    const matchStart = result.indices[0][0];
    result.forEach((matchedStr, i) => {
      const [startIndex, endIndex] = result.indices[i];
      console.log(`${matchedStr} from index ${startIndex} to ${endIndex} in the original string`);
      console.log(`From index ${startIndex - matchStart} to ${endIndex - matchStart} relative to the match start\n-----`);
    });
    

    Output:

    aabccd from index 0 to 6 in the original string
    From index 0 to 6 relative to the match start
    -----
    a from index 0 to 1 in the original string
    From index 0 to 1 relative to the match start
    -----
    b from index 2 to 3 in the original string
    From index 2 to 3 relative to the match start
    -----
    cc from index 3 to 5 in the original string
    From index 3 to 5 relative to the match start
    -----
    

    Keep in mind that the indices array contains the indices of the matched groups relative to the start of the string, not relative to the start of the match.


    At the time of writing, the proposal was at stage 3, which indicates that the specification text is complete and everyone in TC39 who needs to approve it has done so. All that remains at that stage is for environments to start shipping it so that final tests can be done, after which it is put into the official standard.
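The same idea can be wrapped in a small reusable function. This is only a sketch under assumptions: `captureIndices` is a hypothetical helper name (not part of the proposal), it needs a runtime that supports the `d` flag, and it reports unmatched groups with a start and end of -1 purely as a convention chosen here.

```javascript
// Hypothetical helper: returns text, start and end for each capture group
// of the first match, or an empty array when nothing matches.
function captureIndices(re, str) {
  // Rebuild the regex with the d flag so exec() populates .indices
  const flags = re.flags.includes("d") ? re.flags : re.flags + "d";
  const m = new RegExp(re.source, flags).exec(str);
  if (m === null) return [];

  // m[0] is the whole match, so capture groups start at position 1
  return m.slice(1).map((text, i) => {
    const span = m.indices[i + 1]; // undefined for a group that did not take part
    return span === undefined
      ? { text, start: -1, end: -1 }
      : { text, start: span[0], end: span[1] };
  });
}

const groups = captureIndices(/(a).(b)(c.)d/, "aabccde");
console.log(groups);
```

As in the native API, the end index is exclusive, so `str.slice(start, end)` gives back the captured text.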

  • 2020-11-29 05:04

    So, you have a text and a regular expression:

    var txt = "aabccde";
    var re = /(a).(b)(c.)d/;
    

    The first step is to get the list of all substrings that match the regular expression:

    var subs = re.exec(txt);
    

    Then, you can do a simple search on the text for each substring. You will have to keep in a variable the position of the last substring. I've named this variable cursor.

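    var cursor = subs.index; // start searching where the whole match begins
    for (var i = 1; i < subs.length; i++){
        var sub = subs[i];
        // first occurrence of the captured text at or after the cursor
        var index = txt.indexOf(sub, cursor);
        cursor = index + sub.length;

        console.log(sub + ' at index ' + index);
    }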
    

    EDIT: Thanks to @nhahtdh, I've improved the mechanism and made a complete function:

    String.prototype.matchIndex = function(re){
        var res  = [];
        var subs = this.match(re);
        if (subs === null) return res; // no match: nothing to report
    
        for (var cursor = subs.index, l = subs.length, i = 1; i < l; i++){
            var index = cursor;
            var nextIndex, currentIndex;
    
            if (i+1 !== l && subs[i] !== subs[i+1]) {
                // look for the last occurrence of this group before the next group
                nextIndex = this.indexOf(subs[i+1], cursor);
                while (true) {
                    currentIndex = this.indexOf(subs[i], index);
                    if (currentIndex !== -1 && currentIndex <= nextIndex)
                        index = currentIndex + 1;
                    else
                        break;
                }
                index--;
            } else {
                index = this.indexOf(subs[i], cursor);
            }
            cursor = index + subs[i].length;
    
            res.push([subs[i], index]);
        }
        return res;
    }
    
    
    console.log("aabccde".matchIndex(/(a).(b)(c.)d/));
    // [ [ 'a', 1 ], [ 'b', 2 ], [ 'cc', 3 ] ] <-- 'a' was actually captured at 0
    
    console.log("aaa".matchIndex(/(a).(.)/));
    // [ [ 'a', 0 ], [ 'a', 1 ] ] <-- problem here: the second group is at 2
    
    console.log("bababaaaaa".matchIndex(/(ba)+.(a*)/));
    // [ [ 'ba', 4 ], [ 'aaa', 6 ] ] <-- 'aaa' was actually captured at 7
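The problem case can be reproduced in isolation by comparing a text search against the position the regex engine itself reports. This is a sketch, assuming a runtime that supports the `d` flag (ES2022); `guessed` and `actual` are illustrative names only.

```javascript
const txt = "aaa";
const m = /(a).(.)/d.exec(txt);

// Cursor-style guess: search for the text of group 2 right after group 1
const guessed = txt.indexOf(m[2], m.index + m[1].length);

// Position reported by the engine for group 2
const actual = m.indices[2][0];

console.log({ guessed, actual }); // the two values differ
```

Any approach based on indexOf has to guess when the captured text also occurs in an uncaptured part of the match, which is what happens here with the middle "a".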
    
  • 2020-11-29 05:10

    With RegExp.prototype.exec(), then searching for the index of each captured substring in the result:

    let regex1 = /([a-z]+):([0-9]+)/g;
    let str1 = 'hello:123';
    let array1;
    let resultArray = [];
    
    while ((array1 = regex1.exec(str1)) !== null) {
      // start at 1 to skip the complete match 'hello:123' stored at position 0
      for (let i = 1; i < array1.length; i++) {
        const found = array1[i];
        // indexOf gives the first occurrence at or after the match start,
        // which is not necessarily where the group itself matched
        resultArray.push([found, str1.indexOf(found, array1.index)]);
      }
    }
    console.log('result:', JSON.stringify(resultArray));
    
  • 2020-11-29 05:15

    I created a little RegExp parser that can also parse nested groups like a charm. It's small but huge. No really. Like Donald's hands. I would be really happy if someone could test it, so that it gets battle-tested. It can be found at: https://github.com/valorize/MultiRegExp2

    Usage:

    let regex = /a(?: )bc(def(ghi)xyz)/g;
    let regex2 = new MultiRegExp2(regex);
    
    let matches = regex2.execForAllGroups('ababa bcdefghixyzXXXX');
    
    Will output:
    
    [ { match: 'defghixyz', start: 8, end: 17 },
      { match: 'ghi', start: 11, end: 14 } ]
    
  • 2020-11-29 05:20

    I wrote MultiRegExp for this a while ago. As long as you don't have nested capture groups, it should do the trick. It works by inserting capture groups between those in your RegExp and using all the intermediate groups to calculate the requested group positions.

    var exp = new MultiRegExp(/(a).(b)(c.)d/);
    exp.exec("aabccde");
    

    should return

    {0: {index:0, text:'a'}, 1: {index:2, text:'b'}, 2: {index:3, text:'cc'}}
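The idea of intermediate groups can be illustrated by hand, without the library. In this sketch the pattern has been rewritten manually so that every piece of it is captured, and `positionsFromFullyCaptured` and its `wanted` parameter are hypothetical names, not MultiRegExp's API. It assumes no nested groups and no lookarounds, so that the captured pieces tile the whole match from left to right.

```javascript
// re:     a pattern in which every part is wrapped in its own capture group
// wanted: the group numbers that correspond to the original capture groups
function positionsFromFullyCaptured(re, wanted, str) {
  const m = re.exec(str);
  if (m === null) return null;

  const out = [];
  let pos = m.index; // running offset, starting where the whole match begins
  for (let i = 1; i < m.length; i++) {
    const text = m[i] === undefined ? "" : m[i];
    if (wanted.includes(i)) out.push({ index: pos, text });
    pos += text.length; // each piece starts where the previous one ended
  }
  return out;
}

// /(a).(b)(c.)d/ rewritten with the uncaptured "." and "d" captured as well
const r = positionsFromFullyCaptured(/(a)(.)(b)(c.)(d)/, [1, 3, 4], "aabccde");
console.log(r);
```

Because each position is the sum of the lengths of the preceding pieces, no text search is involved and repeated substrings cannot confuse the result.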
    

