Matlab regular expressions capture groups with named tokens

Posted by 我们两清 on 2019-12-11 05:23:16

Question


I am trying to read a few text lines from a file in MATLAB, using the regexp function to extract some named tokens. While everything works nicely in Octave, I cannot get the same expression to work in MATLAB.

There are different kinds of lines I want to process, like:

line1 = 'attr enabled  True';
line2 = 'attr width  1.2';
line3 = 'attr size  8Byte';

The regular expression I have come up with looks like:

pattern = '^attr +(?<name>\S+) +(?:(?<number>[+-]?\d+(?:\.\d+)?)(?<unit>[a-z,A-z]*)?|(?<bool>(?:[tT][rR][uU][eE]|[fF][aA][lL][sS][eE])))$'

When I run (in MATLAB R2016b):

[tokens, matches] = regexp(line1, pattern, 'names', 'match');

The result looks like:

tokens  = 0×0 empty struct array with fields:
             name
matches = 0×0 empty cell array

The result in Octave, however, looks like:

tokens = scalar structure containing the fields:
             name = enabled
             number =
             unit =
             bool = True
matches = { [1,1] = attr enabled  True }

I tested my regex with regexr.com, which suggested that Octave was working correctly.

As soon as I remove the outer non-capturing group from the regex pattern:

pattern = '^attr +(?<name>\S+) +(?<number>[+-]?\d+(?:\.\d+)?)(?<unit>[a-z,A-z]*)?|(?<bool>(?:[tT][rR][uU][eE]|[fF][aA][lL][sS][eE]))$'

MATLAB outputs:

tokens = struct with fields:
              bool: 'True'
              name: []
              number: []
              unit: []
matches = { True }

So MATLAB starts recognizing the other named tokens as fields, but the name field is still empty. Furthermore, without the outer group the alternation is no longer scoped correctly: the pattern now matches either everything before the | or only the bool part. Is that a bug concerning capture groups, or do I terribly misunderstand something?


Answer 1:


Some simple tests suggest that MATLAB does not support named tokens nested inside non-capturing groups. Your best workaround might be to use unnamed groups?

x1 = 'Apple Banana Cat';

% Named groups work:
re1 = regexp(x1, '(?<first>A.+) (?<second>B.+) (?<third>C.+)', 'names')

% Non-capturing (unnamed) groups work...
re2 = regexp(x1, '(?:A.+) (?<second>B.+) (?<third>C.+)', 'names')

% Nested non-capturing group does work, but not with named groups
re3 = regexp(x1, '(?:(A.+)) (?<second>B.+) (?<third>C.+)', 'names')         % OK
re4 = regexp(x1, '(?:(A.+)) (B.+) (C.+)', 'tokens')                         % OK (unnamed)
re5 = regexp(x1, '(?:(?<first>A.+)) (?<second>B.+) (?<third>C.+)', 'names') % Not OK

Sadly, there is no single canonical regexp definition; there are lots of flavours. So the fact that a pattern works in Octave or on regexr.com is no guarantee that it would or should work elsewhere, especially once you get into the more exotic regions of regex.

I think you might have to work around it, though I'd be pleased to be proved wrong!
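
One possible workaround, sketched here under assumptions rather than taken from the question: parse each line in two steps so that every named token sits at the top level of its own pattern and nothing is nested inside another group. The variable names (textLines, linePattern, valuePattern, attr, num) are made up for illustration. The sketch also uses [a-zA-Z] for the unit, because the original class [a-z,A-z] additionally matches a comma and the punctuation characters between Z and a.

```matlab
% Hypothetical two-step workaround: no named token is nested in a group.
textLines = {'attr enabled  True', 'attr width  1.2', 'attr size  8Byte'};

% Step 1: split the line into a name and a raw value (two top-level tokens).
linePattern  = '^attr +(?<name>\S+) +(?<value>\S+)$';

% Step 2: split a numeric value into number and unit.
% \.?\d* avoids a nested group, at the cost of also accepting "1.".
valuePattern = '^(?<number>[+-]?\d+\.?\d*)(?<unit>[a-zA-Z]*)$';

for k = 1:numel(textLines)
    attr = regexp(textLines{k}, linePattern, 'names');
    if isempty(attr)
        continue;   % not an "attr" line, skip it
    end

    if any(strcmpi(attr.value, {'true', 'false'}))
        % Case-insensitive boolean check replaces the [tT][rR]... alternation.
        disp([attr.name ': bool = ' attr.value]);
    else
        num = regexp(attr.value, valuePattern, 'names');
        if ~isempty(num)
            disp([attr.name ': number = ' num.number ', unit = ' num.unit]);
        end
    end
end
```

The trade-off is two regexp calls per line instead of one, but the same code should then behave identically in MATLAB and Octave, since it relies only on top-level named tokens.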

(PS: I tested in R2016a; YMMV.)

EDIT: I've now tested in both R2016a and R2016b; "re4" works and gives the same results in both:

>> x1 = 'Apple Banana Cat';
>> re4 = regexp(x1, '(?:(A.+)) (B.+) (C.+)', 'tokens');

>> disp(re4{1}{1})
Banana

>> disp(re4{1}{2})
Cat



Answer 2:


Nested capturing groups would be the problem here.

I ran into this problem as well, and it drove me crazy. Eventually I think I found the MATLAB documentation explaining what's going on:

Note: If an expression has nested parentheses, MATLAB captures tokens that correspond to the outermost set of parentheses. For example, given the search pattern '(and(y|rew))', MATLAB creates a token for 'andrew' but not for 'y' or 'rew'.

That's from the "Regular Expressions" help page of the MATLAB documentation:

>> web(fullfile(docroot, 'matlab/matlab_prog/regular-expressions.html#btrvwd4'))

I'm running version 8.6.0.267246 (R2015b).

So this is a non-feature of MATLAB specifically. It seems very limiting to me, but maybe I'm missing something.
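
To see the documented behaviour directly, the example pattern from the note can be run as a quick check. This is only a sketch: the variable names str and tok are my own, and the comments describe what the documentation note says should happen, not output I have verified on every release.

```matlab
% Check the documented rule for nested capturing parentheses.
str = 'andrew';
tok = regexp(str, '(and(y|rew))', 'tokens');

% tok is a cell array with one cell per match; each of those cells holds
% the tokens of that match. Per the documentation note, only the outermost
% parentheses should produce a token.
disp(numel(tok{1}));   % number of tokens captured for the first match
disp(tok{1}{1});       % the token from the outermost group
```

If the count printed is 1, the inner (y|rew) group was used for grouping only, which matches the note and explains why tokens placed inside another group go missing.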



Source: https://stackoverflow.com/questions/43244838/matlab-regular-expressions-capture-groups-with-named-tokens
