Question
Is there a way to use a non-greedy regular expression in C like one can use in Perl? I tried several things, but it's actually not working.
I'm currently using this regex that matches an IP address and the corresponding HTTP request, but it's greedy although I'm using the *?:
([0-9]{1,3}(\\.[0-9]{1,3}){3})(.*?)HTTP/1.1
In this example, it always matches up to the last HTTP/1.1 instead of stopping at the first:
#include <regex.h>
#include <stdio.h>
int main() {
    int a, i;
    regex_t re;
    regmatch_t pm;
    char *mpages = "TEST 127.0.0.1 GET /test.php HTTP/1.1\" 404 525 \"-\" \"Mozilla/5.0 (Windows NT HTTP/1.1 TEST";
    a = regcomp(&re, "([0-9]{1,3}(\\.[0-9]{1,3}){3})(.*?)HTTP/1.1", REG_EXTENDED);
    if(a!=0)
        printf(" -> Error: Invalid Regex");
    a = regexec(&re, &mpages[0], 1, &pm, REG_EXTENDED);
    if(a==0) {
        for(i = pm.rm_so; i < pm.rm_eo; i++)
            printf("%c", mpages[i]);
        printf("\n");
    }
    return 0;
}
$ ./regtest
127.0.0.1 GET /test.php HTTP/1.1" 404 525 "-" "Mozilla/5.0 (Windows NT HTTP/1.1
Answer 1:
No, there are no non-greedy quantifiers in POSIX regular expressions. But there is a library that provides perl-like regular expressions for C: http://www.pcre.org/
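If staying with plain POSIX is preferred, the usual workaround is a negated character class in place of a lazy quantifier. The sketch below illustrates the difference on a toy input; match_ere is a hypothetical helper name, not part of <regex.h>, and the patterns are illustrative only.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Hypothetical helper: copy the overall match of `pattern` in `text`
 * into `out` (capacity `cap`, truncating if needed).
 * Returns 0 on a match, nonzero on compile failure or no match. */
int match_ere(const char *pattern, const char *text, char *out, size_t cap)
{
    regex_t re;
    regmatch_t pm[1];
    int rc;

    assert(out != NULL && cap > 0);
    out[0] = '\0';

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* The last argument takes execution flags (REG_NOTBOL, REG_NOTEOL),
     * so 0 is passed here rather than REG_EXTENDED. */
    rc = regexec(&re, text, 1, pm, 0);
    if (rc == 0) {
        size_t len = (size_t)(pm[0].rm_eo - pm[0].rm_so);
        if (len >= cap)
            len = cap - 1;
        memcpy(out, text + pm[0].rm_so, len);
        out[len] = '\0';
    }
    regfree(&re);
    return rc;
}
```

With the input "xaXbYbz", the pattern "a.*b" yields "aXbYb" (longest match), while "a[^b]*b" stops at the first b and yields "aXb".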
Answer 2:
As I said earlier in a comment, use grep -E to test POSIX regexes; it speeds up development considerably. Either way, it seems your problem is with the regular expression itself rather than with the missing feature.
I'm not quite clear on what you want to grab from the request. Supposing you just want the IP address, the HTTP verb and the resource, you could end up with the following regex:
regcomp(&re, "\\b(.?[0-9])+\\s+(GET|POST|PUT)\\s+([^ ]+)", REG_EXTENDED);
Be aware that several assumptions have been made. For example, this regex assumes the IP address is well formed, and that the request uses one of the HTTP verbs GET, POST or PUT. Note also that \b and \s are GNU extensions rather than standard POSIX ERE syntax. Edit according to your needs.
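A minimal sketch of the same idea using only POSIX bracket classes ([[:space:]]) and a stricter IP pattern borrowed from the question. The names request_parts, copy_group and parse_request are hypothetical, and the buffer sizes are arbitrary assumptions.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Hypothetical container for the three pieces of interest. */
struct request_parts {
    char ip[16];
    char verb[8];
    char resource[256];
};

/* Copy one capture group into a fixed-size buffer, truncating if needed. */
static void copy_group(const char *text, const regmatch_t *m,
                       char *out, size_t cap)
{
    size_t len = (size_t)(m->rm_eo - m->rm_so);
    if (len >= cap)
        len = cap - 1;
    memcpy(out, text + m->rm_so, len);
    out[len] = '\0';
}

/* Returns 0 and fills `parts` on a match, nonzero otherwise. */
int parse_request(const char *line, struct request_parts *parts)
{
    /* Groups: 1 = IP, 2 = last ".N" repetition, 3 = verb, 4 = resource. */
    static const char pattern[] =
        "([0-9]{1,3}(\\.[0-9]{1,3}){3})[[:space:]]+"
        "(GET|POST|PUT)[[:space:]]+([^[:space:]]+)";
    regex_t re;
    regmatch_t pm[5];
    int rc;

    assert(line != NULL && parts != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    rc = regexec(&re, line, 5, pm, 0);
    if (rc == 0) {
        copy_group(line, &pm[1], parts->ip, sizeof parts->ip);
        copy_group(line, &pm[3], parts->verb, sizeof parts->verb);
        copy_group(line, &pm[4], parts->resource, sizeof parts->resource);
    }
    regfree(&re);
    return rc;
}
```

Because the resource is matched with [^[:space:]]+, it ends at the first whitespace, so no lazy quantifier is needed.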
Answer 3:
The brute-force method of getting a regex to match up to the next occurrence of a word is:
"([^H]|H[^T]|HT[^T]|HTT[^P]|HTTP[^/]|HTTP/[^1]|HTTP/1[^.]|HTTP/1\\.[^1])*HTTP/1\\.1"
unless you can get smarter about your match -- which you can: HTTP requests are
Request-Line = Method SP Request-URI SP HTTP-Version CRLF
and none of the nonterminals on the right match embedded spaces. So:
"[0-9]{1,3}(\\.[0-9]{1,3}){3} [^ ]* [^ ]* HTTP/1\\.1"
The parentheses are dropped here since you're only allocating space for the whole-expression match; put the parens back in to get the pieces.
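As a rough check, the space-delimited pattern above can be run against the question's sample string. find_request is a hypothetical helper name; it only reports the offsets of the overall match.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: locate "IP Method Request-URI HTTP/1.1" in `text`.
 * On success returns 0 and stores the match offsets in `span`. */
int find_request(const char *text, regmatch_t *span)
{
    static const char pattern[] =
        "[0-9]{1,3}(\\.[0-9]{1,3}){3} [^ ]* [^ ]* HTTP/1\\.1";
    regex_t re;
    int rc;

    assert(text != NULL && span != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* nmatch is 1: only the whole-expression match is requested. */
    rc = regexec(&re, text, 1, span, 0);
    regfree(&re);
    return rc;
}
```

On the question's input, the match covers only "127.0.0.1 GET /test.php HTTP/1.1": each [^ ]* cannot cross a space, so the second HTTP/1.1 later in the line is never reached.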
Answer 4:
a = regcomp(&re, "([0-9]{1,3}(\\.[0-9]{1,3}){3})(.*?)HTTP/1.1", REG_EXTENDED|REG_ENHANCED);
This macro did not exist in older SDKs; Apple's regex.h defines it only for OS X 10.8 / iOS 6.0 and later:
#if __MAC_OS_X_VERSION_MIN_REQUIRED >= __MAC_10_8 \
|| __IPHONE_OS_VERSION_MIN_REQUIRED >= __IPHONE_6_0
#define REG_ENHANCED 0400 /* Additional (non-POSIX) features */
#endif
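Since REG_ENHANCED is an Apple extension, portable code would presumably need to guard it. A minimal sketch, assuming the flag is visible as a preprocessor macro whenever it is available; compile_maybe_lazy is a hypothetical name.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compile `pattern` as an ERE, adding REG_ENHANCED
 * only where the header defines it. Returns regcomp's result. */
int compile_maybe_lazy(regex_t *re, const char *pattern)
{
    int cflags = REG_EXTENDED;

#ifdef REG_ENHANCED
    cflags |= REG_ENHANCED;   /* enables *? and friends on Apple platforms */
#endif

    assert(re != NULL && pattern != NULL);
    return regcomp(re, pattern, cflags);
}
```

Where REG_ENHANCED is missing, the meaning of *? is not specified by POSIX, so a caller should not rely on the match being lazy there; the negated-class approach remains the portable choice.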
Answer 5:
In your code, pm should be an array of regmatch_t, and in your case it should have between 2 and 4 elements, depending on which () sub-expressions you want to capture.
You have only one element. The first element, pm[0], always gets whatever text matches your entire RE. That's the one you'll be getting. It is pm[1] that will get the text of the first () sub-expression (the IP address), and pm[3] that will get the text matching your (.*?) term.
But even so, as stated above (by Wumbley, W. Q.), the POSIX regex library may not support non-greedy quantifiers.
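A short sketch of the array form described above. It uses (.*) rather than (.*?) so that the pattern stays within what POSIX defines; match_groups, print_groups and NGROUPS are hypothetical names.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

#define NGROUPS 4   /* pm[0] = whole match, pm[1..3] = the three () groups */

/* Hypothetical helper: run the question's regex (greedy form) and fill
 * `pm` with the offsets of the whole match and each sub-expression. */
int match_groups(const char *text, regmatch_t pm[NGROUPS])
{
    regex_t re;
    int rc;

    assert(text != NULL && pm != NULL);
    if (regcomp(&re, "([0-9]{1,3}(\\.[0-9]{1,3}){3})(.*)HTTP/1.1",
                REG_EXTENDED) != 0)
        return -1;

    rc = regexec(&re, text, NGROUPS, pm, 0);
    regfree(&re);
    return rc;
}

/* Print every group that took part in the match. */
void print_groups(const char *text, const regmatch_t pm[NGROUPS])
{
    int i;

    for (i = 0; i < NGROUPS; i++) {
        if (pm[i].rm_so == -1)   /* group did not participate */
            continue;
        printf("pm[%d] = \"%.*s\"\n", i,
               (int)(pm[i].rm_eo - pm[i].rm_so), text + pm[i].rm_so);
    }
}
```

Calling match_groups and then print_groups on the request line shows pm[1] holding just the IP address, while pm[0] and pm[3] still run to the last HTTP/1.1 because the quantifier is greedy.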
Source: https://stackoverflow.com/questions/20239817/posix-regular-expression-non-greedy