For the past few hours I've been trying to match address(es) from the following sample data and I can't get it to work:
medicalHistory None
address
The problem with your regex is that + is greedy and goes until it finds a character outside that group: the @ in the first case and the - in the second.
Another approach is to use a non-greedy quantifier and a positive look-ahead for a newline followed by a word character (Python version):
re.findall(r'address\s+.*?(?=\n\w)', s, re.DOTALL)
It yields:
['address 24 Lewin Street, KUBURA, \n NSW, Australia',
'address 16 Yarra Street, \n LAWRENCE, VIC, Australia']
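For anyone who wants to run this, here is a minimal self-contained sketch. The original sample data is not shown in full, so the string `s` below is a hypothetical reconstruction based on the output above; the `email` and `phone` fields and their values are assumptions, chosen only so that an @ and a - follow the addresses.

```python
import re

# Hypothetical sample: only "medicalHistory" and "address" come from the
# question; the "email" and "phone" lines are made up for illustration.
s = (
    "medicalHistory None\n"
    "address 24 Lewin Street, KUBURA, \n"
    " NSW, Australia\n"
    "email someone@example.com\n"
    "medicalHistory None\n"
    "address 16 Yarra Street, \n"
    " LAWRENCE, VIC, Australia\n"
    "phone 02-5550-0100\n"
)

# .*? is lazy and, with re.DOTALL, may cross newlines. It stops at the first
# newline that is immediately followed by a word character, i.e. at the
# start of the next field. Indented continuation lines do not stop it.
matches = re.findall(r'address\s+.*?(?=\n\w)', s, re.DOTALL)
print(matches)
```

Note that the look-ahead needs a following field to anchor on: an address that is the very last thing in the input, with no newline plus word character after it, would not be matched by this pattern.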
I would do it this way:
address\s+((?![\r\n]+\w)[0-9a-zA-Z, \r\n\t])+
See it here on Regexr.
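This ((?![\r\n]+\w)[0-9a-zA-Z, \r\n\t])+ is the important part: it matches the next character from [0-9a-zA-Z, \r\n\t] only if the negative lookahead (?![\r\n]+\w) succeeds, that is, only if the text at that position is not a run of line breaks followed by a word character. This matches what you expect.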
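A rough Python sketch of the same pattern, in case it helps to try it outside Regexr. The sample string is a hypothetical reconstruction (the `email` and `phone` lines are assumptions), and `finditer` with `group(0)` is used because the repeated capture group would otherwise hold only the last character matched.

```python
import re

# Hypothetical sample: only "medicalHistory" and "address" come from the
# question; the other fields are made up for illustration.
s = (
    "medicalHistory None\n"
    "address 24 Lewin Street, KUBURA, \n"
    " NSW, Australia\n"
    "email someone@example.com\n"
    "address 16 Yarra Street, \n"
    " LAWRENCE, VIC, Australia\n"
    "phone 02-5550-0100\n"
)

# Each repetition first checks the negative lookahead, then consumes one
# character from the class. The loop ends at a line break that is followed
# by a word character, which is where the next field begins.
pattern = re.compile(r'address\s+((?![\r\n]+\w)[0-9a-zA-Z, \r\n\t])+')

# group(0) is the whole match; group(1) would be just the final character.
matches = [m.group(0) for m in pattern.finditer(s)]
print(matches)
```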
In both your cases the regex stopped matching because of a character that is not included in your character class. If you want to go that way, then you need to combine a lazy quantifier and a positive lookahead:
address\s+([0-9a-zA-Z, \n\r\t]+?)(?=\r\w)
[0-9a-zA-Z, \n\r\t]+? matches as little as possible until the condition (?=\r\w) is true.
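One caveat when porting this to another environment: the lookahead (?=\r\w) only fires if lines end in a bare \r. The sketch below assumes the input uses \n line endings, as is typical for text read in Python, and widens the lookahead to [\r\n]; that change is my adaptation, not part of the pattern above, and the sample string is again a hypothetical reconstruction.

```python
import re

# Hypothetical sample with \n line endings; the "email" and "phone" fields
# are made up for illustration.
s = (
    "medicalHistory None\n"
    "address 24 Lewin Street, KUBURA, \n"
    " NSW, Australia\n"
    "email someone@example.com\n"
    "address 16 Yarra Street, \n"
    " LAWRENCE, VIC, Australia\n"
    "phone 02-5550-0100\n"
)

# Pattern as written above: no \r occurs in s, so the lookahead never
# succeeds and nothing is found.
as_written = re.findall(r'address\s+([0-9a-zA-Z, \n\r\t]+?)(?=\r\w)', s)

# Adapted lookahead (assumption): accept either \r or \n before the word
# character that starts the next field.
adapted = re.findall(r'address\s+([0-9a-zA-Z, \n\r\t]+?)(?=[\r\n]\w)', s)

print(as_written)
print(adapted)
```

Because the pattern has a single capture group, `re.findall` returns only the captured address text, without the leading "address" keyword.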
See it here at Regexr