Question
I am trying to extract the venue from a file that contains several articles, using a regex. The venue starts with either For or From and is followed by a date, which starts with a day of the week, or by the author's name if the date is missing. I wrote the following regex to match the venue, but it always matches everything up to the author's name, so the date ends up in the venue whenever the article has one.
"""((?<=\n)(?:(?:\bFrom\b)|(?:\bFor\b)).*?(?=(?:(?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)|(?:[A-Z]+))))""".r
Why does my regex not stop at the day of the week when one is present, and instead continue until it matches [A-Z], which is the author's name?
Input: "The Consequences of Hostilities Between the States
From the New York Packet.
Tuesday, November 20, 1787.
HAMILTON
To the People of the State of New York:"
The line "Tuesday, November 20, 1787." is optional and may not occur in all articles. I want the output to be "From the New York Packet." I get the correct output for articles that do not have a date, but for articles that contain the date I get "From the New York Packet.
Tuesday, November 20, 1787."
Answer 1:
Based on your edit, all you really need is
^(From|For).*
with the multiline flag.
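A minimal Scala sketch of this suggestion, run against the sample input from the question. The inline `(?m)` prefix is one way to switch on multiline mode; the names `VenueLine`, `find`, and `sample` are made up for illustration.

```scala
object VenueLine {
  // (?m) enables multiline mode, so ^ matches at the start of every line.
  // DOTALL is not set, so .* stops at the line terminator and the
  // date line that may follow is never part of the match.
  private val venueLine = """(?m)^(From|For).*""".r

  // Returns the first line that starts with "From" or "For", if any.
  def find(text: String): Option[String] = venueLine.findFirstIn(text)

  // Sample input copied from the question.
  val sample: String =
    """The Consequences of Hostilities Between the States
      |From the New York Packet.
      |Tuesday, November 20, 1787.
      |HAMILTON
      |To the People of the State of New York:""".stripMargin
}
```

Calling `VenueLine.find(VenueLine.sample)` should give `Some("From the New York Packet.")`, whether or not the date line is present.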
From your question:
"I know that the venue starts with either For/From and is followed by date which starts with a day of the week or author's name if the date is missing"
"it always matches everything till the author's name which means the date also comes in the venue if that article has a date."
Sounds like you want to find an entire line within a text file that begins with "From" or "For"
^(From|For)
(Set the multiline flag so that ^ matches the beginning of a line rather than the beginning of input.)
is followed by an optional date
\s+(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)?
followed by the author's name
\s+\w+\s+\w+
followed by everything until the end of the line
.*
Unless, of course, you mean that you want to skip the date and match only the For/From part and the author's name (not the date). A single regex match is always one contiguous span of text, so it cannot skip the date by itself; you can use capture groups to extract the desired values, though.
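The capture-group idea could look roughly like the sketch below. It assumes the venue, the optional date, and the author each sit on their own line, and that the author is a run of capital letters (the `[A-Z]+` from the question). `\R` is the Java 8+ linebreak matcher. The names `ArticleHeader`, `Header`, and `parse` are hypothetical.

```scala
object ArticleHeader {
  final case class Header(venue: String, date: Option[String], author: String)

  // Group 1: the venue line. Group 2: the optional date line, which starts
  // with a day of the week. Group 3: the author's name in capitals.
  private val header =
    """(?m)^((?:From|For)\b.*)\R(?:((?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)\b.*)\R)?([A-Z]+)$""".r

  // group(2) is null when the date line is absent; Option(...) turns that into None.
  def parse(text: String): Option[Header] =
    header.findFirstMatchIn(text).map { m =>
      Header(m.group(1), Option(m.group(2)), m.group(3))
    }
}
```

The overall match still spans the date when it is present, but the venue and author come out as separate groups, so the date can simply be ignored.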
Answer 2:
You only need to capture the entire line that starts with For or From, so you can simply use this:
^(For|From).*$
With the multiline flag set, ^ and $ anchor the match to the start and end of a line, and .* matches everything in between.
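Since the file holds several articles, the same pattern can be applied to the whole text at once. This is only a sketch in Scala: `AllVenues` and `findAll` are illustrative names, and it assumes no other line in the file begins with "For" or "From" (a body line that does would also be returned).

```scala
object AllVenues {
  // (?m) makes ^ and $ match at line boundaries instead of only at the
  // start and end of the whole input.
  private val venueLine = """(?m)^(For|From).*$""".r

  // Collects every line that starts with "For" or "From", in file order.
  def findAll(text: String): List[String] = venueLine.findAllIn(text).toList
}
```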
If this needs to be more complicated, I'll update my answer.
Source: https://stackoverflow.com/questions/14760355/lookahead-in-regex