Question
I'm trying to parse a WhatsApp chat log using regex. I have a solution that works for most cases, but I'd like to improve it and don't know how, since I am quite new to regex.
The chat.txt file looks like this:
[06.12.16, 16:46:19] Person One: Wow thats amazing
[06.12.16, 16:47:13] Person Two: Good morning and this goes over multiple
lines as it is a very long message
[06.12.16, 16:47:22] Person Two: ::
My solution so far parses most of these messages correctly, but I have a few hundred cases where the message starts with a colon, like the last example above. This leads to an unwanted value of Person Two: : as the sender.
Here is the regex I am working with so far:
pattern = re.compile(r'\[(?P<date>\d{2}\.\d{2}\.\d{2}),\s(?P<time>\d{2}:\d{2}:\d{2})]\s(?P<sender>(?<=\s).*(?::\s*\w+)*(?=:)):\s(?P<message>(?:.+|\n+(?!\[\d{2}\.\d{2}\.\d{2}))+)')
Any advice on how I could work around this bug would be appreciated!
Answer 1:
I would pre-process the lines to remove the consecutive colons before applying the regex. So, for each line, e.g.:
line = "[06.12.16, 16:47:22] Person Two: ::"
line = line.replace("::", "")
which would give:
[06.12.16, 16:47:22] Person Two: 
You can then call your regex function on the pre-processed data.
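As an alternative to stripping the colons (which also removes them from the message text), one could tighten the sender group so that it stops at the first colon after the timestamp. The following is only a sketch and not part of the original answer: it assumes sender names never contain a colon, and the simplified message group, the chat sample string and the messages list are illustrative choices of mine.

```python
import re

# Hypothetical alternative pattern (an assumption, not the asker's original):
# the sender is everything up to the FIRST colon after the timestamp, so any
# colons inside the message text are left alone.
pattern = re.compile(
    r'\[(?P<date>\d{2}\.\d{2}\.\d{2}),\s(?P<time>\d{2}:\d{2}:\d{2})\]\s'
    r'(?P<sender>[^:\n]+):\s'
    # First line of the message, then any following lines that do not
    # start a new "[dd.mm.yy, " entry.
    r'(?P<message>.*(?:\n(?!\[\d{2}\.\d{2}\.\d{2},\s).*)*)'
)

# Sample input copied from the question.
chat = (
    "[06.12.16, 16:46:19] Person One: Wow thats amazing\n"
    "[06.12.16, 16:47:13] Person Two: Good morning and this goes over multiple\n"
    "lines as it is a very long message\n"
    "[06.12.16, 16:47:22] Person Two: ::\n"
)

messages = []
for match in pattern.finditer(chat):
    entry = match.groupdict()
    # The last entry can pick up the file's trailing newline; drop it.
    entry["message"] = entry["message"].rstrip("\n")
    messages.append(entry)

for entry in messages:
    print(entry["sender"], "|", repr(entry["message"]))
```

With this approach the last line should yield Person Two as the sender and :: as the message, without modifying the input. If a contact name can contain a colon, this sketch would split it in the wrong place and a different delimiter rule would be needed.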
Source: https://stackoverflow.com/questions/55066839/whatsapp-chat-log-parsing-with-regex