WhatsApp chat log parsing with regex

╄→гoц情女王★ submitted on 2019-12-11 19:28:35

Question


I'm trying to parse a WhatsApp chat log using regex. I have a solution that works for most cases, but I don't know how to improve it since I am quite new to regex.

The chat.txt file looks like this:

[06.12.16, 16:46:19] Person One: Wow thats amazing
[06.12.16, 16:47:13] Person Two: Good morning and this goes over multiple
lines as it is a very long message
[06.12.16, 16:47:22] Person Two: ::

My solution so far parses most of these messages correctly, but I have a few hundred cases where the message starts with a colon, like the last example above. In those cases the sender is captured as Person Two: : instead of Person Two.

Here is the regex I am working with so far:

pattern = re.compile(r'\[(?P<date>\d{2}\.\d{2}\.\d{2}),\s(?P<time>\d{2}:\d{2}:\d{2})]\s(?P<sender>(?<=\s).*(?::\s*\w+)*(?=:)):\s(?P<message>(?:.+|\n+(?!\[\d{2}\.\d{2}\.\d{2}))+)')

Any advice on how I could work around this bug would be appreciated!


Answer 1:


I would pre-process the lines to remove the consecutive colons before applying the regex. For each line, e.g.:

 line = "[06.12.16, 16:47:22] Person Two: ::"
 line = line.replace("::", "")

which would give:

[06.12.16, 16:47:22] Person Two: 

You can then call your regex function on the pre-processed data.
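
Note that replace("::", "") also strips "::" from messages that legitimately contain it. An alternative that leaves the log untouched is to stop the sender group at the first colon. The following is only a minimal sketch, assuming that sender names never contain a colon; the names CHAT, PATTERN and messages are made up for illustration, and the sample text is the excerpt from the question.

```python
import re

# Sample input copied from the question.
CHAT = (
    "[06.12.16, 16:46:19] Person One: Wow thats amazing\n"
    "[06.12.16, 16:47:13] Person Two: Good morning and this goes over multiple\n"
    "lines as it is a very long message\n"
    "[06.12.16, 16:47:22] Person Two: ::\n"
)

PATTERN = re.compile(
    # Timestamp header, e.g. "[06.12.16, 16:47:22] "
    r'\[(?P<date>\d{2}\.\d{2}\.\d{2}),\s(?P<time>\d{2}:\d{2}:\d{2})\]\s'
    # Assumption: a sender name contains no colon, so the group ends at the
    # first colon instead of greedily running on to the last one in the line.
    r'(?P<sender>[^:\n]+):\s'
    # Message body: the rest of the line plus any continuation lines that
    # do not start with a new "[dd.mm.yy" header (unchanged from the question).
    r'(?P<message>(?:.+|\n+(?!\[\d{2}\.\d{2}\.\d{2}))+)'
)

messages = [
    {
        "date": m.group("date"),
        "time": m.group("time"),
        "sender": m.group("sender"),
        # The message group can swallow trailing newlines, so strip them.
        "message": m.group("message").strip(),
    }
    for m in PATTERN.finditer(CHAT)
]

for msg in messages:
    print(msg["sender"], "|", repr(msg["message"]))
```

With this pattern the last entry should be parsed with Person Two as the sender and :: as the message. If some contacts do have a colon in their name, the character class would need to be adjusted.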



Source: https://stackoverflow.com/questions/55066839/whatsapp-chat-log-parsing-with-regex
