Python - find a substring between two strings based on the last occurrence of the latter string

Submitted by 半城伤御伤魂 on 2021-02-08 12:11:44

Question


I am trying to find a substring that lies between two strings. The first string is <br> and the last string is <br><br>. The first string occurs repeatedly, while the latter string can serve as an anchor.

Here is an example:

<div class="linkTabBl" style="float:left;padding-top:6px;width:240px">
    Anglo American plc
    <br>
    20 Carlton                 House Terrace
    <br>
    SW1Y 5AN London
    <br>
    United Kingdom
    <br><br>
    Phone : +44 (0)20 7968 8888
    <br>
    Fax : +44 (0)20 7968 8500
    <br>
    Internet : 
    <a class="pageprofil_link_blue" href="http://www.angloamerican.com" target="_blank">
        http://www.angloamerican.com
    </a>
    <br>
</div>

I am trying to get "United Kingdom". I would love to get this string with string manipulation, but I would also be interested if anyone can get it with BeautifulSoup (ideally using a CSS selector).

All the best.



Answer 1:


import re

html = """<div class="linkTabBl" style="float:left;padding-top:6px;width:240px">
    Anglo American plc
    <br>
    20 Carlton                 House Terrace
    <br>
    SW1Y 5AN London
    <br>
    United Kingdom
    <br><br>
    Phone : +44 (0)20 7968 8888
    <br>
    Fax : +44 (0)20 7968 8500
    <br>
    Internet : 
    <a class="pageprofil_link_blue" href="http://www.angloamerican.com" target="_blank">
        http://www.angloamerican.com
    </a>
    <br>
</div>"""

res = re.findall(r'<br>\n    ([a-zA-Z\s]+)?\n    <br><br>', html)

print(res)

Note: "\n    " matches a newline followed by 4 spaces, which is the whitespace that separates each <br> from the text you are looking for in the example above. If your HTML has no such whitespace, for example:

...
<br>United Kingdom<br><br>
...

You should replace

res = re.findall(r'<br>\n    ([a-zA-Z\s]+)?\n    <br><br>', html)

by

res = re.findall(r'<br>([a-zA-Z\s]+)?<br><br>', html)
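Both layouts can also be handled by a single pattern if the whitespace around the text is matched with \s* instead of being spelled out. The following is only a minimal sketch and not part of the original answer: the helper name find_before_double_br is made up for illustration, and the pattern assumes the wanted text contains no "<" characters.

```python
import re

# Hypothetical helper, for illustration only.
# Assumption: the wanted text itself contains no "<" characters.
# \s* absorbs any newlines/indentation around the text, and the lazy
# group ([^<]+?) keeps trailing whitespace out of the captured value.
PATTERN = re.compile(r'<br>\s*([^<]+?)\s*<br><br>')

def find_before_double_br(html):
    """Return every text run that sits between a <br> and the next <br><br>."""
    return PATTERN.findall(html)

# The indented layout from the question and the single-line layout from the note.
multi_line = "<br>\n    SW1Y 5AN London\n    <br>\n    United Kingdom\n    <br><br>\n    Phone"
single_line = "<br>SW1Y 5AN London<br>United Kingdom<br><br>Phone"

print(find_before_double_br(multi_line))
print(find_before_double_br(single_line))
```

Because [^<] also accepts digits and punctuation, this variant would capture values such as postcodes as well, which the [a-zA-Z\s] class used above would reject.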

Good regex lessons here https://regexone.com/




Answer 2:


You can get this by running a regex over the page's HTML text.

import requests, re

r = requests.get('https://www.marketscreener.com/ANGLO-AMERICAN-PLC-4007113/company/', headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36'})

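# [\w\s]+ also captures the whitespace around the text, so strip it.
# re.search returns None when nothing matches, so check before calling .group().
match = re.search(r'<br>([\w\s]+)<br><br>', r.text)
if match:
    print(match.group(1).strip())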
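The question also asked for plain string manipulation. One possible sketch, not taken from either answer, is to locate the <br><br> anchor with str.find and then search backwards with str.rfind for the last <br> before it. The function name text_before_anchor and its parameters are hypothetical, and the sketch assumes the first <br><br> in the text is the anchor of interest.

```python
def text_before_anchor(html, opener="<br>", anchor="<br><br>"):
    """Return the text between the last `opener` before `anchor` and `anchor`.

    Hypothetical helper for illustration. Returns None if either marker is missing.
    Assumption: the first occurrence of `anchor` is the one of interest.
    """
    end = html.find(anchor)             # where the anchor starts
    if end == -1:
        return None
    start = html.rfind(opener, 0, end)  # last opener lying entirely before the anchor
    if start == -1:
        return None
    return html[start + len(opener):end].strip()

sample = """<div class="linkTabBl">
    Anglo American plc
    <br>
    SW1Y 5AN London
    <br>
    United Kingdom
    <br><br>
    Phone : +44 (0)20 7968 8888
    <br>
</div>"""

print(text_before_anchor(sample))
```

Since no character class is involved, this approach returns whatever sits between the two markers, including any nested tags, so it fits best when the address block is known to be plain text.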


Source: https://stackoverflow.com/questions/58124584/python-find-a-substring-between-two-strings-based-on-the-last-occurence-of-the
