Python Regex can't find substring but it should

情深已故 2021-01-25 20:49

I am trying to parse HTML with BeautifulSoup to extract the webpage title. Sometimes this does not work because the website is badly written, for example when it contains a bad end tag. W

2 answers
  • 2021-01-25 21:01

    You should use the re.DOTALL flag so that . matches newline characters as well.

    result = re.search(r'<title>(.+?)</title>', html, re.DOTALL)
    

    As the documentation says:

    ...without this flag, '.' will match anything except a newline
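
To illustrate the point about newlines, here is a minimal sketch. The `html` string is a made-up sample, not the asker's page; it simply puts the title text on its own line, which is one case where the pattern fails without the flag.

```python
import re

# Hypothetical sample page: the title text is surrounded by newlines.
html = "<html><head><title>\nExample page\n</title></head></html>"

# Without re.DOTALL, '.' cannot match the newline, so nothing matches.
no_flag = re.search(r"<title>(.+?)</title>", html)

# With re.DOTALL, '.' also matches '\n', so the title is found.
with_flag = re.search(r"<title>(.+?)</title>", html, re.DOTALL)

# Strip the surrounding whitespace captured along with the title.
title = with_flag.group(1).strip() if with_flag else None
print(no_flag)  # None
print(title)    # Example page
```

Checking the match for None before calling group() avoids an AttributeError on pages that have no title at all.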

  • 2021-01-25 21:09

    If you want to grab the text between the <title> and </title> tags, you could use this regexp:

    pattern = "<title>([^<]+)</title>"
    
    re.findall(pattern, html_string) 
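
A short sketch of how this pattern behaves, using a hypothetical `html_string` that is not from the question. Because `[^<]` is a negated character class, it matches newlines without any flag; the trade-off is that it will not match a title whose text itself contains a `<` character.

```python
import re

pattern = "<title>([^<]+)</title>"

# Hypothetical input; the title spans several lines.
html_string = "<head><title>\n  My Page\n</title></head>"

# findall returns a list of the captured groups, one per match.
titles = [t.strip() for t in re.findall(pattern, html_string)]
print(titles)  # ['My Page']

# A '<' inside the title text defeats the character class.
with_angle = re.findall(pattern, "<title>a < b</title>")
print(with_angle)  # []
```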
    