I am trying to parse HTML using BeautifulSoup to extract the webpage title. Sometimes this does not work because the website is badly written, for example when it contains a bad end tag. W
You should use the re.DOTALL flag to make the . match newline characters as well.
result = re.search(r'<title>(.+?)</title>', html, re.DOTALL)
As the documentation says:
...without this flag, '.' will match anything except a newline
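To see the effect of the flag, here is a minimal sketch. The sample html string is made up for illustration and assumes a title that spans several lines:

```python
import re

# Hypothetical sample page whose title spans several lines.
html = "<html><head><title>\nMy Page\n</title></head></html>"

pattern = r"<title>(.+?)</title>"

# Without the flag, '.' cannot cross the newline, so nothing matches.
without_flag = re.search(pattern, html)

# With re.DOTALL, '.' also matches '\n', so the whole title is captured.
with_flag = re.search(pattern, html, re.DOTALL)

# Strip the surrounding whitespace that came along with the newlines.
title = with_flag.group(1).strip() if with_flag else None
print(without_flag, repr(title))
```

Checking the result for None before calling group() matters here, since re.search returns None when the page has no matching title.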
If you want to grab the text between the <title> and </title> tags, you should use this regexp:
pattern = "<title>([^<]+)</title>"
re.findall(pattern, html_string)
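As a rough sketch of how this could be wrapped up, the helper below (extract_title is a hypothetical name, and re.IGNORECASE is my own addition to cope with upper-case tags) returns the first title or None. Because [^<] is a negated character class, it already matches newlines, so re.DOTALL is not needed here:

```python
import re

def extract_title(html_string):
    """Return the first <title> text, stripped, or None if nothing matches."""
    # [^<]+ matches any run of characters except '<', newlines included.
    # re.IGNORECASE is an assumption added here so <TITLE> also matches.
    matches = re.findall(r"<title>([^<]+)</title>", html_string, re.IGNORECASE)
    return matches[0].strip() if matches else None

print(extract_title("<head><TITLE>\n Example page \n</TITLE></head>"))
print(extract_title("<p>no title here</p>"))
```

Note that this pattern will not match an empty title, a title tag that carries attributes, or a title whose text contains a < character.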