Python Regex can't find substring but it should

前端未结


I am trying to parse HTML using BeautifulSoup to extract the webpage title. Sometimes this does not work because the website is badly written, for example when it contains a bad end tag. W


2 Answers

死守一世寂寞

2021-01-25 21:01
You should use the re.DOTALL flag so that . matches newline characters as well.
```
import re
result = re.search(r'<title>(.+?)</title>', html, re.DOTALL)  # raw string; < > and / need no escaping
```
As the documentation says:

...without this flag, '.' will match anything except a newline
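
To see the difference the flag makes, here is a minimal sketch. The `html` string is a made-up page source (an assumption for illustration, not the asker's actual page) in which the title text is split across several lines:

```python
import re

# Hypothetical page source: the title text spans several lines.
html = "<html><head><title>\n  Example page\n</title></head></html>"

pattern = r"<title>(.+?)</title>"

# Without re.DOTALL, '.' stops at the first newline, so nothing matches.
without_flag = re.search(pattern, html)

# With re.DOTALL, '.' also matches '\n', so the whole title is captured.
with_flag = re.search(pattern, html, re.DOTALL)

print(without_flag)  # None
title = with_flag.group(1).strip()  # strip the surrounding whitespace and newlines
print(title)  # Example page
```

The non-greedy `.+?` still matters here: with re.DOTALL a greedy `.+` could run on to a later `</title>` if the page contains more than one.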
无人共我

2021-01-25 21:09
If you want to grab the text between the <title> and </title> tags you should use this regexp:
```
import re

pattern = r"<title>([^<]+)</title>"

# html_string is assumed to hold the page source as a str
titles = re.findall(pattern, html_string)
```
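
A short sketch of how this pattern behaves, using made-up input strings (assumptions for illustration only). Because `[^<]` is a negated character class, it matches newlines without needing any flag, but it stops at the first `<`, so a title that contains nested markup is not matched:

```python
import re

pattern = r"<title>([^<]+)</title>"

# Hypothetical input with a multi-line title and a single-line title.
html_string = "<head><title>\n  First page\n</title></head><title>Second</title>"

# [^<] matches any character except '<', including '\n'.
titles = [t.strip() for t in re.findall(pattern, html_string)]
print(titles)  # ['First page', 'Second']

# A title containing another tag is not matched, since [^<]+ stops at '<b>'.
nested = re.findall(pattern, "<title>Hello <b>world</b></title>")
print(nested)  # []
```

If titles with nested tags need to be captured, the `.+?` pattern combined with re.DOTALL is the more permissive choice.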