Extract part of a regex match

北海茫月 2020-11-22 13:01

I want a regular expression to extract the title from an HTML page. Currently I have this:

title = re.search('.*', html, re.IGNOR


        
9 answers
  • 2020-11-22 14:01

    May I recommend Beautiful Soup? It is a very good library for parsing an entire HTML document.

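    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, "html.parser")
    titleName = soup.title.string  # text inside <title>; soup.title.name is just the tag name "title"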
    
  • 2020-11-22 14:01

    The currently top-voted answer by Krzysztof Krasoń fails with <title>a</title><title>b</title>. Also, it ignores title tags crossing line boundaries, e.g., for line-length reasons. Finally, it fails with <title >a</title> (which is valid HTML; see "White space inside XML/HTML tags").
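    To make these three failure modes concrete, here is a minimal sketch. It assumes the simpler pattern has the form <title>(.*)</title> (literal tags, greedy group, no re.DOTALL), as in the capturing-group answer in this thread; the helper name simple_title is only for illustration.

```python
import re

# Assumed form of the simpler pattern: literal tags, greedy group, no DOTALL.
SIMPLE = re.compile(r"<title>(.*)</title>", re.IGNORECASE)

def simple_title(html):
    """Return group 1 of the simple pattern, or None if nothing matches."""
    m = SIMPLE.search(html)
    return m.group(1) if m else None

# Greedy .* runs on to the LAST </title>, swallowing the tags in between.
two_titles = simple_title("<title>a</title><title>b</title>")

# Without re.DOTALL, "." does not match "\n", so there is no match at all.
multi_line = simple_title("<title>a\nb</title>")

# The literal "<title>" does not match "<title >".
spaced_tag = simple_title("<title >a</title>")

print(repr(two_titles))
print(repr(multi_line))
print(repr(spaced_tag))
```

    The first call returns the text of both titles with the inner tags still attached, and the other two return None, which is what the non-greedy group, re.DOTALL, and \s* are meant to address.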

    I therefore propose the following improvement:

    import re
    
    def search_title(html):
        m = re.search(r"<title\s*>(.*?)</title\s*>", html, re.IGNORECASE | re.DOTALL)
        return m.group(1) if m else None
    

    Test cases:

    print(search_title("<title   >with spaces in tags</title >"))
    print(search_title("<title\n>with newline in tags</title\n>"))
    print(search_title("<title>first of two titles</title><title>second title</title>"))
    print(search_title("<title>with newline\n in title</title\n>"))
    

    Output:

    with spaces in tags
    with newline in tags
    first of two titles
    with newline
     in title
    

    Ultimately, I agree with the others who recommend an HTML parser, not least because a parser also copes with non-standard use of HTML tags.
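    For readers who want a parser without a third-party dependency, the following is a rough sketch using html.parser from the Python standard library. The class name TitleParser and the function parse_title are hypothetical names chosen for this example, and keeping only the first title is an assumption about the desired behaviour.

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text of the first <title> element (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; only the first <title> is recorded.
        if tag == "title" and self.title is None:
            self.in_title = True
            self.title = ""

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self.in_title:
            self.title += data

def parse_title(html):
    """Return the first title's text, or None if the page has no <title>."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title

page_title = parse_title("<html><head><title>My page</title></head></html>")
first_only = parse_title("<title>first</title><title>second</title>")
no_title = parse_title("<p>no title here</p>")
```

    The same idea carries over to Beautiful Soup, which wraps this kind of event handling behind a tree interface.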

  • 2020-11-22 14:05

    Try using capturing groups:

    title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)
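    As a small illustration of what the capturing group changes, the sketch below compares group(0) with group(1) and adds a guard for the no-match case. The sample HTML strings and variable names are made up for this example.

```python
import re

html = "<html><head><title>Example page</title></head></html>"
m = re.search(r"<title>(.*)</title>", html, re.IGNORECASE)

whole_match = m.group(0)  # entire matched text, tags included
title = m.group(1)        # only what the parentheses captured

# re.search returns None when nothing matches, so guard before calling .group().
missing = re.search(r"<title>(.*)</title>", "<p>no title here</p>", re.IGNORECASE)
missing_title = missing.group(1) if missing else None

print(whole_match)
print(title)
print(missing_title)
```

    Without the guard, calling .group(1) on a page that has no title raises AttributeError, because the result of re.search is None.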
    