Python string operation: extract text between HTML tags

无人及你 2020-12-03 12:48

I have a string:

  
    <font face="ARIAL,HELVETICA" size="-2">  
    JUL 28         </font>

(it outputs over two lines, so there must

6 answers
  • 2020-12-03 13:15

    You have a bunch of options here. You could go for a full XML parser like lxml, though you seem to want a domain-specific solution. I'd go with a multiline regex:

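    import re

    # re.S lets "." match newlines, so the tag body may span several lines.
    rex = re.compile(r'<font.*?>(.*?)</font>', re.S)

    data = """<font face="ARIAL,HELVETICA" size="-2">  
    JUL 28         </font>"""

    # search() finds the tag anywhere in the string; match() only tries the start.
    match = rex.search(data)
    text = match.group(1).strip() if match else None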
    

    Now that you have text, you can turn it into a date pretty easily:

    from datetime import datetime
    date = datetime.strptime(text, "%b %d")
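
    Note that parsing with "%b %d" alone leaves the year at its default of 1900. A minimal sketch that combines the two steps above and supplies the year explicitly might look like the following; the helper name extract_date and the year 2011 are assumptions made for illustration and do not come from the question.

```python
import re
from datetime import datetime

# Non-greedy match of the first <font ...>...</font> pair; re.S lets "." span newlines.
FONT_RE = re.compile(r"<font[^>]*>(.*?)</font>", re.S | re.I)


def extract_date(html, year):
    """Return the date inside the first <font> element, or None if there is none.

    `year` is supplied by the caller (an assumption: the source string has no year).
    """
    match = FONT_RE.search(html)
    if match is None:
        return None
    text = match.group(1).strip()
    # Parse with an explicit year so the result does not default to 1900.
    return datetime.strptime(f"{text} {year}", "%b %d %Y")


data = """<font face="ARIAL,HELVETICA" size="-2">  
JUL 28         </font>"""

print(extract_date(data, 2011))
```

    Returning None when no tag is found keeps the caller from hitting an undefined variable, which is the failure mode of an unguarded regex match.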
    
  • 2020-12-03 13:15

    Is grep an option?

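    grep -E '<[^>]*>(.*)</[^>]*>' file
    

    The (.*) group should match your content. The -E flag is needed so that the parentheses act as a group rather than literal characters, and because grep works line by line, this only matches when the opening and closing tags are on the same line.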

  • 2020-12-03 13:19

    Use Scrapy's XPath selectors as documented at http://doc.scrapy.org/en/0.10.3/topics/selectors.html

    Alternatively, you can use an HTML parser such as BeautifulSoup, especially if you want to operate on the document in an object-oriented manner.

    http://pypi.python.org/pypi/BeautifulSoup/3.2.0
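
    To give a feel for XPath-style selection without installing anything, here is a rough sketch that uses the standard library's xml.etree.ElementTree instead of Scrapy. This is a swapped-in technique, not Scrapy's API: ElementTree supports only a limited XPath subset and is an XML parser, so it raises an error on HTML that is not well formed. The function name font_text_xpath is hypothetical.

```python
import xml.etree.ElementTree as ET

data = """<font face="ARIAL,HELVETICA" size="-2">  
JUL 28         </font>"""


def font_text_xpath(fragment):
    """Return the stripped text of the first <font> element, or None."""
    # Wrap the fragment so there is a single root element to query from.
    root = ET.fromstring(f"<root>{fragment}</root>")
    # XPath-style path: any descendant element named "font".
    node = root.find(".//font")
    if node is None or node.text is None:
        return None
    return node.text.strip()


print(font_text_xpath(data))
```

    For real-world pages, which are rarely valid XML, a tolerant HTML parser remains the safer choice.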

  • 2020-12-03 13:23

    While it may be possible to parse arbitrary HTML with regular expressions, it's often a death trap. There are great tools out there for parsing HTML, including BeautifulSoup, which is a Python lib that can handle broken as well as good HTML fairly well.

    >>> from BeautifulSoup import BeautifulSoup as BSHTML
    >>> BS = BSHTML("""
    ... <font face="ARIAL,HELVETICA" size="-2">  
    ... JUL 28         </font>"""
    ... )
    >>> BS.font.contents[0].strip()
    u'JUL 28'
    

    Then you just need to parse the date:

    >>> from datetime import datetime
    >>> datetime.strptime(BS.font.contents[0].strip(), '%b %d')
    datetime.datetime(1900, 7, 28, 0, 0)
    
  • 2020-12-03 13:23

    Python has a library called HTMLParser (the html.parser module in Python 3). Also see the following question on Stack Overflow, which is very similar to what you are looking for:

    How can I use the python HTMLParser library to extract data from a specific div tag?
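
    As a rough illustration of the HTMLParser approach applied to the string from the question, the sketch below subclasses the Python 3 html.parser.HTMLParser and collects text that appears inside font tags. The class name FontTextParser and the helper font_text are made up for this example.

```python
from html.parser import HTMLParser


class FontTextParser(HTMLParser):
    """Collect the text found inside <font> elements."""

    def __init__(self):
        super().__init__()
        self._depth = 0   # number of <font> tags currently open
        self.chunks = []  # text pieces seen while inside <font>

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling the handlers.
        if tag == "font":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "font" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.chunks.append(data)


def font_text(html):
    """Return the stripped text inside all <font> elements of `html`."""
    parser = FontTextParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


data = """<font face="ARIAL,HELVETICA" size="-2">  
JUL 28         </font>"""

print(font_text(data))
```

    Tracking a depth counter rather than a boolean keeps the parser correct if font tags happen to be nested.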

  • 2020-12-03 13:35

    Or, you could simply use Beautiful Soup:

    Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping
