How to find the comment tag <!--…--> with BeautifulSoup?

醉话见心 2020-12-01 12:54

I tried soup.find('!--') but it doesn't seem to work. Thanks in advance.

Edit: Thanks for the tip on how to find all comments. I have a follow up question. How d

2 Answers
  • 2020-12-01 13:16

    You can find all the comments in a document with the findAll method. The "Removing elements" example in the Beautiful Soup documentation shows how to do exactly what you're trying to do.

    In brief, you want this:

    from bs4 import Comment  # assuming Beautiful Soup 4
    comments = soup.findAll(text=lambda text: isinstance(text, Comment))
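
    As a self-contained illustration of the filter above, here is a minimal sketch assuming Beautiful Soup 4 (the bs4 package) and the standard-library "html.parser" backend. The sample markup is hypothetical, and string= is the Beautiful Soup 4 spelling of the text= argument. It also shows why soup.find('!--') comes back empty: a comment is a text node, not a tag.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample markup, shaped like the comments discussed in this thread.
html = """
<div>
  <!-- <span class="titlefont"> <i>Wednesday 110518</i>(05:00PM)<br /></span> -->
  <p>Visible text</p>
  <!-- second comment -->
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find() looks for tags by name; there is no tag called "!--".
print(soup.find("!--"))

# Filter text nodes by type instead: Comment is a subclass of NavigableString.
comments = soup.find_all(string=lambda node: isinstance(node, Comment))

for comment in comments:
    # Each Comment behaves like a str holding the text between <!-- and -->.
    print(repr(comment.strip()))
```

    Each item in comments is still attached to the parse tree, so calling comment.extract() on it should remove that comment from the document if that is the goal.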
    

    Edit: If you're trying to search within the comments, you can try:

    import re
    comments = soup.findAll(text=lambda text: isinstance(text, Comment))
    for comment in comments:
        # re.search scans the whole comment; re.match would only try position 0
        found = re.search(r'<i>([^<]*)</i>', comment)
        if found:
            print(found.group(1))
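
    One detail worth checking when pulling text out of a comment: re.match only tries the very start of the string, and comment text usually begins with whitespace or another tag, so re.search is the safer call. A minimal standard-library sketch follows; the comment string is copied from the sample markup in the other answer, and extract_italic is a hypothetical helper name.

```python
import re

# Hypothetical comment body, in the shape used by the examples in this thread.
comment_text = ' <span class="titlefont"> <i>Wednesday 110518</i>(05:00PM)<br /></span> '

ITALIC = re.compile(r"<i>([^<]*)</i>")


def extract_italic(text):
    """Return the text inside the first <i>...</i> pair, or None if absent."""
    found = ITALIC.search(text)
    return found.group(1) if found else None


# match() is anchored at position 0, where this string has a space, not "<i>".
print(ITALIC.match(comment_text))

# search() scans forward until the pattern fits.
print(extract_italic(comment_text))

# Comments without an <i> element yield None instead of raising an error.
print(extract_italic(" some unrelated comment "))
```

    Returning None for non-matching comments avoids the AttributeError that calling .group(1) on a failed match would raise.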
    
  • 2020-12-01 13:31

    Pyparsing lets you search for HTML comments using its built-in htmlComment expression, and attach parse-time callbacks to validate and extract the data fields within each comment:

    from pyparsing import makeHTMLTags, oneOf, withAttribute, Word, nums, Group, htmlComment
    import calendar
    
    # have pyparsing define tag start/end expressions for the 
    # tags we want to look for inside the comments
    span,spanEnd = makeHTMLTags("span")
    i,iEnd = makeHTMLTags("i")
    
    # only want spans with class=titlefont
    span.addParseAction(withAttribute(**{'class':'titlefont'}))
    
    # define what specifically we are looking for in this comment
    weekdayname = oneOf(list(calendar.day_name))
    integer = Word(nums)
    dateExpr = Group(weekdayname("day") + integer("daynum"))
    commentBody = '<!--' + span + i + dateExpr("date") + iEnd
    
    # define a parse action to attach to the standard htmlComment expression,
    # to extract only what we want (or raise a ParseException in case 
    # this is not one of the comments we're looking for)
    def grabCommentContents(tokens):
        return commentBody.parseString(tokens[0])
    htmlComment.addParseAction(grabCommentContents)
    
    
    # let's try it
    htmlsource = """
    want to match this one
    <!-- <span class="titlefont"> <i>Wednesday 110518</i>(05:00PM)<br /></span> -->
    
    don't want the next one, wrong span class
    <!-- <span class="bodyfont"> <i>Wednesday 110519</i>(05:00PM)<br /></span> -->
    
    not even a span tag!
    <!-- some other text with a date in italics <i>Wednesday 110520</i>(05:00PM)<br /></span> -->
    
    another matching comment, on a different day
    <!-- <span class="titlefont"> <i>Thursday 110521</i>(05:00PM)<br /></span> -->
    """
    
    for comment in htmlComment.searchString(htmlsource):
        parsedDate = comment.date
        # date info can be accessed like elements in a list
        print(parsedDate[0], parsedDate[1])
        # because we named the expressions within the dateExpr Group
        # we can also get at them by name (this is much more robust, and
        # easier to maintain/update later)
        print(parsedDate.day)
        print(parsedDate.daynum)
        print()
    

    Prints:

    Wednesday 110518
    Wednesday
    110518
    
    Thursday 110521
    Thursday
    110521
    