Beautiful Soup Nested Tag Search

孤城傲影 2021-01-12 04:35

I am trying to write a Python program that counts the words on a web page. I use Beautiful Soup 4 to scrape the page, but I have difficulty accessing nested HTML tags (

3 Answers
  • 2021-01-12 05:14

    I'm guessing that what you are trying to do is first look in a specific div tag, then search all p tags inside it and count them or do whatever you want. For example:

    import bs4

    # content is assumed to hold the page's HTML as a string
    soup = bs4.BeautifulSoup(content, 'html.parser')

    # Get the first div with class "some_class"
    div_container = soup.find('div', class_='some_class')

    # Then search inside that div for all p tags with class "hello"
    for ptag in div_container.find_all('p', class_='hello'):
        # Print the text of each p tag
        print(ptag.text)


    Hope that helps
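
    A self-contained sketch of the same idea, in case it helps. The HTML snippet and the class names some_class and hello are made up for illustration. select() takes a CSS selector and does the nested lookup in one call, and the None check covers pages where the div is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical page content, made up for this example
html = """
<div class="some_class">
  <p class="hello">first nested paragraph</p>
  <p>paragraph without the class</p>
  <p class="hello">second one</p>
</div>
<p class="hello">outside the div</p>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns None when nothing matches, so guard before searching inside it
div_container = soup.find("div", class_="some_class")
nested = div_container.find_all("p", class_="hello") if div_container else []
texts = [p.get_text() for p in nested]

# Equivalent lookup with a CSS selector: p.hello anywhere inside div.some_class
selected = [p.get_text() for p in soup.select("div.some_class p.hello")]

# Count the words in the nested paragraphs only
word_count = sum(len(t.split()) for t in texts)
print(texts, word_count)
```

    The paragraph outside the div has the right class but is skipped, because both lookups are scoped to the container.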

  • 2021-01-12 05:35

    UPDATE: I noticed that text does not always return the expected result. There is also a built-in way to get the text: the docs describe a method called get_text(). Use it like this:

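    from bs4 import BeautifulSoup

    # Read the local HTML file; the with block closes it automatically
    with open('index.html', 'r') as fd:
        website = fd.read()

    soup = BeautifulSoup(website, 'html.parser')
    contents = soup.get_text(separator=" ")
    # split() with no argument splits on any whitespace and drops empty strings
    print("number of words %d" % len(contents.split()))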
    

    INCORRECT, please read the update above. Supposing that you have your HTML file saved locally as index.html, you can do this:

    from bs4 import BeautifulSoup
    import re

    BLACKLIST = ["html", "head", "title", "script"]  # tags to be ignored

    with open('index.html', 'r') as fd:
        website = fd.read()

    soup = BeautifulSoup(website, 'html.parser')
    tags = soup.find_all(True)  # find every tag
    print("there are %d tags" % len(tags))

    count = 0
    # Non-capturing group, so split() does not return the separators as words
    matcher = re.compile(r"(?:\s|<br>)+")
    for tag in tags:
        if tag.name.lower() in BLACKLIST:
            continue
        # tag.text includes the text of all descendants, so text in
        # nested tags is counted more than once
        temp = matcher.split(tag.text)  # split on tokens such as \s and \n
        temp = [word for word in temp if word]  # remove empty strings
        count += len(temp)
    print("number of words in the document %d" % count)
    

    Please note that the count may not be accurate, because of formatting errors, false positives (it counts any word, even if it is code), text that is shown dynamically using JavaScript or CSS, or other reasons.
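
    One way to reduce the false positives from embedded code is to drop script and style tags before calling get_text(). A minimal sketch, assuming the made-up sample HTML below. It does nothing about text that JavaScript injects at runtime.

```python
from bs4 import BeautifulSoup

# Hypothetical page, made up for this example
html = """
<html><head><title>Demo page</title>
<style>p { color: red; }</style></head>
<body>
  <div id="main"><p>Hello <b>nested</b> world</p></div>
  <script>var x = "not a word";</script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Remove tags whose contents are code rather than visible text.
# Calling soup(...) is shorthand for soup.find_all(...).
for tag in soup(["script", "style"]):
    tag.decompose()

# split() with no argument splits on any run of whitespace
words = soup.get_text(separator=" ").split()
print("number of words: %d" % len(words))
```

    The title text is still counted here. Add "title" to the list if it should be ignored as well.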

  • 2021-01-12 05:38

    Try this one:

    data = []
    for nested_soup in soup.find_all('xyz'):
        # collect every abc tag found inside this xyz tag
        data.extend(nested_soup.find_all('abc'))
    # data now holds all the nested abc tags in one list
    

    Maybe you can turn it into a lambda and make it cool, but this works. Thanks.
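
    For what it's worth, the loop can also be written as a list comprehension or as a single CSS selector. This is only a sketch on made-up markup, using the same placeholder tag names xyz and abc. As far as I know, select() returns each match once even when xyz tags are nested inside each other, while the loop would collect such matches repeatedly.

```python
from bs4 import BeautifulSoup

# Made-up markup using the placeholder tag names xyz and abc
html = (
    "<xyz><abc>one</abc><q><abc>two</abc></q></xyz>"
    "<abc>three</abc>"
    "<xyz><abc>four</abc></xyz>"
)
soup = BeautifulSoup(html, "html.parser")

# Flatten the nested search into one list
data = [abc for xyz in soup.find_all("xyz") for abc in xyz.find_all("abc")]

# Same lookup with a descendant selector: abc anywhere inside xyz
selected = soup.select("xyz abc")

data_texts = [tag.get_text() for tag in data]
selected_texts = [tag.get_text() for tag in selected]
print(data_texts)
```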
