Beautiful Soup Nested Tag Search

孤城傲影 2021-01-12 04:35

I am trying to write a Python program that counts the words on a web page. I use Beautiful Soup 4 to scrape the page, but I have difficulty accessing nested HTML tags (

3 answers
  •  离开以前
    2021-01-12 05:35

    UPDATE: I noticed that text does not always return the expected result. Beautiful Soup also has a built-in way to get the text: the docs describe a method called get_text(). Use it like this:

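    from bs4 import BeautifulSoup

    # Read the saved page; the with block closes the file automatically
    with open('index.html', 'r') as fd:
        website = fd.read()

    soup = BeautifulSoup(website, "html.parser")  # name the parser explicitly
    contents = soup.get_text(separator=" ")
    # split() with no argument splits on any run of whitespace
    print("number of words %d" % len(contents.split()))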
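
    One detail worth checking when counting words in the output of get_text(): splitting on a literal space and splitting on any whitespace give different counts. Below is a minimal sketch of the difference. The helper name count_words and the sample string are made up for illustration and are not part of the original answer.

```python
def count_words(text):
    """Count whitespace-separated tokens in text."""
    # str.split() with no argument splits on runs of spaces, tabs and
    # newlines, and never produces empty strings.
    return len(text.split())


# Hypothetical sample: three spaces in a row, then a newline
sample = "Hello   world\nfoo"

# split(" ") yields an empty string for each extra space and
# does not break on the newline at all.
naive_count = len(sample.split(" "))

print("split(' ') count:", naive_count)
print("split() count:", count_words(sample))
```

    Text extracted from HTML usually contains newlines and repeated spaces, so the no-argument form is the safer default here.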
    

    INCORRECT, please read above. Supposing that you have your HTML file saved locally as index.html, you can:

    from bs4 import BeautifulSoup
    import re

    BLACKLIST = ["html", "head", "title", "script"]  # tags to be ignored

    with open('index.html', 'r') as fd:
        website = fd.read()

    soup = BeautifulSoup(website, "html.parser")
    tags = soup.find_all(True)  # find every tag
    print("there are %d tags" % len(tags))

    count = 0
    matcher = re.compile(r"\s+")  # runs of whitespace, including \n
    for tag in tags:
        if tag.name.lower() in BLACKLIST:
            continue
        # tag.text also contains the text of nested tags, so words inside
        # nested tags are counted once per enclosing tag
        temp = matcher.split(tag.text)  # split on whitespace
        temp = [word for word in temp if word]  # remove empty elements
        count += len(temp)
    print("number of words in the document %d" % count)

    Please note that the count may not be accurate, for several reasons: formatting errors in the HTML, false positives (every word is counted, even if it is code), text that is shown dynamically using JavaScript or CSS, and so on.
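
    One way to reduce the false positives from code is to drop script and style tags before extracting the text. The following is only a sketch under a few assumptions: the function name count_visible_words and the sample markup are hypothetical, and the stdlib "html.parser" backend is assumed. It does not address text that JavaScript generates at runtime.

```python
from bs4 import BeautifulSoup


def count_visible_words(html):
    """Count words in html after dropping script and style content."""
    soup = BeautifulSoup(html, "html.parser")
    # Calling the soup object is shorthand for find_all(); decompose()
    # removes each matched tag and everything inside it from the tree.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # get_text() walks nested tags, so each word is collected only once
    return len(soup.get_text(separator=" ").split())


# Hypothetical sample page with nested tags plus script and style blocks
sample = (
    "<html><head><script>var x = 1;</script>"
    "<style>p { color: red; }</style></head>"
    "<body><div><p>Hello <b>nested</b> world</p></div></body></html>"
)

print("number of words %d" % count_visible_words(sample))
```

    Because get_text() is called once on the whole document, there is no need to visit nested tags one by one.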
