Converting HTML list to nested Python list

独厮守ぢ 2021-01-03 14:08

If I have a nested HTML (unordered) list that looks like this:

  • Page1_Level1
2 Answers
  • 2021-01-03 14:42

    You can take a recursive approach:

    from pprint import pprint
    from bs4 import BeautifulSoup
    
    text = """your html goes here"""
    
    def find_li(element):
        return [{li.a['href']: find_li(li)}
                for ul in element('ul', recursive=False)
                for li in ul('li', recursive=False)]
    
    
    soup = BeautifulSoup(text, 'html.parser')
    data = find_li(soup)
    pprint(data)
    

    Prints:

    [{u'Page1_Level1.html': [{u'Page1_Level2.html': [{u'Page1_Level3.html': []},
                                                     {u'Page2_Level3.html': []},
                                                     {u'Page3_Level3.html': []}]}]},
     {u'Page2_Level1.html': [{u'Page2_Level2.html': []}]}]
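
Since the markup from the question is not shown in full, here is a sketch of the same recursive approach run against a concrete input. The HTML below is an assumption, reconstructed from the printed output above rather than taken from the question. On Python 3 the result has no `u` prefixes.

```python
from pprint import pprint
from bs4 import BeautifulSoup

# Hypothetical input, inferred from the output shown in the answer.
text = """
<ul>
  <li><a href="Page1_Level1.html">Page1_Level1</a>
    <ul>
      <li><a href="Page1_Level2.html">Page1_Level2</a>
        <ul>
          <li><a href="Page1_Level3.html">Page1_Level3</a></li>
          <li><a href="Page2_Level3.html">Page2_Level3</a></li>
          <li><a href="Page3_Level3.html">Page3_Level3</a></li>
        </ul>
      </li>
    </ul>
  </li>
  <li><a href="Page2_Level1.html">Page2_Level1</a>
    <ul>
      <li><a href="Page2_Level2.html">Page2_Level2</a></li>
    </ul>
  </li>
</ul>
"""


def find_li(element):
    # recursive=False restricts each search to direct children, so every
    # level of the list is handled by exactly one call.
    return [{li.a['href']: find_li(li)}
            for ul in element('ul', recursive=False)
            for li in ul('li', recursive=False)]


soup = BeautifulSoup(text, 'html.parser')
data = find_li(soup)
pprint(data)
```

Note that `li.a` picks the first link anywhere inside the item, so this sketch assumes every `<li>` starts with its own `<a>` before any nested `<ul>`.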
    

    FYI, here is why I had to use html.parser here:

    • Don't put html, head and body tags automatically, beautifulsoup
  • 2021-01-03 14:50

    This is an outline of a possible solution:

    # The variable 'markup' contains the HTML string.
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(markup, 'html.parser')
    for a in soup.descendants:
        # Construct a nested list while walking the descendants; each node's
        # parent id tells you which list it belongs in.
        print(id(a), id(a.parent) if a.parent is not None else None, a)
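
The outline above stops short of building the list. One way to finish the idea is to walk the document in order and keep a stack of the `<li>` items that are still open. The sketch below swaps in the standard library's `html.parser.HTMLParser` for BeautifulSoup, so it is not the answerer's code. The class name `ListTreeBuilder` and the sample `markup` are hypothetical, and it assumes every `<li>` has an explicit closing tag.

```python
from html.parser import HTMLParser


class ListTreeBuilder(HTMLParser):
    """Collect nested <ul>/<li>/<a> markup as [{href: [children]}, ...]."""

    def __init__(self):
        super().__init__()
        self.root = []        # finished top-level items
        self.open_items = []  # stack of [href, children] for unclosed <li> tags

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.open_items.append([None, []])
        elif tag == "a" and self.open_items and self.open_items[-1][0] is None:
            # The first link inside the innermost open <li> names that item.
            self.open_items[-1][0] = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "li" and self.open_items:
            href, children = self.open_items.pop()
            # Attach to the enclosing <li>, or to the root if there is none.
            parent = self.open_items[-1][1] if self.open_items else self.root
            parent.append({href: children})


# Hypothetical sample input.
markup = """
<ul>
  <li><a href="Page1_Level1.html">Page1_Level1</a>
    <ul>
      <li><a href="Page1_Level2.html">Page1_Level2</a></li>
    </ul>
  </li>
  <li><a href="Page2_Level1.html">Page2_Level1</a></li>
</ul>
"""

builder = ListTreeBuilder()
builder.feed(markup)
builder.close()
print(builder.root)
```

The stack plays the role of the parent lookup in the outline: whichever `<li>` is on top when an item closes is that item's parent.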
    