Pandas read_html results in TypeError

耶瑟儿~ 2021-01-14 06:02

I'm using bs4 to parse an HTML page and extract a table (sample table given below), and I'm trying to load it into pandas, but when I call pddataframe = pd.read_html(LOT

3 Answers
  • 2021-01-14 06:39

    This exact code works for me.

    htm = """<table cellpadding="5" cellspacing="0" class="borders" width="100%">
        <tr>
         <th colspan="2">
          Learning Outcomes
         </th>
        </tr>
        <tr>
         <td class="info" colspan="2">
          On successful completion of this module the learner will be able to:
         </td>
        </tr>
        <tr>
         <td style="width:10%;">
          LO1
         </td>
         <td>
          Demonstrate an awareness of the important role of Financial Accounting information as an input into the decision making process.
         </td>
        </tr>
        <tr>
         <td style="width:10%;">
          LO2
         </td>
         <td>
          Display an understanding of the fundamental accounting concepts, principles and conventions that underpin the preparation of Financial statements.
         </td>
        </tr>
        <tr>
         <td style="width:10%;">
          LO3
         </td>
         <td>
          Understand the various formats in which  information in relation to transactions or events is recorded and classified.
         </td>
        </tr>
        <tr>
         <td style="width:10%;">
          LO4
         </td>
         <td>
          Apply a knowledge of accounting concepts,conventions and techniques such as double entry to the  posting of  recorded information to the T accounts in the Nominal Ledger.
         </td>
        </tr>
        <tr>
         <td style="width:10%;">
          LO5
         </td>
         <td>
          Prepare and present the financial statements of a Sole Trader  in prescribed format from a Trial Balance  accompanies by notes with additional information.
         </td>
        </tr>
       </table> 
    """
    
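    import pandas as pd
    from io import StringIO

    # Wrap the markup in StringIO: recent pandas versions deprecate passing a literal HTML string.
    pd.read_html(StringIO(htm), skiprows=2, flavor='bs4')[0]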
    

  • 2021-01-14 06:40

    Pandas can guess: read_html also works here with no extra arguments.

    HTML = '''\
    <table cellpadding="5" cellspacing="0" class="borders" width="100%">
        <tr>
         <th colspan="2">
          Learning Outcomes
         </th>
    
    
    ... omitting most of what you had here
    
    
          Prepare and present the financial statements of a Sole Trader  in prescribed format from a Trial Balance  accompanies by notes with additional information.
         </td>
        </tr>
       </table>'''
    
    from io import StringIO
    import pandas as pd
    
    # read_html returns a list of DataFrames, one per <table> it finds
    df = pd.read_html(StringIO(HTML))
    print(df)
    

    Result:

    [                                                   0  \
    0                                  Learning Outcomes   
    1  On successful completion of this module the le...   
    2                                                LO1   
    3                                                LO2   
    4                                                LO3   
    5                                                LO4   
    6                                                LO5   
    
                                                       1  
    0                                                NaN  
    1                                                NaN  
    2  Demonstrate an awareness of the important role...  
    3  Display an understanding of the fundamental ac...  
    4  Understand the various formats in which inform...  
    5  Apply a knowledge of accounting concepts,conve...  
    6  Prepare and present the financial statements o...  ]
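The brackets around the result show that read_html returns a list of DataFrames rather than a single frame. A minimal sketch of pulling out the one table and dropping a leading heading row afterwards; the table contents here are made up for illustration and are not the asker's data:

```python
from io import StringIO

import pandas as pd

# Hypothetical shortened table; the cell text is invented for this sketch.
HTML = """<table>
  <tr><td>Learning Outcomes</td><td>Intro text</td></tr>
  <tr><td>LO1</td><td>First outcome</td></tr>
  <tr><td>LO2</td><td>Second outcome</td></tr>
</table>"""

tables = pd.read_html(StringIO(HTML))  # a list, one DataFrame per <table>
df = tables[0]

# Drop the leading heading row and renumber the index from 0.
outcomes = df.iloc[1:].reset_index(drop=True)
print(outcomes)
```

Slicing after parsing is one alternative to skiprows; which is cleaner depends on whether the rows to drop are known before the table is read.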
    
  • 2021-01-14 06:45

    Thanks for the pointers in the answers and comments. My rookie mistake was that the table was still a bs4 object in a variable after I extracted it, not a string. I was running pd.read_html(LOTable, skiprows=2, flavor='bs4') when I needed to run pd.read_html(LOTable.prettify(), skiprows=2, flavor='bs4').
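A minimal sketch of the fix described above, assuming a small made-up table in place of the real page. The names html and lo_table are hypothetical stand-ins for the asker's data and LOTable variable:

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical two-row table standing in for the page in the question.
html = """
<table class="borders">
  <tr><td>LO1</td><td>First outcome</td></tr>
  <tr><td>LO2</td><td>Second outcome</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
lo_table = soup.find("table", class_="borders")  # a bs4 Tag, not a str

# Serialize the Tag back to markup before handing it to pandas.
markup = lo_table.prettify()

# StringIO keeps this working on pandas versions that deprecate literal HTML strings.
df = pd.read_html(StringIO(markup))[0]
print(df)
```

The sketch leaves out flavor and skiprows so that pandas picks whichever parser is installed; both arguments can be added back as in the original call.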
