beautifulSoup html csv

野的像风 2021-02-05 18:51

Good evening, I have used BeautifulSoup to extract some data from a website as follows:

from BeautifulSoup import BeautifulSoup
from urllib2 import urlopen

soup         


        
1 Answer
  • 2021-02-05 19:34

    Here is a basic approach you can try. It assumes that the headers are all in <th> tags and that all subsequent data is in <td> tags. This works for the single case you provided, but adjustments will probably be necessary for other cases :)

    The general idea is that once you find your table (here using find to pull the first one with the matching class), you get the headers by iterating through all th elements and storing their text in a list. You then create a rows list that will contain one list per row. It is populated by finding all td elements under each tr tag, taking the text, and encoding it as UTF-8 (from Unicode). Finally, you open a CSV file and write the headers first, then all of the rows, using (row for row in rows if row) to eliminate any blank rows:

    In [117]: import csv
    
    In [118]: from bs4 import BeautifulSoup
    
    In [119]: from urllib2 import urlopen
    
    In [120]: soup = BeautifulSoup(urlopen('http://www.fsa.gov.uk/about/media/facts/fines/2002'))
    
    In [121]: table = soup.find('table', attrs={ "class" : "table-horizontal-line"})
    
    In [122]: headers = [header.text for header in table.find_all('th')]
    
    In [123]: rows = []
    
    In [124]: for row in table.find_all('tr'):
       .....:     rows.append([val.text.encode('utf8') for val in row.find_all('td')])
       .....: 
    
    In [125]: with open('output_file.csv', 'wb') as f:
       .....:     writer = csv.writer(f)
       .....:     writer.writerow(headers)
       .....:     writer.writerows(row for row in rows if row)
       .....: 
    
    In [126]: cat output_file.csv
    Amount,Company or person fined,Date,What was the fine for?,Compensation
    " £4,000,000",Credit Suisse First Boston International ,19/12/02,Attempting to mislead the Japanese regulatory and tax authorities, 
    "£750,000",Royal Bank of Scotland plc,17/12/02,Breaches of money laundering rules, 
    "£1,000,000",Abbey Life Assurance Company ltd,04/12/02,Mortgage endowment mis-selling and other failings,Compensation estimated to be between £120 and £160 million
    "£1,350,000",Royal & Sun Alliance Group,27/08/02,Pension review failings,Redress exceeding £32 million
    "£4,000",F T Investment & Insurance Consultants,07/08/02,Pensions review failings, 
    "£75,000",Seymour Pierce Ellis ltd,18/06/02,"Breaches of FSA Principles (""skill, care and diligence"" and ""internal organization"")", 
    "£120,000",Ward Consultancy plc,14/05/02,Pension review failings, 
    "£140,000",Shawlands Financial Services ltd - formerly Frizzell Life & Financial Planning ltd),11/04/02,Record keeping and associated compliance breaches, 
    "£5,000",Woodward's Independent Financial Advisers,04/04/02,Pensions review failings, 
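
The session above is Python 2 (urllib2, the 'wb' file mode, and the explicit UTF-8 encode). As a rough sketch of the same idea on Python 3, something like the following might work. It assumes bs4 is installed, parses a small inline sample instead of fetching the live page, and the names SAMPLE_HTML, table_to_rows and rows_to_csv are made up for illustration, not part of any library.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (an assumption for this sketch);
# only the table class name is taken from the session above.
SAMPLE_HTML = """
<table class="table-horizontal-line">
  <tr><th>Amount</th><th>Company or person fined</th><th>Date</th></tr>
  <tr><td>£750,000</td><td>Royal Bank of Scotland plc</td><td>17/12/02</td></tr>
  <tr><td>£4,000</td><td>F T Investment &amp; Insurance Consultants</td><td>07/08/02</td></tr>
</table>
"""


def table_to_rows(html, css_class="table-horizontal-line"):
    """Return (headers, rows) for the first table with the given class."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", attrs={"class": css_class})
    if table is None:
        raise ValueError("no matching table found")
    # Headers come from every <th> in the table.
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # the header row has no <td> cells, so it is skipped here
            rows.append(cells)
    return headers, rows


def rows_to_csv(headers, rows, fileobj):
    """Write the header line followed by the data rows to a text file object."""
    writer = csv.writer(fileobj)
    writer.writerow(headers)
    writer.writerows(rows)


headers, rows = table_to_rows(SAMPLE_HTML)
buffer = io.StringIO()  # in-memory file, so the sketch touches no disk
rows_to_csv(headers, rows, buffer)
csv_text = buffer.getvalue()
```

To write a real file instead of the in-memory buffer, pass a file opened in text mode, for example open('output_file.csv', 'w', newline='', encoding='utf-8'). On Python 3 the csv module expects str values, so the cell text is not encoded by hand.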
    