Good evening, I have used BeautifulSoup to extract some data from a website as follows:
from BeautifulSoup import BeautifulSoup
from urllib2 import urlopen
soup
Here is a basic approach you can try. It assumes that the headers are all in `<th>` tags and that all subsequent data is in `<td>` tags. This works for the single case you provided, but I'm sure adjustments will be necessary for other cases :) The general idea is that once you find your `table` (here using `find` to pull the first one), we get the `headers` by iterating through all `th` elements and storing their text in a list. Then we create a `rows` list that will contain lists representing the contents of each row. It is populated by finding all `td` elements under `tr` tags, taking the `text` of each, and encoding it in UTF-8 (from Unicode). You then open a CSV file, writing the `headers` first and then all of the rows, using `(row for row in rows if row)` to eliminate any blank rows:
In [117]: import csv
In [118]: from bs4 import BeautifulSoup
In [119]: from urllib2 import urlopen
In [120]: soup = BeautifulSoup(urlopen('http://www.fsa.gov.uk/about/media/facts/fines/2002'))
In [121]: table = soup.find('table', attrs={ "class" : "table-horizontal-line"})
In [122]: headers = [header.text for header in table.find_all('th')]
In [123]: rows = []
In [124]: for row in table.find_all('tr'):
   .....:     rows.append([val.text.encode('utf8') for val in row.find_all('td')])
   .....:
In [125]: with open('output_file.csv', 'wb') as f:
   .....:     writer = csv.writer(f)
   .....:     writer.writerow(headers)
   .....:     writer.writerows(row for row in rows if row)
   .....:
In [126]: cat output_file.csv
Amount,Company or person fined,Date,What was the fine for?,Compensation
" £4,000,000",Credit Suisse First Boston International ,19/12/02,Attempting to mislead the Japanese regulatory and tax authorities,
"£750,000",Royal Bank of Scotland plc,17/12/02,Breaches of money laundering rules,
"£1,000,000",Abbey Life Assurance Company ltd,04/12/02,Mortgage endowment mis-selling and other failings,Compensation estimated to be between £120 and £160 million
"£1,350,000",Royal & Sun Alliance Group,27/08/02,Pension review failings,Redress exceeding £32 million
"£4,000",F T Investment & Insurance Consultants,07/08/02,Pensions review failings,
"£75,000",Seymour Pierce Ellis ltd,18/06/02,"Breaches of FSA Principles (""skill, care and diligence"" and ""internal organization"")",
"£120,000",Ward Consultancy plc,14/05/02,Pension review failings,
"£140,000",Shawlands Financial Services ltd - formerly Frizzell Life & Financial Planning ltd),11/04/02,Record keeping and associated compliance breaches,
"£5,000",Woodward's Independent Financial Advisers,04/04/02,Pensions review failings,
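
The session above is Python 2 (`urllib2`, `'wb'`, and a manual `.encode('utf8')`). In case it helps, here is a rough Python 3 sketch of the same idea. A few things in it are my own assumptions rather than part of the original: `SAMPLE_HTML` is a cut-down stand-in for the real page (three columns, two rows borrowed from the output above) so that it runs without a network connection, the names `table_to_rows` and `write_csv` are made up for illustration, the parser is pinned to `"html.parser"`, and `get_text(strip=True)` trims surrounding whitespace, which the original `.text` did not do.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, so the sketch runs offline.
SAMPLE_HTML = """
<table class="table-horizontal-line">
  <tr><th>Amount</th><th>Company or person fined</th><th>Date</th></tr>
  <tr><td>£750,000</td><td>Royal Bank of Scotland plc</td><td>17/12/02</td></tr>
  <tr><td>£4,000</td><td>F T Investment &amp; Insurance Consultants</td><td>07/08/02</td></tr>
</table>
"""


def table_to_rows(html, css_class="table-horizontal-line"):
    """Return (headers, rows) for the first table with the given class."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", attrs={"class": css_class})
    if table is None:
        raise ValueError("no table with class %r found" % css_class)
    # Same assumption as before: headers live in <th>, data in <td>.
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # the header row has no <td> cells, so it is skipped
            rows.append(cells)
    return headers, rows


def write_csv(headers, rows, stream):
    """Write the header line followed by all data rows to a text stream."""
    writer = csv.writer(stream)
    writer.writerow(headers)
    writer.writerows(rows)


headers, rows = table_to_rows(SAMPLE_HTML)
buffer = io.StringIO()  # in-memory text stream instead of a real file
write_csv(headers, rows, buffer)
csv_text = buffer.getvalue()
```

To write to disk instead of the in-memory buffer, pass a file opened in text mode, e.g. `open('output_file.csv', 'w', newline='', encoding='utf-8')`. In Python 3 the `csv` module works with `str`, so the per-cell `.encode('utf8')` step is no longer needed.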