How can I scrape an HTML table to CSV?

悲&欢浪女 2020-11-29 21:56

The Problem

I use a tool at work that lets me do queries and get back HTML tables of info. I do not have any kind of back-end access to it.

A lot of this inf

11 Answers
  • 2020-11-29 22:06

    Using Python:

    For example, imagine you want to scrape forex quotes in CSV form from some site, like fxquotes.

    Then...

    # Python 3, with the beautifulsoup4 (bs4) package.
    import os
    import urllib.request

    from bs4 import BeautifulSoup

    date_s = '&date1=01/01/08'
    date_f = '&date=11/10/08'
    fx_url = 'http://www.oanda.com/convert/fxhistory?date_fmt=us'
    fx_url_end = '&lang=en&margin_fixed=0&format=CSV&redirected=1'
    cur1, cur2 = 'USD', 'AUD'
    fx_url = fx_url + date_f + date_s + '&exch=' + cur1 + '&exch2=' + cur1
    fx_url = fx_url + '&expr=' + cur2 + '&expr2=' + cur2 + fx_url_end
    data = urllib.request.urlopen(fx_url).read()
    soup = BeautifulSoup(data, 'html.parser')
    # The CSV text sits inside the first <pre> element of the response.
    data = soup.find('pre').get_text()
    file_location = '/Users/location_edit_this'
    # os.path.join supplies the path separator the plain concatenation lacked.
    file_name = os.path.join(file_location, 'usd_aus.csv')
    with open(file_name, 'w') as f:
        f.write(data)
    

    Edit: to get values from a table (example from palewire):

    # Python 3, with the mechanize and beautifulsoup4 (bs4) packages.
    from bs4 import BeautifulSoup
    from mechanize import Browser

    mech = Browser()

    url = "http://www.palewire.com/scrape/albums/2007.html"
    page = mech.open(url)

    html = page.read()
    soup = BeautifulSoup(html, "html.parser")

    # Select the table by its border attribute.
    table = soup.find("table", border="1")

    # Skip the header row, then pull each column out of the remaining rows.
    for row in table.find_all('tr')[1:]:
        col = row.find_all('td')

        rank = col[0].string
        artist = col[1].string
        album = col[2].string
        cover_link = col[3].img['src']

        record = (rank, artist, album, cover_link)
        print("|".join(record))
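
To get actual CSV rather than pipe-delimited lines, the standard csv module can take care of quoting. This is only a minimal sketch: it assumes records shaped like the (rank, artist, album, cover_link) tuples built in the loop above, and the sample values and the `records_to_csv` helper are made up for illustration.

```python
import csv
import io


def records_to_csv(records, header=None):
    """Return the records as CSV text; csv.writer quotes fields as needed."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    if header:
        writer.writerow(header)
    writer.writerows(records)
    return buf.getvalue()


# Hypothetical sample rows standing in for scraped data.
sample = [
    ("1", "Some Artist", "An Album, Deluxe", "/img/1.jpg"),
    ("2", "Another Artist", 'The "Quoted" One', "/img/2.jpg"),
]
csv_text = records_to_csv(sample, header=("rank", "artist", "album", "cover_link"))
print(csv_text)
```

Fields containing a comma or a double quote come out quoted, which a plain "|".join or ",".join would not handle.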
    
  • 2020-11-29 22:06

    Even easier (because it saves it for you for next time) ...

    In Excel

    Data/Import External Data/New Web Query

    will take you to a URL prompt. Enter your URL, and it will mark the available tables on the page for import. Voilà.

  • 2020-11-29 22:09

    Basic Python implementation using BeautifulSoup, also considering both rowspan and colspan:

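    # Python 3, with the beautifulsoup4 (bs4) package.
    from bs4 import BeautifulSoup

    def table2csv(html_txt):
        csvs = []
        soup = BeautifulSoup(html_txt, 'html.parser')
        tables = soup.find_all('table')

        for table in tables:
            csv = ''
            rows = table.find_all('tr')
            row_spans = []  # remaining row counts for active rowspan cells
            do_ident = False

            for tr in rows:
                cols = tr.find_all(['th', 'td'])

                for cell in cols:
                    colspan = int(cell.get('colspan', 1))
                    rowspan = int(cell.get('rowspan', 1))

                    # Indent rows covered by rowspan cells from earlier rows
                    # (assumes those cells sit in the leftmost columns).
                    if do_ident:
                        do_ident = False
                        csv += ',' * len(row_spans)

                    if rowspan > 1:
                        row_spans.append(rowspan)

                    # Double any embedded quotes so the field stays valid CSV.
                    text = cell.text.replace('"', '""')
                    csv += '"{text}"'.format(text=text) + ',' * colspan

                # Count down the active rowspans, dropping the finished ones.
                # pop(i) removes the expired entry itself, not the last one.
                for i in range(len(row_spans) - 1, -1, -1):
                    row_spans[i] -= 1
                    if row_spans[i] < 1:
                        row_spans.pop(i)

                do_ident = bool(row_spans)

                csv += '\n'

            csvs.append(csv)

        return '\n\n'.join(csvs)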
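
Span handling can also be done by placing every cell into a grid, which copes with rowspan cells that are not in the leftmost column. This is an alternative sketch, not the function above: `expand_spans` is a hypothetical helper, and it assumes each row has already been parsed into (text, rowspan, colspan) tuples.

```python
def expand_spans(rows):
    """Expand rows of (text, rowspan, colspan) cells into a rectangular grid.

    A spanned cell's text goes in its top-left slot; the other slots it
    covers are left as empty strings.
    """
    grid = {}  # (row_index, col_index) -> text
    for r, cells in enumerate(rows):
        c = 0
        for text, rowspan, colspan in cells:
            # Skip slots already claimed by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text if dr == 0 and dc == 0 else ""
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


# Hypothetical table: "Region" spans two rows, "Totals" spans two columns.
parsed = [
    [("Region", 2, 1), ("Q1", 1, 1), ("Q2", 1, 1)],
    [("10", 1, 1), ("20", 1, 1)],
    [("Totals", 1, 2), ("30", 1, 1)],
]
grid = expand_spans(parsed)
for line in grid:
    print(line)
```

Each inner list of the result can then be passed to csv.writer as one row.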
    
  • 2020-11-29 22:11

    Quick and dirty:

    Copy out of browser into Excel, save as CSV.

    Better solution (for long term use):

    Write a bit of code in the language of your choice that pulls the HTML contents down and scrapes out the bits that you want. You could probably add all of the data operations (sorting, averaging, etc.) on top of the data retrieval. That way, you just have to run your code and you get the actual report that you want.

    It all depends on how often you will be performing this particular task.
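
As a rough illustration of that longer-term approach, here is a minimal standard-library sketch that collects table cells row by row from an HTML string, writes them as CSV, and averages one column. The sample HTML, the `TableParser` class, and the column names are all invented for the example; for a real page the HTML would first have to be fetched, for instance with urllib.request.urlopen.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Hypothetical report HTML standing in for the tool's output.
html_text = """
<table>
  <tr><th>name</th><th>hours</th></tr>
  <tr><td>alpha</td><td>3</td></tr>
  <tr><td>beta</td><td>5</td></tr>
</table>
"""

parser = TableParser()
parser.feed(html_text)

buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerows(parser.rows)
csv_text = buf.getvalue()

# Data operations can sit on top of the scrape, e.g. averaging a column.
hours = [float(row[1]) for row in parser.rows[1:]]
average_hours = sum(hours) / len(hours)
print(csv_text)
print(average_hours)
```

This sketch ignores nested tables and rowspan/colspan; a parser library such as BeautifulSoup is likely the more comfortable choice for messy pages.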

  • 2020-11-29 22:11

    Excel can open an HTTP page.

    E.g.:

    1. Click File, Open

    2. Under filename, paste the URL, e.g.: How can I scrape an HTML table to CSV?

    3. Click OK

    Excel does its best to convert the HTML to a table.

    It's not the most elegant solution, but it does work!

  • 2020-11-29 22:11

    Have you tried opening it with Excel? If you save a spreadsheet in Excel as HTML, you'll see the format Excel uses. In a web app I wrote, I output this HTML format so the user can export to Excel.
