How can I scrape an HTML table to CSV?

Backend · Unresolved · 11 answers · 1315 views
悲&欢浪女 2020-11-29 21:56

The Problem

I use a tool at work that lets me do queries and get back HTML tables of info. I do not have any kind of back-end access to it.

A lot of this info would be much more useful to me if I could get it into a spreadsheet or CSV file.

11 Answers
  • 2020-11-29 22:12

    If you're screen scraping and the table you're trying to convert has a given ID, you could always do a regex parse of the html along with some scripting to generate a CSV.
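A rough sketch of that approach might look like the following (the function name `table_to_csv` and the `id="results"` value are illustrative, and regex parsing like this assumes simple, non-nested table markup — it will break on nested tables or unusual attribute quoting):

```python
import csv
import re
import sys

def table_to_csv(html, table_id, out=sys.stdout):
    # Grab only the table with the given id (assumes no nested tables)
    m = re.search(
        r'<table[^>]*id="%s"[^>]*>(.*?)</table>' % re.escape(table_id),
        html, re.S | re.I)
    if not m:
        raise ValueError("table %r not found" % table_id)
    writer = csv.writer(out)
    for row in re.findall(r'<tr[^>]*>(.*?)</tr>', m.group(1), re.S | re.I):
        cells = re.findall(r'<t[dh][^>]*>(.*?)</t[dh]>', row, re.S | re.I)
        # Strip any leftover tags inside cells before writing the CSV row
        writer.writerow([re.sub(r'<[^>]+>', '', c).strip() for c in cells])
```

For anything beyond a quick one-off, a real HTML parser (as in the BeautifulSoup answers below) is far more robust.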

  • 2020-11-29 22:15

    Here is a tested example that combines grequests and BeautifulSoup to download large quantities of pages from a structured website:

    #!/usr/bin/python
    
    from bs4 import BeautifulSoup
    import sys
    import re
    import csv
    import grequests
    import time
    
    def cell_text(cell):
        return " ".join(cell.stripped_strings)
    
    def parse_table(body_html):
        soup = BeautifulSoup(body_html, 'html.parser')
        for table in soup.find_all('table'):
            for row in table.find_all('tr'):
                col = list(map(cell_text, row.find_all(re.compile('^t[dh]$'))))
                print(col)
    
    def process_a_page(response, *args, **kwargs): 
        parse_table(response.content)
    
    def download_a_chunk(k):
        chunk_size = 10  # number of html pages per chunk
        x = "http://www.blahblah....com/inclusiones.php?p="
        x2 = "&name=..."
        URLS = [x + str(i) + x2 for i in range(k * chunk_size, (k + 1) * chunk_size)]
        reqs = [grequests.get(url, hooks={'response': process_a_page}) for url in URLS]
        grequests.map(reqs, size=10)  # fire the requests; the hook parses each response
    
    # download slowly so the server does not block you
    for k in range(0,500):
        print("downloading chunk ",str(k))
        download_a_chunk(k)
        time.sleep(11)
    
  • 2020-11-29 22:18

    Two ways come to mind (especially for those of us who don't have Excel):

    • Google Spreadsheets has an excellent importHTML function:
      • =importHTML("http://example.com/page/with/table", "table", index)
      • index starts at 1
      • I recommend copying and pasting as values shortly after import
      • File -> Download as -> CSV
    • Python's superb Pandas library has handy read_html and to_csv functions
      • Here's a basic Python3 script that prompts for the URL, which table at that URL, and a filename for the CSV.
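A minimal sketch of what such a script could look like (the function name `html_table_to_csv` is illustrative; assumes pandas and an HTML parser backend such as lxml are installed):

```python
import pandas as pd

def html_table_to_csv(source, table_index, csv_path):
    """Read every <table> from `source` (a URL, file path, or file-like
    object) and write the chosen one to `csv_path` as CSV."""
    tables = pd.read_html(source)   # one DataFrame per <table> found
    df = tables[table_index]
    df.to_csv(csv_path, index=False)
    return df

# Interactive use could prompt for the inputs, e.g.:
#   url = input("URL of page with table: ")
#   index = int(input("Table number on that page (0-based): "))
#   html_table_to_csv(url, index, input("CSV filename: "))
```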
  • 2020-11-29 22:21

    This is my Python version using the (currently) latest version of BeautifulSoup, which can be installed with, e.g.,

    $ pip install beautifulsoup4
    

    The script reads HTML from the standard input, and outputs the text found in all tables in proper CSV format.

    #!/usr/bin/python
    from bs4 import BeautifulSoup
    import sys
    import re
    import csv
    
    def cell_text(cell):
        return " ".join(cell.stripped_strings)
    
    soup = BeautifulSoup(sys.stdin.read(), 'html.parser')
    output = csv.writer(sys.stdout)
    
    for table in soup.find_all('table'):
        for row in table.find_all('tr'):
            col = map(cell_text, row.find_all(re.compile('^t[dh]$')))
            output.writerow(col)
        output.writerow([])
    
  • 2020-11-29 22:27
    • Select the HTML table in your tool's UI and copy it to the clipboard (if that's possible).
    • Paste it into Excel.
    • Save as a CSV file.

    However, this is a manual solution, not an automated one.
