How can I scrape an HTML table to CSV?

Backend · Unresolved · 11 answers · 1315 views
悲&欢浪女 2020-11-29 21:56

The Problem

I use a tool at work that lets me do queries and get back HTML tables of info. I do not have any kind of back-end access to it.

A lot of this info would be much more useful to me if I could get it into a spreadsheet or CSV file.

11 Answers
  • 2020-11-29 22:12

    If you're screen scraping and the table you're trying to convert has a given ID, you could always do a regex parse of the html along with some scripting to generate a CSV.
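A rough sketch of that approach might look like the following (the function name `table_to_csv` and the `id="results"` value are illustrative, and regex parsing like this assumes simple, non-nested table markup — it will break on nested tables or unusual attribute quoting):

```python
import csv
import re
import sys

def table_to_csv(html, table_id, out=sys.stdout):
    # Grab only the table with the given id (assumes no nested tables)
    m = re.search(
        r'<table[^>]*id="%s"[^>]*>(.*?)</table>' % re.escape(table_id),
        html, re.S | re.I)
    if not m:
        raise ValueError("table %r not found" % table_id)
    writer = csv.writer(out)
    for row in re.findall(r'<tr[^>]*>(.*?)</tr>', m.group(1), re.S | re.I):
        cells = re.findall(r'<t[dh][^>]*>(.*?)</t[dh]>', row, re.S | re.I)
        # Strip any leftover tags inside cells before writing the CSV row
        writer.writerow([re.sub(r'<[^>]+>', '', c).strip() for c in cells])
```

For anything beyond a quick one-off, a real HTML parser (as in the BeautifulSoup answers below) is far more robust.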

  • 2020-11-29 22:15

    Here is a tested example that combines grequests and BeautifulSoup to download large quantities of pages from a structured website:

    #!/usr/bin/python
    
    from bs4 import BeautifulSoup
    import sys
    import re
    import csv
    import grequests
    import time
    
    def cell_text(cell):
        return " ".join(cell.stripped_strings)
    
    def parse_table(body_html):
        soup = BeautifulSoup(body_html, 'html.parser')
        for table in soup.find_all('table'):
            for row in table.find_all('tr'):
                col = list(map(cell_text, row.find_all(re.compile('^t[dh]$'))))
                print(col)
    
    def process_a_page(response, *args, **kwargs): 
        parse_table(response.content)
    
    def download_a_chunk(k):
        chunk_size = 10  # number of html pages per chunk
        x = "http://www.blahblah....com/inclusiones.php?p="
        x2 = "&name=..."
        URLS = [x + str(i) + x2 for i in range(k * chunk_size, (k + 1) * chunk_size)]
        reqs = [grequests.get(url, hooks={'response': process_a_page}) for url in URLS]
        grequests.map(reqs, size=10)  # fire the requests; the hook parses each response
    
    # download slowly so the server does not block you
    for k in range(0,500):
        print("downloading chunk ",str(k))
        download_a_chunk(k)
        time.sleep(11)
    
  • 2020-11-29 22:18

    Two ways come to mind (especially for those of us who don't have Excel):

    • Google Spreadsheets has an excellent importHTML function:
      • =importHTML("http://example.com/page/with/table", "table", index)
      • index starts at 1
      • I recommend copying and pasting as values shortly after import
      • File -> Download as -> CSV
    • Python's superb Pandas library has handy read_html and to_csv functions
      • Here's a basic Python3 script that prompts for the URL, which table at that URL, and a filename for the CSV.
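A minimal sketch of what such a script could look like (the function name `html_table_to_csv` is illustrative; assumes pandas and an HTML parser backend such as lxml are installed):

```python
import pandas as pd

def html_table_to_csv(source, table_index, csv_path):
    """Read every <table> from `source` (a URL, file path, or file-like
    object) and write the chosen one to `csv_path` as CSV."""
    tables = pd.read_html(source)   # one DataFrame per <table> found
    df = tables[table_index]
    df.to_csv(csv_path, index=False)
    return df

# Interactive use could prompt for the inputs, e.g.:
#   url = input("URL of page with table: ")
#   index = int(input("Table number on that page (0-based): "))
#   html_table_to_csv(url, index, input("CSV filename: "))
```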
  • 2020-11-29 22:21

    This is my Python version using the (currently) latest version of BeautifulSoup, which can be installed with, e.g.,

    $ pip install beautifulsoup4
    

    The script reads HTML from the standard input, and outputs the text found in all tables in proper CSV format.

    #!/usr/bin/python
    from bs4 import BeautifulSoup
    import sys
    import re
    import csv
    
    def cell_text(cell):
        return " ".join(cell.stripped_strings)
    
    soup = BeautifulSoup(sys.stdin.read(), 'html.parser')
    output = csv.writer(sys.stdout)
    
    for table in soup.find_all('table'):
        for row in table.find_all('tr'):
            col = map(cell_text, row.find_all(re.compile('^t[dh]$')))
            output.writerow(col)
        output.writerow([])
    
  • 2020-11-29 22:27
    • Select the HTML table in your tool's UI and copy it to the clipboard (if that's possible).
    • Paste it into Excel.
    • Save as a CSV file.

    However, this is a manual solution, not an automated one.
