I use a tool at work that lets me do queries and get back HTML tables of info. I do not have any kind of back-end access to it.
A lot of this inf
If you're screen scraping and the table you're trying to convert has a given ID, you could always do a regex parse of the html along with some scripting to generate a CSV.
Here is a tested example that combines grequest and soup to download large quantities of pages from a structured website:
#!/usr/bin/python
from bs4 import BeautifulSoup
import sys
import re
import csv
import grequests
import time
def cell_text(cell):
return " ".join(cell.stripped_strings)
def parse_table(body_html):
soup = BeautifulSoup(body_html)
for table in soup.find_all('table'):
for row in table.find_all('tr'):
col = map(cell_text, row.find_all(re.compile('t[dh]')))
print(col)
def process_a_page(response, *args, **kwargs):
parse_table(response.content)
def download_a_chunk(k):
chunk_size = 10 #number of html pages
x = "http://www.blahblah....com/inclusiones.php?p="
x2 = "&name=..."
URLS = [x+str(i)+x2 for i in range(k*chunk_size, k*(chunk_size+1)) ]
reqs = [grequests.get(url, hooks={'response': process_a_page}) for url in URLS]
resp = grequests.map(reqs, size=10)
# download slowly so the server does not block you
for k in range(0,500):
print("downloading chunk ",str(k))
download_a_chunk(k)
time.sleep(11)
Two ways come to mind (especially for those of us that don't have Excel):
=importHTML("http://example.com/page/with/table", "table", index
copy
and paste values
shortly after importThis is my python version using the (currently) latest version of BeautifulSoup which can be obtained using, e.g.,
$ sudo easy_install beautifulsoup4
The script reads HTML from the standard input, and outputs the text found in all tables in proper CSV format.
#!/usr/bin/python
from bs4 import BeautifulSoup
import sys
import re
import csv
def cell_text(cell):
return " ".join(cell.stripped_strings)
soup = BeautifulSoup(sys.stdin.read())
output = csv.writer(sys.stdout)
for table in soup.find_all('table'):
for row in table.find_all('tr'):
col = map(cell_text, row.find_all(re.compile('t[dh]')))
output.writerow(col)
output.writerow([])
However, this is a manual solution not an automated one.