Question
I am trying to pull locations and names of Naloxone distribution centers in Illinois for a research project on the opioid crisis.
The Tableau-generated dashboard is accessible here, from the Department of Public Health: https://idph.illinois.gov/OpioidDataDashboard/
I've tried everything I could find. First, changing the URL to "download" the data through Tableau's interface; that only let me download a PDF of the map, not the actual dataset behind it. Second, I modified a Python script I've seen a few times on Stack Overflow to request the data, but I think it runs into some kind of error. Code below.
url = "https://interactive.data.illinois.gov/t/DPH/views/opioidTDWEB_prod/NaloxoneDistributionLocations"
r = requests.get(
url,
params= {
":embed":"y",
":showAppBanner":"false",
":showShareOptions":"true",
":display_count":"no",
"showVizHome": "no"
}
)
soup = BeautifulSoup(r.text, "html.parser")
print(soup)
tableauData = json.loads(soup.find("textarea",{"id": "tsConfigContainer"}).text)
dataUrl = f'https://tableau.ons.org.br{tableauData["vizql_root"]}/bootstrapSession/sessions/{tableauData["sessionid"]}'
r = requests.post(dataUrl, data= {
"sheet_id": tableauData["sheetId"],
})
dataReg = re.search('\d+;({.*})\d+;({.*})', r.text, re.MULTILINE)
info = json.loads(dataReg.group(1))
data = json.loads(dataReg.group(2))
print(data["secondaryInfo"]["presModelMap"]["dataDictionary"]["presModelHolder"]["genDataDictionaryPresModel"]["dataSegments"]["0"]["dataColumns"])
Appreciate any help.
Answer 1:
It's a bit complex since there is a combination of the following:
- the Tableau "configuration page" (the one that holds the tsConfigContainer textarea) is not part of the original page; its url is built dynamically from some param html tags (see the sketch after this list)
- it uses a cross-site request forgery token in cookies, but in order to get that cookie you need to call a specific api whose url is also built dynamically from those param html tags
- from the tsconfig parameters, we can build the data url, as you've found out in other Stack Overflow posts such as this, this and this
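A minimal sketch of that first moving piece, scraping the param tags into a dict. It assumes the embedding page still exposes a div with class tableauPlaceholder and param names such as host_url, site_root and name (those names come from the page as it existed when this answer was written):

import requests
from bs4 import BeautifulSoup

r = requests.get("https://idph.illinois.gov/OpioidDataDashboard/")
soup = BeautifulSoup(r.text, "html.parser")

# collect every <param> under the tableauPlaceholder div into a name -> value dict
placeholder = soup.find("div", {"class": "tableauPlaceholder"})
paramTags = {t["name"]: t["value"] for t in placeholder.findAll("param")}
print(paramTags.get("host_url"), paramTags.get("site_root"), paramTags.get("name"))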
The flow is the following:
- call GET https://idph.illinois.gov/OpioidDataDashboard/ and scrape the param tags under the div with class tableauPlaceholder. From there the host is: https://interactive.data.illinois.gov
- from the former param tags, build the "session URL", which looks like this: GET /trusted/{ticket}/t/DPH/views/opioidTDWEB_prod/MortalityandMorbidity. This url is only used to store the cookies (including the xsrf token)
- from the former param tags, build the "configuration URL", which looks like this: GET /t/DPH/views/opioidTDWEB_prod/MortalityandMorbidity. Extract the textarea with id tsConfigContainer and parse the json from it
- build the "data url" from the json extracted above; the url looks like this: POST /vizql/t/DPH/w/opioidTDWEB_prod/v/MortalityandMorbidity/bootstrapSession/sessions/{session_id}
- you then get a json response with some string in front of it to prevent json hijacking; you need a regex to extract it and then parse the huge json data (a small sketch of this extraction follows the list)
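A small sketch of that last step only, assuming resp_text holds the raw body of the bootstrapSession POST. The body carries two json documents, each preceded by a length prefix and a semicolon, which is why the regex captures two groups:

import json
import re

def split_bootstrap_response(resp_text):
    # the body looks like "<len>;{...json...}<len>;{...json...}"
    m = re.search(r"\d+;({.*})\d+;({.*})", resp_text, re.MULTILINE)
    info = json.loads(m.group(1))   # viz/session metadata
    data = json.loads(m.group(2))   # the presModel payload holding the data dictionary
    return info, data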
All the urls needed look like this:
GET https://idph.illinois.gov/OpioidDataDashboard/
GET https://interactive.data.illinois.gov/trusted/yIm7jkXyRQuH9Ff1oPvz_w==:790xMcZuwmnvijXHg6ymRTrU/t/DPH/views/opioidTDWEB_prod/MortalityandMorbidity
GET https://interactive.data.illinois.gov/t/DPH/views/opioidTDWEB_prod/MortalityandMorbidity
POST https://interactive.data.illinois.gov/vizql/t/DPH/w/opioidTDWEB_prod/v/MortalityandMorbidity/bootstrapSession/sessions/2A3E3BA96A6C4E65B36AEDB4A536D09F-1:0
The full code:
import requests
from bs4 import BeautifulSoup
import json
import re

s = requests.Session()

# 1. get the embedding page and scrape the <param> tags under the tableauPlaceholder div
init_url = "https://idph.illinois.gov/OpioidDataDashboard/"
print(f"GET {init_url}")
r = s.get(init_url)
soup = BeautifulSoup(r.text, "html.parser")
paramTags = {
    t["name"]: t["value"]
    for t in soup.find("div", {"class": "tableauPlaceholder"}).findAll("param")
}

# 2. call the trusted-ticket url only to store the cookies (including the xsrf token)
session_url = f'{paramTags["host_url"]}trusted/{paramTags["ticket"]}{paramTags["site_root"]}/views/{paramTags["name"]}'
print(f"GET {session_url}")
r = s.get(session_url)

# 3. call the configuration url and parse the json in the tsConfigContainer textarea
config_url = f'{paramTags["host_url"][:-1]}{paramTags["site_root"]}/views/{paramTags["name"]}'
print(f"GET {config_url}")
r = s.get(
    config_url,
    params={
        ":embed": "y",
        ":showVizHome": "no",
        ":host_url": "https://interactive.data.illinois.gov/",
        ":embed_code_version": 2,
        ":tabs": "yes",
        ":toolbar": "no",
        ":showShareOptions": "false",
        ":display_spinner": "no",
        ":loadOrderID": 0,
    },
)
soup = BeautifulSoup(r.text, "html.parser")
tableauData = json.loads(soup.find("textarea", {"id": "tsConfigContainer"}).text)

# 4. build the data url and start the bootstrap session
dataUrl = f'{paramTags["host_url"][:-1]}{tableauData["vizql_root"]}/bootstrapSession/sessions/{tableauData["sessionid"]}'
print(f"POST {dataUrl}")
r = s.post(dataUrl, data={
    "sheet_id": tableauData["sheetId"],
})

# 5. strip the anti-json-hijacking prefix and parse the two json documents
dataReg = re.search(r'\d+;({.*})\d+;({.*})', r.text, re.MULTILINE)
info = json.loads(dataReg.group(1))
data = json.loads(dataReg.group(2))
print(data["secondaryInfo"]["presModelMap"]["dataDictionary"]["presModelHolder"]["genDataDictionaryPresModel"]["dataSegments"]["0"]["dataColumns"])
Try this on repl.it
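As a rough follow-up sketch (not part of the original answer), the dataColumns value printed above is, in my understanding, a list of dicts, each with a dataType and its de-duplicated dataValues; grouping the values by type usually makes it easy to spot the center names and addresses in the string bucket. Mapping values back to specific worksheet columns would require digging further into presModelMap, which is out of scope here:

columns = data["secondaryInfo"]["presModelMap"]["dataDictionary"] \
    ["presModelHolder"]["genDataDictionaryPresModel"]["dataSegments"]["0"]["dataColumns"]

# group every value by its dataType so the text columns are easy to scan
by_type = {}
for col in columns:
    by_type.setdefault(col["dataType"], []).extend(col["dataValues"])

for data_type, values in by_type.items():
    print(data_type, values[:5])  # preview the first few values of each type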
Source: https://stackoverflow.com/questions/63025296/scraping-data-from-a-tableau-map