Question
I am trying to pull locations and names of Naloxone distribution centers in Illinois for a research project on the opioid crisis.
The Tableau-generated dashboard is accessible here, from the Department of Public Health: https://idph.illinois.gov/OpioidDataDashboard/
I've tried everything I could find. First, changing the URL to "download" the data through Tableau's interface; that only let me download a PDF of the map, not the actual dataset behind it. Second, I modified a Python script I've seen a few times on Stack Overflow to request the data, but I think it runs into some kind of error. Code below.
url = "https://interactive.data.illinois.gov/t/DPH/views/opioidTDWEB_prod/NaloxoneDistributionLocations"
r = requests.get(
url,
params= {
":embed":"y",
":showAppBanner":"false",
":showShareOptions":"true",
":display_count":"no",
"showVizHome": "no"
}
)
soup = BeautifulSoup(r.text, "html.parser")
print(soup)
tableauData = json.loads(soup.find("textarea",{"id": "tsConfigContainer"}).text)
dataUrl = f'https://tableau.ons.org.br{tableauData["vizql_root"]}/bootstrapSession/sessions/{tableauData["sessionid"]}'
r = requests.post(dataUrl, data= {
"sheet_id": tableauData["sheetId"],
})
dataReg = re.search('\d+;({.*})\d+;({.*})', r.text, re.MULTILINE)
info = json.loads(dataReg.group(1))
data = json.loads(dataReg.group(2))
print(data["secondaryInfo"]["presModelMap"]["dataDictionary"]["presModelHolder"]["genDataDictionaryPresModel"]["dataSegments"]["0"]["dataColumns"])
Appreciate any help.
Answer 1:
It's a bit complex since there is a combination of the following:
- the Tableau "configuration page" (the one that holds the tsConfigContainer textarea) is not part of the original page; its url is built dynamically from some param html tags (see the sketch after this list)
- it uses a cross-site request forgery token in cookies, but in order to get that cookie you need to call a specific api whose url is also built dynamically from those param html tags
- from the tsconfig parameters, we can build the data url, as you've found out in other Stack Overflow posts such as this, this and this
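A minimal sketch of that first moving piece, scraping the param tags into a dict. It assumes the embedding page still exposes a div with class tableauPlaceholder and param names such as host_url, site_root and name (those names come from the page as it existed when this answer was written):

import requests
from bs4 import BeautifulSoup

r = requests.get("https://idph.illinois.gov/OpioidDataDashboard/")
soup = BeautifulSoup(r.text, "html.parser")

# collect every <param> under the tableauPlaceholder div into a name -> value dict
placeholder = soup.find("div", {"class": "tableauPlaceholder"})
paramTags = {t["name"]: t["value"] for t in placeholder.findAll("param")}
print(paramTags.get("host_url"), paramTags.get("site_root"), paramTags.get("name"))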
The flow is the following:
- call GET https://idph.illinois.gov/OpioidDataDashboard/ and scrape the param tags under the div with class tableauPlaceholder. From there the host is: https://interactive.data.illinois.gov
- from the former param tags, build the "session URL", which looks like this: GET /trusted/{ticket}/t/DPH/views/opioidTDWEB_prod/MortalityandMorbidity. This url is only used to store the cookies (including the xsrf token)
- from the former param tags, build the "configuration URL", which looks like this: GET /t/DPH/views/opioidTDWEB_prod/MortalityandMorbidity. Extract the textarea with id tsConfigContainer and parse the json from it
- build the "data url" from the json extracted above; the url looks like this: POST /vizql/t/DPH/w/opioidTDWEB_prod/v/MortalityandMorbidity/bootstrapSession/sessions/{session_id}
- you then get a json response with some string in front of it to prevent json hijacking; you need a regex to extract it and then parse the huge json data (a small sketch of this extraction follows the list)
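A small sketch of that last step only, assuming resp_text holds the raw body of the bootstrapSession POST. The body carries two json documents, each preceded by a length prefix and a semicolon, which is why the regex captures two groups:

import json
import re

def split_bootstrap_response(resp_text):
    # the body looks like "<len>;{...json...}<len>;{...json...}"
    m = re.search(r"\d+;({.*})\d+;({.*})", resp_text, re.MULTILINE)
    info = json.loads(m.group(1))   # viz/session metadata
    data = json.loads(m.group(2))   # the presModel payload holding the data dictionary
    return info, data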
All the urls needed look like this:
GET https://idph.illinois.gov/OpioidDataDashboard/
GET https://interactive.data.illinois.gov/trusted/yIm7jkXyRQuH9Ff1oPvz_w==:790xMcZuwmnvijXHg6ymRTrU/t/DPH/views/opioidTDWEB_prod/MortalityandMorbidity
GET https://interactive.data.illinois.gov/t/DPH/views/opioidTDWEB_prod/MortalityandMorbidity
POST https://interactive.data.illinois.gov/vizql/t/DPH/w/opioidTDWEB_prod/v/MortalityandMorbidity/bootstrapSession/sessions/2A3E3BA96A6C4E65B36AEDB4A536D09F-1:0
The full code:
import requests
from bs4 import BeautifulSoup
import json
import re

s = requests.Session()

# 1. get the embedding page and scrape the <param> tags under the tableauPlaceholder div
init_url = "https://idph.illinois.gov/OpioidDataDashboard/"
print(f"GET {init_url}")
r = s.get(init_url)
soup = BeautifulSoup(r.text, "html.parser")
paramTags = {
    t["name"]: t["value"]
    for t in soup.find("div", {"class": "tableauPlaceholder"}).findAll("param")
}

# 2. call the trusted-ticket url only to store the cookies (including the xsrf token)
session_url = f'{paramTags["host_url"]}trusted/{paramTags["ticket"]}{paramTags["site_root"]}/views/{paramTags["name"]}'
print(f"GET {session_url}")
r = s.get(session_url)

# 3. call the configuration url and parse the json in the tsConfigContainer textarea
config_url = f'{paramTags["host_url"][:-1]}{paramTags["site_root"]}/views/{paramTags["name"]}'
print(f"GET {config_url}")
r = s.get(
    config_url,
    params={
        ":embed": "y",
        ":showVizHome": "no",
        ":host_url": "https://interactive.data.illinois.gov/",
        ":embed_code_version": 2,
        ":tabs": "yes",
        ":toolbar": "no",
        ":showShareOptions": "false",
        ":display_spinner": "no",
        ":loadOrderID": 0,
    },
)
soup = BeautifulSoup(r.text, "html.parser")
tableauData = json.loads(soup.find("textarea", {"id": "tsConfigContainer"}).text)

# 4. build the data url and start the bootstrap session
dataUrl = f'{paramTags["host_url"][:-1]}{tableauData["vizql_root"]}/bootstrapSession/sessions/{tableauData["sessionid"]}'
print(f"POST {dataUrl}")
r = s.post(dataUrl, data={
    "sheet_id": tableauData["sheetId"],
})

# 5. strip the anti-json-hijacking prefix and parse the two json documents
dataReg = re.search(r'\d+;({.*})\d+;({.*})', r.text, re.MULTILINE)
info = json.loads(dataReg.group(1))
data = json.loads(dataReg.group(2))
print(data["secondaryInfo"]["presModelMap"]["dataDictionary"]["presModelHolder"]["genDataDictionaryPresModel"]["dataSegments"]["0"]["dataColumns"])
Try this on repl.it
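As a rough follow-up sketch (not part of the original answer), the dataColumns value printed above is, in my understanding, a list of dicts, each with a dataType and its de-duplicated dataValues; grouping the values by type usually makes it easy to spot the center names and addresses in the string bucket. Mapping values back to specific worksheet columns would require digging further into presModelMap, which is out of scope here:

columns = data["secondaryInfo"]["presModelMap"]["dataDictionary"] \
    ["presModelHolder"]["genDataDictionaryPresModel"]["dataSegments"]["0"]["dataColumns"]

# group every value by its dataType so the text columns are easy to scan
by_type = {}
for col in columns:
    by_type.setdefault(col["dataType"], []).extend(col["dataValues"])

for data_type, values in by_type.items():
    print(data_type, values[:5])  # preview the first few values of each type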
Source: https://stackoverflow.com/questions/63025296/scraping-data-from-a-tableau-map