How to scrape a Tableau dashboard in which data is only displayed in a plot after clicking in a map?

放肆的年华 提交于 2020-11-29 14:01:51

问题


I am trying to scrape data from this public Tableau dashboard. The ineterest is in the time series plotted data. If i click in a spcific state in the map, the time series changes to that specific state. Following this and this posts I got the results for the time series aggregated at the country-level (with the code provided below). But my interest is in a state-level data.

import requests
from bs4 import BeautifulSoup
import json
import re

# get the second tableau link
r = requests.get(
    f"https://public.tableau.com/views/MKTScoredeisolamentosocial/VisoGeral",
    params= {
        ":showVizHome":"no"
    }
)
soup = BeautifulSoup(r.text, "html.parser")
tableauData = json.loads(soup.find("textarea",{"id": "tsConfigContainer"}).text)
dataUrl = f'https://public.tableau.com{tableauData["vizql_root"]}/bootstrapSession/sessions/{tableauData["sessionid"]}'
r = requests.post(dataUrl, data= {
    "sheet_id": tableauData["sheetId"],

})

dataReg = re.search('\d+;({.*})\d+;({.*})', r.text, re.MULTILINE)
info = json.loads(dataReg.group(1))
data = json.loads(dataReg.group(2))


print(data["secondaryInfo"]["presModelMap"]["dataDictionary"]["presModelHolder"]["genDataDictionaryPresModel"]["dataSegments"]["0"]["dataColumns"])

I researched about Tableau categories and found out that some parameters can be inserted in the URL to get desirible results, but I was unable to find such parameters. I noticed that the data I want is stored in a worksheet named "time_line_BR", where BR stands for Brazil. But I would like to change this for the states, e.g. São Paulo (SP). I also noted some parameters in tableauData, like "current_view_id", that I suspect can be related to the data being loaded in the time series.

Is is possible to post a request where the data pulled is the same as the one I see in the plots when I manually select a specific state?


回答1:


When you click on the map, it triggers a call on :

POST https://public.tableau.com/{vizql_root}/sessions/{session_id}/commands/tabdoc/select

with some form data like the following :

worksheet: map_state_mobile
dashboard: Visão Geral
selection: {"objectIds":[17],"selectionType":"tuples"}
selectOptions: select-options-simple

It has the state index (here 17) and the worksheet name. I've noticed that the sheet name is either map_state_mobile or map_state (2) when you click a state.

So, it's necessary to :

  • get the state name list to pick a correct index for the state to choose
  • make a call the API above to select the state and extract the data

Extract the field values (state names)

The state are sorted alphabetically (reversed) so the method below is not necessary if you are ok with hardcoding them and sort them like this :

['Tocantins', 'Sergipe', 'São Paulo', 'Santa Catarina', 'Roraima', 'Rondônia', 'Rio Grande do Sul', 'Rio Grande do Norte', 'Rio de Janeiro', 'Piauí', 'Pernambuco', 'Paraná', 'Paraíba', 'Pará', 'Minas Gerais', 'Mato Grosso do Sul', 'Mato Grosso', 'Maranhão', 'Goiás', 'Espírito Santo', 'Distrito Federal', 'Ceará', 'Bahia', 'Amazonas', 'Amapá', 'Alagoas', 'Acre']

In other case when we don't want to harcode them (for other tableau usecase), execute the method below :

Extracting the state name list is not straightforward since the data is presented as following :

{
     "secondaryInfo": {
         "presModelMap": {
            "dataDictionary": {
                "presModelHolder": {
                    "genDataDictionaryPresModel": {
                        "dataSegments": {
                            "0": {
                                "dataColumns": []
                            }
                        }
                    }
                }
            },
             "vizData": {
                     "presModelHolder": {
                         "genPresModelMapPresModel": {
                             "presModelMap": {
                                 "map_state (2)": {},
                                 "map_state_mobile": {},
                                 "time_line_BR": {},
                                 "time_line_BR_mobile": {},
                                 "total de casos": {},
                                 "total de mortes": {}
                             }
                         }
                     }
             }
         }
     }
}

My method is to get into "vizData" and into a worksheet inside presModelMap which has the following structure :

"presModelHolder": {
    "genVizDataPresModel": {
        "vizColumns": [],
        "paneColumnsData": {
            "vizDataColumns": [],
            "paneColumnsList": []
        }
    }
}

vizDataColumns has a collection of object with property localBaseColumnName. Find the localBaseColumnName with value [state_name] with fieldRole as measure :

{
    "fn": "[federated.124ags61tmhyti14im1010h1elsu].[attr:state_name:nk]",
    "fnDisagg": "",
    "localBaseColumnName": "[state_name]", <============================= MATCH THIS
    "baseColumnName": "[federated.124ags61tmhyti14im1010h1elsu].[state_name]",
    "fieldCaption": "ATTR(State Name)",
    "formatStrings": [],
    "datasourceCaption": "federated.124ags61tmhyti14im1010h1elsu",
    "dataType": "cstring",
    "aggregation": "attr",
    "stringCollation": {
        "name": "LEN_RUS_S2",
        "charsetId": 0
    },
    "fieldRole": "measure", <=========================================== MATCH THIS
    "isAutoSelect": true,
    "paneIndices": [
        0  <=========================================== EXTRACT THIS
    ],
    "columnIndices": [
        7  <=========================================== EXTRACT THIS
    ]
} 

paneIndices match the index in the paneColumnsList array. And columnIndices match the index in the vizPaneColumns array. vizPaneColumns array is located just in the item selected in paneColumnsList array

From there you get the index to search like this :

[222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248]

In the dataDictionary object, get the dataValues (like you've extracted in your question) and extract the state name from the range above

Then you get the state list :

['Tocantins', 'Sergipe', 'São Paulo', 'Santa Catarina', 'Roraima', 'Rondônia', 'Rio Grande do Sul', 'Rio Grande do Norte', 'Rio de Janeiro', 'Piauí', 'Pernambuco', 'Paraná', 'Paraíba', 'Pará', 'Minas Gerais', 'Mato Grosso do Sul', 'Mato Grosso', 'Maranhão', 'Goiás', 'Espírito Santo', 'Distrito Federal', 'Ceará', 'Bahia', 'Amazonas', 'Amapá', 'Alagoas', 'Acre']

Call the select endpoint

You just need the worksheet name and the index of the field (state index in the list above)

r = requests.post(f'{data_host}{tableauData["vizql_root"]}/sessions/{tableauData["sessionid"]}/commands/tabdoc/select',
    data = {
    "worksheet": "map_state (2)",
    "dashboard": "Visão Geral",
    "selection": json.dumps({
        "objectIds":[int(selected_index)],
        "selectionType":"tuples"
    }),
    "selectOptions": "select-options-simple"
})

The code below extract the tableau data, extract the state name with the method above (not necessary if you prefer to hardcode the list), prompt user to enter state index, call the select endpoint and extract the data for this state :

import requests
from bs4 import BeautifulSoup
import json
import re

data_host = "https://public.tableau.com"

# get the second tableau link
r = requests.get(
    f"{data_host}/views/MKTScoredeisolamentosocial/VisoGeral",
    params= {
        ":showVizHome":"no"
    }
)
soup = BeautifulSoup(r.text, "html.parser")
tableauData = json.loads(soup.find("textarea",{"id": "tsConfigContainer"}).text)
dataUrl = f'{data_host}{tableauData["vizql_root"]}/bootstrapSession/sessions/{tableauData["sessionid"]}'
r = requests.post(dataUrl, data= {
    "sheet_id": tableauData["sheetId"],
})

dataReg = re.search('\d+;({.*})\d+;({.*})', r.text, re.MULTILINE)
info = json.loads(dataReg.group(1))
data = json.loads(dataReg.group(2))

stateIndexInfo = [ 
    (t["fieldRole"], {
        "paneIndices": t["paneIndices"][0], 
        "columnIndices": t["columnIndices"][0], 
        "dataType": t["dataType"]
    }) 
    for t in data["secondaryInfo"]["presModelMap"]["vizData"]["presModelHolder"]["genPresModelMapPresModel"]["presModelMap"]["map_state (2)"]["presModelHolder"]["genVizDataPresModel"]["paneColumnsData"]["vizDataColumns"]
    if t.get("localBaseColumnName") and t["localBaseColumnName"] == "[state_name]"
]

stateNameIndexInfo = [t[1] for t in stateIndexInfo if t[0] == 'dimension'][0]

panelColumnList = data["secondaryInfo"]["presModelMap"]["vizData"]["presModelHolder"]["genPresModelMapPresModel"]["presModelMap"]["map_state (2)"]["presModelHolder"]["genVizDataPresModel"]["paneColumnsData"]["paneColumnsList"]
stateNameIndices = panelColumnList[stateNameIndexInfo["paneIndices"]]["vizPaneColumns"][stateNameIndexInfo["columnIndices"]]["valueIndices"]

# print [222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248]
#print(stateNameIndices)

dataValues = [
    t
    for t in data["secondaryInfo"]["presModelMap"]["dataDictionary"]["presModelHolder"]["genDataDictionaryPresModel"]["dataSegments"]["0"]["dataColumns"]
    if t["dataType"] == stateNameIndexInfo["dataType"]
][0]["dataValues"]

stateNames = [dataValues[t] for t in stateNameIndices]

# print ['Tocantins', 'Sergipe', 'São Paulo', 'Santa Catarina', 'Roraima', 'Rondônia', 'Rio Grande do Sul', 'Rio Grande do Norte', 'Rio de Janeiro', 'Piauí', 'Pernambuco', 'Paraná', 'Paraíba', 'Pará', 'Minas Gerais', 'Mato Grosso do Sul', 'Mato Grosso', 'Maranhão', 'Goiás', 'Espírito Santo', 'Distrito Federal', 'Ceará', 'Bahia', 'Amazonas', 'Amapá', 'Alagoas', 'Acre']
#print(stateNames)

for idx, val in enumerate(stateNames):
    print(f"{val} - {idx+1}")

selected_index = input("Please select a state by indices : ")
print(f"selected : {stateNames[int(selected_index)-1]}")

r = requests.post(f'{data_host}{tableauData["vizql_root"]}/sessions/{tableauData["sessionid"]}/commands/tabdoc/select',
    data = {
    "worksheet": "map_state (2)",
    "dashboard": "Visão Geral",
    "selection": json.dumps({
        "objectIds":[int(selected_index)],
        "selectionType":"tuples"
    }),
    "selectOptions": "select-options-simple"
})

dataSegments = r.json()["vqlCmdResponse"]["layoutStatus"]["applicationPresModel"]["dataDictionary"]["dataSegments"]
print(dataSegments[max([*dataSegments])]["dataColumns"])

Try this on repl.it

The code with hardcoding of the state name list is more straightforward :

import requests
from bs4 import BeautifulSoup
import json

data_host = "https://public.tableau.com"

r = requests.get(
    f"{data_host}/views/MKTScoredeisolamentosocial/VisoGeral",
    params= {
        ":showVizHome":"no"
    }
)
soup = BeautifulSoup(r.text, "html.parser")
tableauData = json.loads(soup.find("textarea",{"id": "tsConfigContainer"}).text)
dataUrl = f'{data_host}{tableauData["vizql_root"]}/bootstrapSession/sessions/{tableauData["sessionid"]}'
r = requests.post(dataUrl, data= {
    "sheet_id": tableauData["sheetId"],
})
stateNames = ['Tocantins', 'Sergipe', 'São Paulo', 'Santa Catarina', 'Roraima', 'Rondônia', 'Rio Grande do Sul', 'Rio Grande do Norte', 'Rio de Janeiro', 'Piauí', 'Pernambuco', 'Paraná', 'Paraíba', 'Pará', 'Minas Gerais', 'Mato Grosso do Sul', 'Mato Grosso', 'Maranhão', 'Goiás', 'Espírito Santo', 'Distrito Federal', 'Ceará', 'Bahia', 'Amazonas', 'Amapá', 'Alagoas', 'Acre']

for idx, val in enumerate(stateNames):
    print(f"{val} - {idx+1}")

selected_index = input("Please select a state by indices : ")
print(f"selected : {stateNames[int(selected_index)-1]}")

r = requests.post(f'{data_host}{tableauData["vizql_root"]}/sessions/{tableauData["sessionid"]}/commands/tabdoc/select',
    data = {
    "worksheet": "map_state (2)",
    "dashboard": "Visão Geral",
    "selection": json.dumps({
        "objectIds":[int(selected_index)],
        "selectionType":"tuples"
    }),
    "selectOptions": "select-options-simple"
})

dataSegments = r.json()["vqlCmdResponse"]["layoutStatus"]["applicationPresModel"]["dataDictionary"]["dataSegments"]
print(dataSegments[max([*dataSegments])]["dataColumns"])

Try this on repl.it

Note that, in this case, even if we don't care about the output of the first call (/bootstrapSession/sessions/{tableauData["sessionid"]}). It's needed to validate the session_id and call the select call afterwards (otherwise the select doesn't return anything)



来源:https://stackoverflow.com/questions/63822827/how-to-scrape-a-tableau-dashboard-in-which-data-is-only-displayed-in-a-plot-afte

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!