How to scrape a Tableau dashboard in which data is only displayed in a plot after clicking in a map?

北海茫月 2021-01-27 00:36

I am trying to scrape data from this public Tableau dashboard. I am interested in the data plotted in the time series. If I click on a specific state in the map, the time series changes.

1 answer
  •  -上瘾入骨i
    2021-01-27 01:33

    When you click on the map, it triggers a call to:

    POST https://public.tableau.com/{vizql_root}/sessions/{session_id}/commands/tabdoc/select
    

    with form data like the following:

    worksheet: map_state_mobile
    dashboard: Visão Geral
    selection: {"objectIds":[17],"selectionType":"tuples"}
    selectOptions: select-options-simple
    

    The form data carries the state index (here 17) and the worksheet name. I've noticed that the worksheet name is either map_state_mobile or map_state (2) when you click a state.

    So you need to:

    • get the list of state names, to pick the correct index for the state you want
    • call the endpoint above to select the state, then extract the data

    Extract the field values (state names)

    The states are sorted in reverse alphabetical order, so the method below is not necessary if you are OK with hardcoding them in this order:

    ['Tocantins', 'Sergipe', 'São Paulo', 'Santa Catarina', 'Roraima', 'Rondônia', 'Rio Grande do Sul', 'Rio Grande do Norte', 'Rio de Janeiro', 'Piauí', 'Pernambuco', 'Paraná', 'Paraíba', 'Pará', 'Minas Gerais', 'Mato Grosso do Sul', 'Mato Grosso', 'Maranhão', 'Goiás', 'Espírito Santo', 'Distrito Federal', 'Ceará', 'Bahia', 'Amazonas', 'Amapá', 'Alagoas', 'Acre']
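
    With the hardcoded list, a state can be picked by name instead of by number. The sketch below assumes that the value sent in objectIds is the 1-based position in this list, which is how the full script further down uses it; STATE_NAMES and state_object_id are illustrative names, not part of Tableau.

```python
# State names in the order listed above. The 1-based position is ASSUMED
# to be the value expected in "objectIds" (as in the full script below).
STATE_NAMES = [
    'Tocantins', 'Sergipe', 'São Paulo', 'Santa Catarina', 'Roraima',
    'Rondônia', 'Rio Grande do Sul', 'Rio Grande do Norte', 'Rio de Janeiro',
    'Piauí', 'Pernambuco', 'Paraná', 'Paraíba', 'Pará', 'Minas Gerais',
    'Mato Grosso do Sul', 'Mato Grosso', 'Maranhão', 'Goiás',
    'Espírito Santo', 'Distrito Federal', 'Ceará', 'Bahia', 'Amazonas',
    'Amapá', 'Alagoas', 'Acre',
]


def state_object_id(name):
    """Return the 1-based position of `name` in STATE_NAMES (hypothetical helper)."""
    return STATE_NAMES.index(name) + 1


print(state_object_id("Mato Grosso"))  # 17
```

    The list is kept as a literal on purpose: Python's default string sort compares code points, so sorted(..., reverse=True) would not reproduce this order for names with accents or mixed case.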
    

    Otherwise, if you don't want to hardcode them (for other Tableau use cases), use the method below.

    Extracting the state name list is not straightforward, since the data is presented as follows:

    {
         "secondaryInfo": {
             "presModelMap": {
                "dataDictionary": {
                    "presModelHolder": {
                        "genDataDictionaryPresModel": {
                            "dataSegments": {
                                "0": {
                                    "dataColumns": []
                                }
                            }
                        }
                    }
                },
                 "vizData": {
                         "presModelHolder": {
                             "genPresModelMapPresModel": {
                                 "presModelMap": {
                                     "map_state (2)": {},
                                     "map_state_mobile": {},
                                     "time_line_BR": {},
                                     "time_line_BR_mobile": {},
                                     "total de casos": {},
                                     "total de mortes": {}
                                 }
                             }
                         }
                 }
             }
         }
    }
    

    My method is to go into "vizData", then into a worksheet inside presModelMap, which has the following structure:

    "presModelHolder": {
        "genVizDataPresModel": {
            "vizColumns": [],
            "paneColumnsData": {
                "vizDataColumns": [],
                "paneColumnsList": []
            }
        }
    }
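
    The navigation down to one worksheet can be sketched as two small helpers. The key names are the ones shown in the structures above; the function names and the sample dictionary are made up for illustration and stand in for the real bootstrap payload.

```python
def _worksheets(data):
    """Return the worksheet map nested under vizData (keys as shown above)."""
    return (data["secondaryInfo"]["presModelMap"]["vizData"]
            ["presModelHolder"]["genPresModelMapPresModel"]["presModelMap"])


def worksheet_names(data):
    """List the worksheet names available in the payload."""
    return list(_worksheets(data))


def viz_data_model(data, worksheet):
    """Return the genVizDataPresModel object of one worksheet."""
    return _worksheets(data)[worksheet]["presModelHolder"]["genVizDataPresModel"]


# Tiny made-up payload with the same nesting, for demonstration only.
sample = {"secondaryInfo": {"presModelMap": {"vizData": {"presModelHolder": {
    "genPresModelMapPresModel": {"presModelMap": {
        "map_state (2)": {"presModelHolder": {"genVizDataPresModel": {
            "vizColumns": [],
            "paneColumnsData": {"vizDataColumns": [], "paneColumnsList": []},
        }}},
        "time_line_BR": {},
    }}}}}}}

print(worksheet_names(sample))
print(sorted(viz_data_model(sample, "map_state (2)")["paneColumnsData"]))
```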
    

    vizDataColumns holds a collection of objects with a localBaseColumnName property. Find the one whose localBaseColumnName is [state_name] and whose fieldRole is measure:

    {
        "fn": "[federated.124ags61tmhyti14im1010h1elsu].[attr:state_name:nk]",
        "fnDisagg": "",
        "localBaseColumnName": "[state_name]", <============================= MATCH THIS
        "baseColumnName": "[federated.124ags61tmhyti14im1010h1elsu].[state_name]",
        "fieldCaption": "ATTR(State Name)",
        "formatStrings": [],
        "datasourceCaption": "federated.124ags61tmhyti14im1010h1elsu",
        "dataType": "cstring",
        "aggregation": "attr",
        "stringCollation": {
            "name": "LEN_RUS_S2",
            "charsetId": 0
        },
        "fieldRole": "measure", <=========================================== MATCH THIS
        "isAutoSelect": true,
        "paneIndices": [
            0  <=========================================== EXTRACT THIS
        ],
        "columnIndices": [
            7  <=========================================== EXTRACT THIS
        ]
    } 
    

    paneIndices gives the index into the paneColumnsList array, and columnIndices gives the index into the vizPaneColumns array, which is located inside the selected paneColumnsList item.
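
    As a rough sketch of this lookup: match the column, then follow its first paneIndices and columnIndices entries to the valueIndices list (the key read by the full script further down). find_value_indices and the sample model are hypothetical, and the role to match is passed in as a parameter rather than fixed.

```python
def find_value_indices(viz_model, local_name, field_role):
    """Return the valueIndices of the column matching name and role.

    `viz_model` is a genVizDataPresModel object. Only the first entry of
    paneIndices and columnIndices is used, as in the full script below.
    """
    pane_data = viz_model["paneColumnsData"]
    for column in pane_data["vizDataColumns"]:
        if (column.get("localBaseColumnName") == local_name
                and column.get("fieldRole") == field_role):
            pane = pane_data["paneColumnsList"][column["paneIndices"][0]]
            pane_column = pane["vizPaneColumns"][column["columnIndices"][0]]
            return pane_column["valueIndices"]
    raise LookupError(f"no column {local_name!r} with role {field_role!r}")


# Made-up model with the same shape, for demonstration only.
sample_model = {"paneColumnsData": {
    "vizDataColumns": [
        {"fieldRole": "measure", "paneIndices": [0], "columnIndices": [0]},
        {"localBaseColumnName": "[state_name]", "fieldRole": "dimension",
         "paneIndices": [0], "columnIndices": [1]},
    ],
    "paneColumnsList": [
        {"vizPaneColumns": [{"valueIndices": [5, 6]},
                            {"valueIndices": [222, 223, 224]}]},
    ],
}}

print(find_value_indices(sample_model, "[state_name]", "dimension"))
```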

    From there you get the indices to look up, like this:

    [222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248]
    

    In the dataDictionary object, get the dataValues (as you've extracted in your question) and extract the state names at the indices above.

    This gives the state list:

    ['Tocantins', 'Sergipe', 'São Paulo', 'Santa Catarina', 'Roraima', 'Rondônia', 'Rio Grande do Sul', 'Rio Grande do Norte', 'Rio de Janeiro', 'Piauí', 'Pernambuco', 'Paraná', 'Paraíba', 'Pará', 'Minas Gerais', 'Mato Grosso do Sul', 'Mato Grosso', 'Maranhão', 'Goiás', 'Espírito Santo', 'Distrito Federal', 'Ceará', 'Bahia', 'Amazonas', 'Amapá', 'Alagoas', 'Acre']
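
    The dictionary lookup itself is short. This sketch assumes, like the full script further down, that the first entry of dataColumns with the matching dataType holds the values; lookup_values and the sample columns are invented for illustration.

```python
def lookup_values(data_columns, data_type, indices):
    """Resolve `indices` against the first dataColumns entry of `data_type`."""
    for column in data_columns:
        if column["dataType"] == data_type:
            values = column["dataValues"]
            return [values[i] for i in indices]
    raise LookupError(f"no data column of type {data_type!r}")


# Made-up dataColumns list with the same shape, for demonstration only.
sample_columns = [
    {"dataType": "integer", "dataValues": [10, 20, 30]},
    {"dataType": "cstring", "dataValues": ["Acre", "Alagoas", "Amapá"]},
]

print(lookup_values(sample_columns, "cstring", [2, 1, 0]))
```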
    

    Call the select endpoint

    You just need the worksheet name and the index of the field (the state's index in the list above):

    r = requests.post(f'{data_host}{tableauData["vizql_root"]}/sessions/{tableauData["sessionid"]}/commands/tabdoc/select',
        data = {
        "worksheet": "map_state (2)",
        "dashboard": "Visão Geral",
        "selection": json.dumps({
            "objectIds":[int(selected_index)],
            "selectionType":"tuples"
        }),
        "selectOptions": "select-options-simple"
    })
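
    Since the worksheet can be either map_state (2) or map_state_mobile, it may be convenient to build the form with a small helper. build_select_form is a hypothetical name; the field names and values are the ones from the form data shown earlier, and selection is sent as a JSON string.

```python
import json


def build_select_form(worksheet, dashboard, object_id):
    """Build the form fields for the tabdoc/select call (hypothetical helper)."""
    return {
        "worksheet": worksheet,
        "dashboard": dashboard,
        # "selection" is a JSON-encoded string, not a nested form field
        "selection": json.dumps({
            "objectIds": [int(object_id)],
            "selectionType": "tuples",
        }),
        "selectOptions": "select-options-simple",
    }


form = build_select_form("map_state_mobile", "Visão Geral", 17)
print(form["selection"])
```

    The resulting dictionary can be passed as data= to requests.post in the same way as the inline dictionary above.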
    

    The code below extracts the Tableau data, extracts the state names with the method above (not necessary if you prefer to hardcode the list), prompts the user for a state index, calls the select endpoint and extracts the data for that state:

    import requests
    from bs4 import BeautifulSoup
    import json
    import re
    
    data_host = "https://public.tableau.com"
    
    # load the viz page; it embeds the session config (vizql_root, sessionid, sheetId)
    r = requests.get(
        f"{data_host}/views/MKTScoredeisolamentosocial/VisoGeral",
        params= {
            ":showVizHome":"no"
        }
    )
    soup = BeautifulSoup(r.text, "html.parser")
    tableauData = json.loads(soup.find("textarea",{"id": "tsConfigContainer"}).text)
    dataUrl = f'{data_host}{tableauData["vizql_root"]}/bootstrapSession/sessions/{tableauData["sessionid"]}'
    r = requests.post(dataUrl, data= {
        "sheet_id": tableauData["sheetId"],
    })
    
    dataReg = re.search(r'\d+;({.*})\d+;({.*})', r.text, re.MULTILINE)
    info = json.loads(dataReg.group(1))
    data = json.loads(dataReg.group(2))
    
    stateIndexInfo = [ 
        (t["fieldRole"], {
            "paneIndices": t["paneIndices"][0], 
            "columnIndices": t["columnIndices"][0], 
            "dataType": t["dataType"]
        }) 
        for t in data["secondaryInfo"]["presModelMap"]["vizData"]["presModelHolder"]["genPresModelMapPresModel"]["presModelMap"]["map_state (2)"]["presModelHolder"]["genVizDataPresModel"]["paneColumnsData"]["vizDataColumns"]
        if t.get("localBaseColumnName") and t["localBaseColumnName"] == "[state_name]"
    ]
    
    stateNameIndexInfo = [t[1] for t in stateIndexInfo if t[0] == 'dimension'][0]
    
    panelColumnList = data["secondaryInfo"]["presModelMap"]["vizData"]["presModelHolder"]["genPresModelMapPresModel"]["presModelMap"]["map_state (2)"]["presModelHolder"]["genVizDataPresModel"]["paneColumnsData"]["paneColumnsList"]
    stateNameIndices = panelColumnList[stateNameIndexInfo["paneIndices"]]["vizPaneColumns"][stateNameIndexInfo["columnIndices"]]["valueIndices"]
    
    # print [222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248]
    #print(stateNameIndices)
    
    dataValues = [
        t
        for t in data["secondaryInfo"]["presModelMap"]["dataDictionary"]["presModelHolder"]["genDataDictionaryPresModel"]["dataSegments"]["0"]["dataColumns"]
        if t["dataType"] == stateNameIndexInfo["dataType"]
    ][0]["dataValues"]
    
    stateNames = [dataValues[t] for t in stateNameIndices]
    
    # print ['Tocantins', 'Sergipe', 'São Paulo', 'Santa Catarina', 'Roraima', 'Rondônia', 'Rio Grande do Sul', 'Rio Grande do Norte', 'Rio de Janeiro', 'Piauí', 'Pernambuco', 'Paraná', 'Paraíba', 'Pará', 'Minas Gerais', 'Mato Grosso do Sul', 'Mato Grosso', 'Maranhão', 'Goiás', 'Espírito Santo', 'Distrito Federal', 'Ceará', 'Bahia', 'Amazonas', 'Amapá', 'Alagoas', 'Acre']
    #print(stateNames)
    
    for idx, val in enumerate(stateNames):
        print(f"{val} - {idx+1}")
    
    selected_index = input("Please select a state by indices : ")
    print(f"selected : {stateNames[int(selected_index)-1]}")
    
    r = requests.post(f'{data_host}{tableauData["vizql_root"]}/sessions/{tableauData["sessionid"]}/commands/tabdoc/select',
        data = {
        "worksheet": "map_state (2)",
        "dashboard": "Visão Geral",
        "selection": json.dumps({
            "objectIds":[int(selected_index)],
            "selectionType":"tuples"
        }),
        "selectOptions": "select-options-simple"
    })
    
    dataSegments = r.json()["vqlCmdResponse"]["layoutStatus"]["applicationPresModel"]["dataDictionary"]["dataSegments"]
    print(dataSegments[max(dataSegments, key=int)]["dataColumns"])
    

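
    The dataReg step in the script above splits the bootstrap response, which holds two JSON documents each preceded by a number and a semicolon. Here is the same pattern as a standalone sketch; split_bootstrap_response is a hypothetical name, and the sample text and its numeric prefixes are made up.

```python
import json
import re


def split_bootstrap_response(text):
    """Split '<n>;{json}<n>;{json}' into two parsed objects (info, data)."""
    match = re.search(r"\d+;({.*})\d+;({.*})", text, re.MULTILINE)
    if match is None:
        raise ValueError("response does not look like a bootstrap payload")
    return json.loads(match.group(1)), json.loads(match.group(2))


# Made-up text with the same layout, for demonstration only.
sample_text = '8;{"a": 1}8;{"b": 2}'
info, data = split_bootstrap_response(sample_text)
print(info, data)
```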

    The code with the hardcoded state name list is more straightforward:

    import requests
    from bs4 import BeautifulSoup
    import json
    
    data_host = "https://public.tableau.com"
    
    r = requests.get(
        f"{data_host}/views/MKTScoredeisolamentosocial/VisoGeral",
        params= {
            ":showVizHome":"no"
        }
    )
    soup = BeautifulSoup(r.text, "html.parser")
    tableauData = json.loads(soup.find("textarea",{"id": "tsConfigContainer"}).text)
    dataUrl = f'{data_host}{tableauData["vizql_root"]}/bootstrapSession/sessions/{tableauData["sessionid"]}'
    r = requests.post(dataUrl, data= {
        "sheet_id": tableauData["sheetId"],
    })
    stateNames = ['Tocantins', 'Sergipe', 'São Paulo', 'Santa Catarina', 'Roraima', 'Rondônia', 'Rio Grande do Sul', 'Rio Grande do Norte', 'Rio de Janeiro', 'Piauí', 'Pernambuco', 'Paraná', 'Paraíba', 'Pará', 'Minas Gerais', 'Mato Grosso do Sul', 'Mato Grosso', 'Maranhão', 'Goiás', 'Espírito Santo', 'Distrito Federal', 'Ceará', 'Bahia', 'Amazonas', 'Amapá', 'Alagoas', 'Acre']
    
    for idx, val in enumerate(stateNames):
        print(f"{val} - {idx+1}")
    
    selected_index = input("Please select a state by indices : ")
    print(f"selected : {stateNames[int(selected_index)-1]}")
    
    r = requests.post(f'{data_host}{tableauData["vizql_root"]}/sessions/{tableauData["sessionid"]}/commands/tabdoc/select',
        data = {
        "worksheet": "map_state (2)",
        "dashboard": "Visão Geral",
        "selection": json.dumps({
            "objectIds":[int(selected_index)],
            "selectionType":"tuples"
        }),
        "selectOptions": "select-options-simple"
    })
    
    dataSegments = r.json()["vqlCmdResponse"]["layoutStatus"]["applicationPresModel"]["dataDictionary"]["dataSegments"]
    print(dataSegments[max(dataSegments, key=int)]["dataColumns"])
    


    Note that, in this case, even though we don't care about the output of the first call (/bootstrapSession/sessions/{tableauData["sessionid"]}), it is still needed to validate the session_id before the select call (otherwise the select doesn't return anything).
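
    Both scripts read the result from the data segment with the highest key. As a sketch, that step can be isolated as below; latest_data_columns is a hypothetical name, the sample response is invented, and the keys are assumed to be numeric strings (like the "0" shown earlier), so they are compared as integers rather than as text.

```python
def latest_data_columns(response_json):
    """Return dataColumns of the data segment with the highest numeric key."""
    segments = (response_json["vqlCmdResponse"]["layoutStatus"]
                ["applicationPresModel"]["dataDictionary"]["dataSegments"])
    # ASSUMPTION: keys are numeric strings; compare them as integers so
    # that "10" ranks above "2"
    newest_key = max(segments, key=int)
    return segments[newest_key]["dataColumns"]


# Made-up response with the same nesting, for demonstration only.
sample_response = {"vqlCmdResponse": {"layoutStatus": {"applicationPresModel": {
    "dataDictionary": {"dataSegments": {
        "2": {"dataColumns": [{"dataType": "cstring", "dataValues": ["old"]}]},
        "10": {"dataColumns": [{"dataType": "cstring", "dataValues": ["new"]}]},
    }},
}}}}

print(latest_data_columns(sample_response))
```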
