Extracting raw data from a PowerPivot model using Python

前端未结

关注

 3  1034

What seemed like a trivial task turned into a real nightmare when I had to read in some data from a PowerPivot model using Python. I believe I\'ve researched this very well over

相关标签:

3条回答

孤独总比滥情好

2021-01-31 20:59

Lo and behold, I finally managed to crack the problem - turns out that accessing Power Pivot data using Python is indeed possible! Below's a short recap of what I did - you can find a more detailed description here: Analysis Services (SSAS) on a shoestring. Note: the code has been optimized neither for efficiency nor elegance.

Install Microsoft Power BI Desktop (comes with free Analysis Services server, so no need for a costly SQL Server license - however, the same approach obviously also works if you have a proper license).
Fire up the AS engine by first creating the msmdsrv.ini settings file, then restore the database from the ABF file (using AMO.NET), then extract data using ADOMD.NET.

Here's the Python code that illustrates the AS engine + AMO.NET parts:

import psutil, subprocess, random, os, zipfile, shutil, clr, sys, pandas

def initialSetup(pathPowerBI):
    sys.path.append(pathPowerBI)

    #required Analysis Services assemblies
    clr.AddReference("Microsoft.PowerBI.Amo.Core")
    clr.AddReference("Microsoft.PowerBI.Amo")     
    clr.AddReference("Microsoft.PowerBI.AdomdClient")

    global AMO, ADOMD
    import Microsoft.AnalysisServices as AMO
    import Microsoft.AnalysisServices.AdomdClient as ADOMD

def restorePowerPivot(excelName, pathTarget, port, pathPowerBI):   
    #create random folder
    os.chdir(pathTarget)
    folder = os.getcwd()+str(random.randrange(10**6, 10**7))
    os.mkdir(folder)

    #extract PowerPivot model (abf backup)
    archive = zipfile.ZipFile(excelName)
    for member in archive.namelist():
        if ".data" in member:
            filename = os.path.basename(member)
            abfname = os.path.join(folder, filename) + ".abf"
            source = archive.open(member)
            target = file(os.path.join(folder, abfname), 'wb')
            shutil.copyfileobj(source, target)
            del target
    archive.close()

    #start the cmd.exe process to get its PID
    listPIDpre = [proc for proc in psutil.process_iter()]
    process = subprocess.Popen('cmd.exe /k', stdin=subprocess.PIPE)
    listPIDpost = [proc for proc in psutil.process_iter()]
    pid = [proc for proc in listPIDpost if proc not in listPIDpre if "cmd.exe" in str(proc)][0]
    pid = str(pid).split("=")[1].split(",")[0]

    #msmdsrv.ini
    msmdsrvText = '''<ConfigurationSettings>
       <DataDir>{0}</DataDir>
       <TempDir>{0}</TempDir>
       <LogDir>{0}</LogDir>
       <BackupDir>{0}</BackupDir>
       <DeploymentMode>2</DeploymentMode>
       <RecoveryModel>1</RecoveryModel>
       <DisklessModeRequested>0</DisklessModeRequested>
       <CleanDataFolderOnStartup>1</CleanDataFolderOnStartup>
       <AutoSetDefaultInitialCatalog>1</AutoSetDefaultInitialCatalog>
       <Network>
          <Requests>
             <EnableBinaryXML>1</EnableBinaryXML>
             <EnableCompression>1</EnableCompression>
          </Requests>
          <Responses>
             <EnableBinaryXML>1</EnableBinaryXML>
             <EnableCompression>1</EnableCompression>
             <CompressionLevel>9</CompressionLevel>
          </Responses>
          <ListenOnlyOnLocalConnections>1</ListenOnlyOnLocalConnections>
       </Network>
       <Port>{1}</Port>
       <PrivateProcess>{2}</PrivateProcess>
       <InstanceVisible>0</InstanceVisible>
       <Language>1033</Language>
       <Debug>
          <CallStackInError>0</CallStackInError>
       </Debug>
       <Log>
          <Exception>
             <CrashReportsFolder>{0}</CrashReportsFolder>
          </Exception>
          <FlightRecorder>
             <Enabled>0</Enabled>
          </FlightRecorder>
       </Log>
       <AllowedBrowsingFolders>{0}</AllowedBrowsingFolders>
       <ResourceGovernance>
          <GovernIMBIScheduler>0</GovernIMBIScheduler>
       </ResourceGovernance>
       <Feature>
          <ManagedCodeEnabled>1</ManagedCodeEnabled>
       </Feature>
       <VertiPaq>
          <EnableDisklessTMImageSave>0</EnableDisklessTMImageSave>
          <EnableProcessingSimplifiedLocks>1</EnableProcessingSimplifiedLocks>
       </VertiPaq>
    </ConfigurationSettings>'''

    #save ini file to disk, fill it with required parameters
    msmdsrvini = open(folder+"\\msmdsrv.ini", "w")
    msmdsrvText = msmdsrvText.format(folder, port, pid) #{0},{1},{2}
    msmdsrvini.write(msmdsrvText)
    msmdsrvini.close()

    #run AS engine inside the cmd.exe process
    initString = "\"{0}\\msmdsrv.exe\" -c -s \"{1}\""
    initString = initString.format(pathPowerBI.replace("/","\\"),folder)
    process.stdin.write(initString + " \n")

    #connect to the AS instance from Python
    AMOServer = AMO.Server()
    AMOServer.Connect("localhost:{0}".format(port))

    #restore database from PowerPivot abf backup, disconnect
    AMORestoreInfo = AMO.RestoreInfo(os.path.join(folder, abfname))
    AMOServer.Restore(AMORestoreInfo)
    AMOServer.Disconnect()

    return process

And the data-extraction part:

def runQuery(query, port, flag):
    #ADOMD assembly
    ADOMDConn = ADOMD.AdomdConnection("Data Source=localhost:{0}".format(port))
    ADOMDConn.Open()
    ADOMDCommand = ADOMDConn.CreateCommand() 
    ADOMDCommand.CommandText = query

    #read data in via AdomdDataReader object
    DataReader = ADOMDCommand.ExecuteReader()

    #get metadata, number of columns
    SchemaTable = DataReader.GetSchemaTable()
    numCol = SchemaTable.Rows.Count #same as DataReader.FieldCount

    #get column names
    columnNames = []
    for i in range(numCol):
        columnNames.append(str(SchemaTable.Rows[i][0]))

    #fill with data
    data = []
    while DataReader.Read()==True:
        row = []
        for j in range(numCol):
            try:
                row.append(DataReader[j].ToString())
            except:
                row.append(DataReader[j])
        data.append(row)
    df = pandas.DataFrame(data)
    df.columns = columnNames 

    if flag==0:
        DataReader.Close()
        ADOMDConn.Close()

        return df     
    else:   
        #metadata table
        metadataColumnNames = []
        for j in range(SchemaTable.Columns.Count):
            metadataColumnNames.append(SchemaTable.Columns[j].ToString())
        metadata = []
        for i in range(numCol):
            row = []
            for j in range(SchemaTable.Columns.Count):
                try:
                    row.append(SchemaTable.Rows[i][j].ToString())
                except:
                    row.append(SchemaTable.Rows[i][j])
            metadata.append(row)
        metadf = pandas.DataFrame(metadata)
        metadf.columns = metadataColumnNames

        DataReader.Close()
        ADOMDConn.Close()

        return df, metadf

The raw data are then extracted via something like this:

pathPowerBI = "C:/Program Files/Microsoft Power BI Desktop/bin"
initialSetup(pathPowerBI)
session = restorePowerPivot("D:/Downloads/PowerPivotTutorialSample.xlsx", "D:/", 60000, pathPowerBI)
df, metadf = runQuery("EVALUATE dbo_DimProduct", 60000, 1)
endSession(session)

0 讨论(0)

离开以前

2021-01-31 21:07

The problem with getting data out of PowerPivot is that the tabular engine in PowerPivot runs in-process inside Excel and the only way to connect to that engine is to have your code running inside Excel too. (I suspect that it may use shared memory or some other transport, but it's definitely not listening on a TCP port or a named pipe or anything like that which would allow an external process to connect)

We do this in Dax Studio by running a C# VSTO Excel add-in in Excel. However that was only designed to work for testing analytic queries, not for doing bulk data extraction. We marshal the data across from the add-in to the UI using a string variable so the entire dataset must be less than 2Gb or the response gets truncated and you will see an "unrecognizable response" error (the data is serialized into an XMLA rowset which is quite verbose so may see it break when only extracting a few hundred Mb of data)

If you wanted to build a script to automate extracting all the raw data from a model I don't think you will be able to do it with Python as I don't believe you can get the python interpreter running in-process inside Excel. I would look at using a vba macro like this one http://www.powerpivotblog.nl/export-a-table-or-dax-query-from-power-pivot-to-csv-using-vba/

You should find that you can query the model for a list of tables with something like "SELECT * FROM $SYSTEM.DBSCHEMA_TABLES" - you could then loop over each table and extract with a variation of the code in the above link.

0 讨论(0)
发布评论:

提交评论
- 加载中...
情深已故

2021-01-31 21:19

I got in touch with Tom Gleeson (aka Gobán Saor) who was kind enough to let me post his emails here. There are some interesting nuggets in them, so hopefully others will also find them useful.

Email #1

When you say Python, you mean running Python.NET as a standalone exe? If that’s the case, you’re out of luck with Excel PP models (different story for Power BI desktop though). I’ve accessed PP models (2010+) successfully from both VBA, and from Python.NET (via AMO) using similar code to that in your SO question. The difference being (in both VBA & .NET version) is that my code is running in-process within Excel using Excel’s various add-in technologies. (Likely Tableau is also running as an add-in or has embedded Excel within itself enabling similar behaviour). DAX Studio (a useful C# code base to learn the how-tos of PP access) runs both as an Excel add-in and as a standalone EXE, but only as an add-in can it access Excel based PP models.

Email #2

You might find the process of using Python.NET for this somewhat challenging. You would need to embed a Python engine using C#/VB.NET Excel add-in code. I’ve used Excel-DNA (a fantastic open source project) rather than MS’s highly cumbersome "official" method for developing such .NET addins in the past, but I mainly stick to VBA where at all possible.

Using VBA you’ll not be able to access the .NET-only AMO (so no ability to create calculated columns on the fly), but by loading the resulting dataset into an ADO recordset you should be able to output to a worksheet OR to a corporate-database/MS Access OR to a flat-file/CSV etc.

Unlike the 1M worksheet limit, for a flat-file or database output memory (RAM) will be the limiting factor, but, assuming you’re using 64bit Excel and have enough memory to hold the compacted model and the workspace for the largest of the model’s tables in un-compacted form (i.e. a row based rather than column based format that’ll result from a DAX Query), multiplied by 2ish (one instance within PP workspace the other within VBA’s ADO workspace) you should be okay.

Having said that, I’ve never attempted extracting a very large dataset, and using models as a dataset exchange medium is not one of PP’s "use-cases"; so, very large tables might hit some other bug/constraint!

0 讨论(0)
发布评论:

提交评论
- 加载中...