Extracting raw data from a PowerPivot model using Python

名媛妹妹 2021-01-31 20:53

What seemed like a trivial task turned into a real nightmare when I had to read in some data from a PowerPivot model using Python. I believe I've researched this very well over

3 Answers
  • 2021-01-31 20:59

    Lo and behold, I finally managed to crack the problem: accessing Power Pivot data using Python is indeed possible! Below is a short recap of what I did; you can find a more detailed description here: Analysis Services (SSAS) on a shoestring. Note: the code has been optimized neither for efficiency nor elegance.

    • Install Microsoft Power BI Desktop (comes with free Analysis Services server, so no need for a costly SQL Server license - however, the same approach obviously also works if you have a proper license).
    • Fire up the AS engine by first creating the msmdsrv.ini settings file, then restoring the database from the ABF file (using AMO.NET), and finally extracting the data using ADOMD.NET.

    Here's the Python code that illustrates the AS engine + AMO.NET parts:

    import psutil, subprocess, random, os, zipfile, shutil, clr, sys, pandas
    
    def initialSetup(pathPowerBI):
        sys.path.append(pathPowerBI)
    
        #required Analysis Services assemblies
        clr.AddReference("Microsoft.PowerBI.Amo.Core")
        clr.AddReference("Microsoft.PowerBI.Amo")     
        clr.AddReference("Microsoft.PowerBI.AdomdClient")
    
        global AMO, ADOMD
        import Microsoft.AnalysisServices as AMO
        import Microsoft.AnalysisServices.AdomdClient as ADOMD
    
    def restorePowerPivot(excelName, pathTarget, port, pathPowerBI):   
        #create random folder inside pathTarget
        os.chdir(pathTarget)
        folder = os.path.join(os.getcwd(), str(random.randrange(10**6, 10**7)))
        os.mkdir(folder)
    
        #extract PowerPivot model (abf backup)
        archive = zipfile.ZipFile(excelName)
        for member in archive.namelist():
            if ".data" in member:
                filename = os.path.basename(member)
                abfname = os.path.join(folder, filename) + ".abf"
                #open() replaces Python 2's file(); abfname is already a full path
                with archive.open(member) as source, open(abfname, 'wb') as target:
                    shutil.copyfileobj(source, target)
        archive.close()
    
        #start the cmd.exe process; Popen already exposes its PID
        process = subprocess.Popen('cmd.exe /k', stdin=subprocess.PIPE)
        pid = process.pid
    
        #msmdsrv.ini
        msmdsrvText = '''<ConfigurationSettings>
           <DataDir>{0}</DataDir>
           <TempDir>{0}</TempDir>
           <LogDir>{0}</LogDir>
           <BackupDir>{0}</BackupDir>
           <DeploymentMode>2</DeploymentMode>
           <RecoveryModel>1</RecoveryModel>
           <DisklessModeRequested>0</DisklessModeRequested>
           <CleanDataFolderOnStartup>1</CleanDataFolderOnStartup>
           <AutoSetDefaultInitialCatalog>1</AutoSetDefaultInitialCatalog>
           <Network>
              <Requests>
                 <EnableBinaryXML>1</EnableBinaryXML>
                 <EnableCompression>1</EnableCompression>
              </Requests>
              <Responses>
                 <EnableBinaryXML>1</EnableBinaryXML>
                 <EnableCompression>1</EnableCompression>
                 <CompressionLevel>9</CompressionLevel>
              </Responses>
              <ListenOnlyOnLocalConnections>1</ListenOnlyOnLocalConnections>
           </Network>
           <Port>{1}</Port>
           <PrivateProcess>{2}</PrivateProcess>
           <InstanceVisible>0</InstanceVisible>
           <Language>1033</Language>
           <Debug>
              <CallStackInError>0</CallStackInError>
           </Debug>
           <Log>
              <Exception>
                 <CrashReportsFolder>{0}</CrashReportsFolder>
              </Exception>
              <FlightRecorder>
                 <Enabled>0</Enabled>
              </FlightRecorder>
           </Log>
           <AllowedBrowsingFolders>{0}</AllowedBrowsingFolders>
           <ResourceGovernance>
              <GovernIMBIScheduler>0</GovernIMBIScheduler>
           </ResourceGovernance>
           <Feature>
              <ManagedCodeEnabled>1</ManagedCodeEnabled>
           </Feature>
           <VertiPaq>
              <EnableDisklessTMImageSave>0</EnableDisklessTMImageSave>
              <EnableProcessingSimplifiedLocks>1</EnableProcessingSimplifiedLocks>
           </VertiPaq>
        </ConfigurationSettings>'''
    
        #save ini file to disk, fill it with required parameters
        msmdsrvText = msmdsrvText.format(folder, port, pid) #{0},{1},{2}
        with open(os.path.join(folder, "msmdsrv.ini"), "w") as msmdsrvini:
            msmdsrvini.write(msmdsrvText)
    
        #run AS engine inside the cmd.exe process
        initString = "\"{0}\\msmdsrv.exe\" -c -s \"{1}\""
        initString = initString.format(pathPowerBI.replace("/","\\"),folder)
        #stdin is a binary pipe in Python 3, so send bytes and flush
        process.stdin.write((initString + " \n").encode())
        process.stdin.flush()
    
        #connect to the AS instance from Python
        AMOServer = AMO.Server()
        AMOServer.Connect("localhost:{0}".format(port))
    
        #restore database from PowerPivot abf backup, disconnect
        AMORestoreInfo = AMO.RestoreInfo(os.path.join(folder, abfname))
        AMOServer.Restore(AMORestoreInfo)
        AMOServer.Disconnect()
    
        return process
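
One caveat about the function above: it connects right after sending the launch command to cmd.exe, and the engine may need a moment before it starts listening on the port. A minimal sketch of a wait loop that could be called before `AMOServer.Connect(...)`; the name `wait_for_port` and its default timeout values are assumptions for illustration, not part of the original answer:

```python
import socket
import time


def wait_for_port(port, host="localhost", timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while nothing is listening yet
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Usage would be something like `wait_for_port(60000)` before connecting, raising an error if it returns False.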
    

    And the data-extraction part:

    def runQuery(query, port, flag):
        #ADOMD assembly
        ADOMDConn = ADOMD.AdomdConnection("Data Source=localhost:{0}".format(port))
        ADOMDConn.Open()
        ADOMDCommand = ADOMDConn.CreateCommand() 
        ADOMDCommand.CommandText = query
    
        #read data in via AdomdDataReader object
        DataReader = ADOMDCommand.ExecuteReader()
    
        #get metadata, number of columns
        SchemaTable = DataReader.GetSchemaTable()
        numCol = SchemaTable.Rows.Count #same as DataReader.FieldCount
    
        #get column names
        columnNames = []
        for i in range(numCol):
            columnNames.append(str(SchemaTable.Rows[i][0]))
    
        #fill with data
        data = []
        while DataReader.Read():
            row = []
            for j in range(numCol):
                try:
                    row.append(DataReader[j].ToString())
                except AttributeError:
                    #values already converted to Python types (or None) have no ToString()
                    row.append(DataReader[j])
            data.append(row)
        df = pandas.DataFrame(data)
        df.columns = columnNames 
    
        if flag==0:
            DataReader.Close()
            ADOMDConn.Close()
    
            return df     
        else:   
            #metadata table
            metadataColumnNames = []
            for j in range(SchemaTable.Columns.Count):
                metadataColumnNames.append(SchemaTable.Columns[j].ToString())
            metadata = []
            for i in range(numCol):
                row = []
                for j in range(SchemaTable.Columns.Count):
                    try:
                        row.append(SchemaTable.Rows[i][j].ToString())
                    except AttributeError:
                        #values already converted to Python types (or None) have no ToString()
                        row.append(SchemaTable.Rows[i][j])
                metadata.append(row)
            metadf = pandas.DataFrame(metadata)
            metadf.columns = metadataColumnNames
    
            DataReader.Close()
            ADOMDConn.Close()
    
            return df, metadf
    

    The raw data are then extracted via something like this:

    pathPowerBI = "C:/Program Files/Microsoft Power BI Desktop/bin"
    initialSetup(pathPowerBI)
    session = restorePowerPivot("D:/Downloads/PowerPivotTutorialSample.xlsx", "D:/", 60000, pathPowerBI)
    df, metadf = runQuery("EVALUATE dbo_DimProduct", 60000, 1)
    endSession(session)
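
Note that `endSession` is called above but not defined in the code shown. A minimal sketch of what it might look like, assuming `session` is the `subprocess.Popen` object returned by `restorePowerPivot` and that ending the cmd.exe process tree is enough to shut the engine down; the body and the `timeout` parameter are assumptions, not the author's original function:

```python
import subprocess
import sys


def endSession(process, timeout=10):
    """Stop the process returned by restorePowerPivot (assumed: a subprocess.Popen)."""
    if process.poll() is not None:
        return process.returncode  # already exited
    if sys.platform == "win32":
        # /T also ends child processes started from the cmd.exe shell
        subprocess.call(["taskkill", "/F", "/T", "/PID", str(process.pid)])
    else:
        process.terminate()
    return process.wait(timeout=timeout)
```

The temporary folder created by `restorePowerPivot` is not removed by this sketch; it would have to be cleaned up separately.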
    
  • 2021-01-31 21:07

    The problem with getting data out of PowerPivot is that the tabular engine in PowerPivot runs in-process inside Excel and the only way to connect to that engine is to have your code running inside Excel too. (I suspect that it may use shared memory or some other transport, but it's definitely not listening on a TCP port or a named pipe or anything like that which would allow an external process to connect)

    We do this in DAX Studio by running a C# VSTO Excel add-in in Excel. However, that was only designed to work for testing analytic queries, not for doing bulk data extraction. We marshal the data across from the add-in to the UI using a string variable, so the entire dataset must be less than 2 GB or the response gets truncated and you will see an "unrecognizable response" error (the data is serialized into an XMLA rowset, which is quite verbose, so you may see it break when extracting only a few hundred MB of data).

    If you wanted to build a script to automate extracting all the raw data from a model, I don't think you will be able to do it with Python, as I don't believe you can get the Python interpreter running in-process inside Excel. I would look at using a VBA macro like this one: http://www.powerpivotblog.nl/export-a-table-or-dax-query-from-power-pivot-to-csv-using-vba/

    You should find that you can query the model for a list of tables with something like "SELECT * FROM $SYSTEM.DBSCHEMA_TABLES" - you could then loop over each table and extract with a variation of the code in the above link.
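
The loop over tables can be sketched independently of how the query is executed. The snippet below is only an illustration in Python: `run_query` is a hypothetical stand-in for whatever runs a DAX string against the model (the VBA/ADO connection from the link, or a function like `runQuery` from the first answer if the model has been restored into a standalone engine), and the table names are assumed to have been read from the DBSCHEMA_TABLES result beforehand.

```python
def dax_table_ref(table_name):
    """Quote a table name for DAX: wrap in single quotes, double embedded quotes."""
    return "'" + table_name.replace("'", "''") + "'"


def export_tables(table_names, run_query):
    """Run EVALUATE for each table; run_query is any callable taking a DAX string."""
    results = {}
    for name in table_names:
        results[name] = run_query("EVALUATE " + dax_table_ref(name))
    return results
```

Quoting every name keeps the query valid for table names that contain spaces or apostrophes.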

  • 2021-01-31 21:19

    I got in touch with Tom Gleeson (aka Gobán Saor) who was kind enough to let me post his emails here. There are some interesting nuggets in them, so hopefully others will also find them useful.

    Email #1

    When you say Python, you mean running Python.NET as a standalone exe? If that’s the case, you’re out of luck with Excel PP models (different story for Power BI desktop though). I’ve accessed PP models (2010+) successfully from both VBA and Python.NET (via AMO) using similar code to that in your SO question. The difference (in both the VBA and .NET versions) is that my code runs in-process within Excel using Excel’s various add-in technologies. (Likely Tableau is also running as an add-in or has embedded Excel within itself, enabling similar behaviour.) DAX Studio (a useful C# code base to learn the how-tos of PP access) runs both as an Excel add-in and as a standalone EXE, but only as an add-in can it access Excel-based PP models.

    Email #2

    You might find the process of using Python.NET for this somewhat challenging. You would need to embed a Python engine using C#/VB.NET Excel add-in code. I’ve used Excel-DNA (a fantastic open source project) rather than MS’s highly cumbersome "official" method for developing such .NET addins in the past, but I mainly stick to VBA where at all possible.

    Using VBA you’ll not be able to access the .NET-only AMO (so no ability to create calculated columns on the fly), but by loading the resulting dataset into an ADO recordset you should be able to output to a worksheet OR to a corporate-database/MS Access OR to a flat-file/CSV etc.

    Unlike the 1M worksheet limit, for a flat-file or database output memory (RAM) will be the limiting factor, but, assuming you’re using 64bit Excel and have enough memory to hold the compacted model and the workspace for the largest of the model’s tables in un-compacted form (i.e. a row based rather than column based format that’ll result from a DAX Query), multiplied by 2ish (one instance within PP workspace the other within VBA’s ADO workspace) you should be okay.
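
As a rough illustration of the arithmetic in the paragraph above, here is a back-of-the-envelope sketch. It assumes one reading of the email: the compacted model is counted once and the row-based copy of the largest table is counted twice. The function name, the per-row byte size, and the sample figures are all hypothetical.

```python
def estimate_ram_bytes(model_bytes, largest_table_rows, avg_row_bytes, copies=2):
    """Rough RAM needed: compacted model + `copies` row-based copies of the largest table."""
    uncompacted_table = largest_table_rows * avg_row_bytes
    return model_bytes + copies * uncompacted_table


# Hypothetical figures: 500 MiB model, 10 million rows at about 200 bytes per row
needed = estimate_ram_bytes(500 * 1024**2, 10_000_000, 200)
print(round(needed / 1024**3, 2), "GiB")
```

The real per-row size depends on column types and string lengths, so this is only an order-of-magnitude check.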

    Having said that, I’ve never attempted extracting a very large dataset, and using models as a dataset exchange medium is not one of PP’s "use-cases"; so, very large tables might hit some other bug/constraint!
