DATEXII XML file to DataFrame in Python

梦毁少年i 2021-01-07 13:38

For the last couple of days I have been trying to open and read a certain XML file (in DATEXII format), but have not succeeded so far. It is about traffic data from the NDW Open

1 Answer
  • 2021-01-07 14:05

    Consider flattening your nested XML source with XSLT, a special-purpose language designed to transform XML files into other XML, HTML, or plain text (CSV/TAB). The XSLT below converts the original XML into comma-separated rows, one per measuredValue, which pandas can then import with read_csv():

    XSLT (save as an .xsl file, which is itself an XML file)

    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                                  xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
                                  xmlns:pub="http://datex2.eu/schema/2/2_0"
                                  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <xsl:output method="text"/>
      <xsl:strip-space elements="*"/>
    
      <xsl:template match="/soapenv:Envelope">
        <xsl:text>publicationTime,country,nationalIdentifier,msmtSiteTableRef_targetClass,msmtSiteTableRef_version,msmtSiteTableRef_id,</xsl:text>
        <xsl:text>msmtSiteRef_targetClass,msmtSiteRef_version,msmtSiteRef_id,measurementTimeDefault,</xsl:text>
        <xsl:text>measuredValue_index,basicData_type,vehicleFlowRate,averageVehicleSpeed_numberOfInputValues,averageVehicleSpeed_value</xsl:text>
        <xsl:text>&#xa;</xsl:text>
        <xsl:apply-templates select="soapenv:Body"/>
      </xsl:template>
    
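      <!-- local-name() matches d2LogicalModel whether or not it carries the DATEX II default namespace -->
      <xsl:template match="soapenv:Body">
        <xsl:apply-templates select="*[local-name()='d2LogicalModel']"/>
      </xsl:template>
    
      <xsl:template match="*[local-name()='d2LogicalModel']">
        <xsl:apply-templates select="pub:payloadPublication"/>
      </xsl:template>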
    
      <xsl:template match="pub:payloadPublication">
        <xsl:apply-templates select="pub:siteMeasurements"/>
      </xsl:template>
    
      <xsl:template match="pub:siteMeasurements">
        <xsl:apply-templates select="pub:measuredValue"/>
      </xsl:template>
    
      <xsl:template match="pub:measuredValue">
        <xsl:value-of select="concat(ancestor::pub:payloadPublication/pub:publicationTime,',',
                                     ancestor::pub:payloadPublication/pub:publicationCreator/pub:country,',',
                                     ancestor::pub:payloadPublication/pub:publicationCreator/pub:nationalIdentifier,',',
                                     ancestor::pub:payloadPublication/pub:measurementSiteTableReference/@targetClass,',',
                                     ancestor::pub:payloadPublication/pub:measurementSiteTableReference/@version,',',
                                     ancestor::pub:payloadPublication/pub:measurementSiteTableReference/@id,',',
                                     ancestor::pub:siteMeasurements/pub:measurementSiteReference/@targetClass,',',
                                     ancestor::pub:siteMeasurements/pub:measurementSiteReference/@version,',',
                                     ancestor::pub:siteMeasurements/pub:measurementSiteReference/@id,',',
                                     ancestor::pub:siteMeasurements/pub:measurementTimeDefault,',',
                                     @index,',',
                                     pub:measuredValue/pub:basicData/@xsi:type,',',
                                     descendant::pub:vehicleFlowRate,',',
                                     descendant::pub:averageVehicleSpeed/@numberOfInputValuesUsed,',',
                                     descendant::pub:speed)"/><xsl:text>&#xa;</xsl:text>    
      </xsl:template>
    
    </xsl:stylesheet>
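
For readers less familiar with XSLT, the traversal the templates perform can be mirrored in plain Python. This is only an illustration using the standard library's xml.etree.ElementTree, not the XSLT approach itself; the Python section that follows still runs the stylesheet. The SAMPLE document, its site id and its values are hypothetical, with the element layout inferred from the XPath expressions above rather than taken from a real NDW file, and only a subset of the columns is produced.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map; "pub" mirrors the DATEX II prefix used in the stylesheet.
NS = {"pub": "http://datex2.eu/schema/2/2_0"}
# ElementTree exposes xsi:type under its fully qualified attribute name.
XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"

# Hypothetical, heavily trimmed sample (made-up id and values).
SAMPLE = """\
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
  <soapenv:Body>
    <d2LogicalModel xmlns="http://datex2.eu/schema/2/2_0"
                    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <payloadPublication>
        <publicationTime>2020-01-01T00:00:00Z</publicationTime>
        <siteMeasurements>
          <measurementSiteReference id="SITE_A" version="1"
                                    targetClass="MeasurementSiteRecord"/>
          <measuredValue index="1">
            <measuredValue>
              <basicData xsi:type="TrafficFlow">
                <vehicleFlow><vehicleFlowRate>60</vehicleFlowRate></vehicleFlow>
              </basicData>
            </measuredValue>
          </measuredValue>
          <measuredValue index="2">
            <measuredValue>
              <basicData xsi:type="TrafficSpeed">
                <averageVehicleSpeed numberOfInputValuesUsed="1">
                  <speed>38</speed>
                </averageVehicleSpeed>
              </basicData>
            </measuredValue>
          </measuredValue>
        </siteMeasurements>
      </payloadPublication>
    </d2LogicalModel>
  </soapenv:Body>
</soapenv:Envelope>
"""


def text_of(node, path):
    """Return the text of the first element matching path, or None."""
    found = node.find(path, NS)
    return found.text if found is not None else None


def flatten(xml_text):
    """Build one dict per outer measuredValue, repeating parent values."""
    root = ET.fromstring(xml_text)
    rows = []
    for pub in root.iterfind(".//pub:payloadPublication", NS):
        pub_time = text_of(pub, "pub:publicationTime")
        for site in pub.iterfind("pub:siteMeasurements", NS):
            ref = site.find("pub:measurementSiteReference", NS)
            site_id = ref.get("id") if ref is not None else None
            # Direct children only, like select="pub:measuredValue".
            for mv in site.iterfind("pub:measuredValue", NS):
                basic = mv.find("pub:measuredValue/pub:basicData", NS)
                rows.append({
                    "publicationTime": pub_time,
                    "msmtSiteRef_id": site_id,
                    "measuredValue_index": int(mv.get("index")),
                    "basicData_type":
                        basic.get(XSI_TYPE) if basic is not None else None,
                    "vehicleFlowRate": text_of(mv, ".//pub:vehicleFlowRate"),
                    "averageVehicleSpeed_value": text_of(mv, ".//pub:speed"),
                })
    return rows


rows = flatten(SAMPLE)
```

Each dict plays the role of one CSV line. Passing rows to pd.DataFrame(rows) should give a comparable frame, although the values stay strings until converted.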
    

    Python

    from io import StringIO
    import lxml.etree as et
    import pandas as pd
    
    # LOAD XML AND XSL FILES
    doc = et.parse('/path/to/Input.xml')
    xsl = et.parse('/path/to/XSLT.xsl')
    
    # INITIALIZE AND RUN TRANSFORMATION
    transform = et.XSLT(xsl)
    # CONVERT RESULT TO STRING 
    result = str(transform(doc))
    
    # IMPORT INTO DATAFRAME
    df = pd.read_csv(StringIO(result))
    

    Output (values from parent nodes repeat on every row; only the measured values differ)

    print(df)
    
    #           publicationTime country nationalIdentifier msmtSiteTableRef_targetClass  msmtSiteTableRef_version msmtSiteTableRef_id msmtSiteRef_targetClass  msmtSiteRef_version     msmtSiteRef_id measurementTimeDefault  measuredValue_index basicData_type  vehicleFlowRate  averageVehicleSpeed_numberOfInputValues  averageVehicleSpeed_value
    # 0  20171030T05:00:40.007Z      nl              NLNDW         MeasurementSiteTable                       955            NDW01_MT   MeasurementSiteRecord                    1  PZH01_MST_0690_00     20171030T04:59:00Z                    1    TrafficFlow             60.0                                      NaN                        NaN
    # 1  20171030T05:00:40.007Z      nl              NLNDW         MeasurementSiteTable                       955            NDW01_MT   MeasurementSiteRecord                    1  PZH01_MST_0690_00     20171030T04:59:00Z                    2    TrafficFlow              0.0                                      NaN                        NaN
    # 2  20171030T05:00:40.007Z      nl              NLNDW         MeasurementSiteTable                       955            NDW01_MT   MeasurementSiteRecord                    1  PZH01_MST_0690_00     20171030T04:59:00Z                    3    TrafficFlow              0.0                                      NaN                        NaN
    # 3  20171030T05:00:40.007Z      nl              NLNDW         MeasurementSiteTable                       955            NDW01_MT   MeasurementSiteRecord                    1  PZH01_MST_0690_00     20171030T04:59:00Z                    4    TrafficFlow             60.0                                      NaN                        NaN
    # 4  20171030T05:00:40.007Z      nl              NLNDW         MeasurementSiteTable                       955            NDW01_MT   MeasurementSiteRecord                    1  PZH01_MST_0690_00     20171030T04:59:00Z                    5   TrafficSpeed              NaN                                      1.0                       38.0
    # 5  20171030T05:00:40.007Z      nl              NLNDW         MeasurementSiteTable                       955            NDW01_MT   MeasurementSiteRecord                    1  PZH01_MST_0690_00     20171030T04:59:00Z                    6   TrafficSpeed              NaN                                      0.0                        1.0
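
Because flow and speed rows share one table, each row leaves the other type's columns as NaN. If separate tables are easier to work with, one option is to filter on basicData_type. A minimal sketch, assuming the column names from the header row of the stylesheet; csv_text only retypes four columns of the rows printed above so that the example is self-contained:

```python
from io import StringIO

import pandas as pd

# Four columns of the printed rows, retyped as CSV (empty fields become NaN).
csv_text = """\
measuredValue_index,basicData_type,vehicleFlowRate,averageVehicleSpeed_value
1,TrafficFlow,60,
2,TrafficFlow,0,
3,TrafficFlow,0,
4,TrafficFlow,60,
5,TrafficSpeed,,38
6,TrafficSpeed,,1
"""
df = pd.read_csv(StringIO(csv_text))

# Boolean masks select the rows belonging to each measurement type.
is_flow = df["basicData_type"] == "TrafficFlow"
is_speed = df["basicData_type"] == "TrafficSpeed"

# Keep only the value column that is populated for that type.
flow = df.loc[is_flow, ["measuredValue_index", "vehicleFlowRate"]]
speed = df.loc[is_speed, ["measuredValue_index", "averageVehicleSpeed_value"]]
```

With the full frame from read_csv(), the same masks apply unchanged; the repeated parent columns can be added to either column list as needed.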
    