The last couple of days I have been trying to open and read a certain XML file (in DATEXII format), but have not succeeded so far. It is about traffic data from the NDW Open
Consider transforming your nested XML input into a flatter structure using XSLT, the special-purpose language designed to transform XML files into other XML, HTML, or even plain text (CSV/TAB). The XSLT below converts the original XML into comma-separated values in tabular form, which pandas can then import with read_csv():
XSLT (save as an .xsl file, which is itself a special XML file)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:pub="http://datex2.eu/schema/2/2_0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<xsl:output method="text"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/soapenv:Envelope">
<xsl:text>publicationTime,country,nationalIdentifier,msmtSiteTableRef_targetClass,msmtSiteTableRef_version,msmtSiteTableRef_id,</xsl:text>
<xsl:text>msmtSiteRef_targetClass,msmtSiteRef_version,msmtSiteRef_id,measurementTimeDefault,</xsl:text>
<xsl:text>measuredValue_index,basicData_type,vehicleFlowRate,averageVehicleSpeed_numberOfInputValues,averageVehicleSpeed_value</xsl:text>
<xsl:text>
</xsl:text>
<xsl:apply-templates select="soapenv:Body"/>
</xsl:template>
<xsl:template match="soapenv:Body">
<xsl:apply-templates select="d2LogicalModel"/>
</xsl:template>
<xsl:template match="d2LogicalModel">
<xsl:apply-templates select="pub:payloadPublication"/>
</xsl:template>
<xsl:template match="pub:payloadPublication">
<xsl:apply-templates select="pub:siteMeasurements"/>
</xsl:template>
<xsl:template match="pub:siteMeasurements">
<xsl:apply-templates select="pub:measuredValue"/>
</xsl:template>
<xsl:template match="pub:measuredValue">
<xsl:value-of select="concat(ancestor::pub:payloadPublication/pub:publicationTime,',',
ancestor::pub:payloadPublication/pub:publicationCreator/pub:country,',',
ancestor::pub:payloadPublication/pub:publicationCreator/pub:nationalIdentifier,',',
ancestor::pub:payloadPublication/pub:measurementSiteTableReference/@targetClass,',',
ancestor::pub:payloadPublication/pub:measurementSiteTableReference/@version,',',
ancestor::pub:payloadPublication/pub:measurementSiteTableReference/@id,',',
ancestor::pub:siteMeasurements/pub:measurementSiteReference/@targetClass,',',
ancestor::pub:siteMeasurements/pub:measurementSiteReference/@version,',',
ancestor::pub:siteMeasurements/pub:measurementSiteReference/@id,',',
ancestor::pub:siteMeasurements/pub:measurementTimeDefault,',',
@index,',',
pub:measuredValue/pub:basicData/@xsi:type,',',
descendant::pub:vehicleFlowRate,',',
descendant::pub:averageVehicleSpeed/@numberOfInputValuesUsed,',',
descendant::pub:speed)"/><xsl:text>
</xsl:text>
</xsl:template>
</xsl:stylesheet>
Python
from io import StringIO
import lxml.etree as et
import pandas as pd
# LOAD XML AND XSL FILES
doc = et.parse('/path/to/Input.xml')
xsl = et.parse('/path/to/XSLT.xsl')
# INITIALIZE AND RUN TRANSFORMATION
transform = et.XSLT(xsl)
# CONVERT RESULT TO STRING
result = str(transform(doc))
# IMPORT INTO DATAFRAME
df = pd.read_csv(StringIO(result))
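If the result contains only the header row, a common cause is a namespace URI in the stylesheet that does not match the one declared in the input: XSLT 1.0 matches elements by namespace URI, not by prefix. The sketch below is one way to list the declarations actually present in a file using only the standard library. The helper name declared_namespaces and the inline SAMPLE document are made up for illustration and are not taken from the NDW feed.

```python
import io
import xml.etree.ElementTree as ET


def declared_namespaces(source):
    """Return {prefix: uri} for each namespace declaration in the XML source.

    `source` is a filename or a binary file object. Only the first URI seen
    for a given prefix is kept.
    """
    namespaces = {}
    # "start-ns" events yield a (prefix, uri) tuple instead of an element.
    for _event, (prefix, uri) in ET.iterparse(source, events=("start-ns",)):
        namespaces.setdefault(prefix, uri)
    return namespaces


# Hypothetical sample, used here instead of '/path/to/Input.xml'.
SAMPLE = b"""<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
  <soapenv:Body>
    <pub:payloadPublication xmlns:pub="http://datex2.eu/schema/2/2_0"
                            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"/>
  </soapenv:Body>
</soapenv:Envelope>"""

ns = declared_namespaces(io.BytesIO(SAMPLE))
for prefix, uri in ns.items():
    print(repr(prefix), uri)
```

A default namespace (declared as xmlns="...") is reported with an empty prefix. Elements under a default namespace still need a prefixed name in XSLT 1.0 paths, so compare each URI printed here with the xmlns: declarations at the top of the stylesheet.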
Output (parent node values repeat on every row as indicators; only the numeric measurements differ)
print(df)
#           publicationTime  country  nationalIdentifier  msmtSiteTableRef_targetClass  msmtSiteTableRef_version  msmtSiteTableRef_id  msmtSiteRef_targetClass  msmtSiteRef_version     msmtSiteRef_id  measurementTimeDefault  measuredValue_index  basicData_type  vehicleFlowRate  averageVehicleSpeed_numberOfInputValues  averageVehicleSpeed_value
# 0  20171030T05:00:40.007Z       nl               NLNDW          MeasurementSiteTable                       955             NDW01_MT    MeasurementSiteRecord                    1  PZH01_MST_0690_00      20171030T04:59:00Z                    1     TrafficFlow             60.0                                      NaN                        NaN
# 1  20171030T05:00:40.007Z       nl               NLNDW          MeasurementSiteTable                       955             NDW01_MT    MeasurementSiteRecord                    1  PZH01_MST_0690_00      20171030T04:59:00Z                    2     TrafficFlow              0.0                                      NaN                        NaN
# 2  20171030T05:00:40.007Z       nl               NLNDW          MeasurementSiteTable                       955             NDW01_MT    MeasurementSiteRecord                    1  PZH01_MST_0690_00      20171030T04:59:00Z                    3     TrafficFlow              0.0                                      NaN                        NaN
# 3  20171030T05:00:40.007Z       nl               NLNDW          MeasurementSiteTable                       955             NDW01_MT    MeasurementSiteRecord                    1  PZH01_MST_0690_00      20171030T04:59:00Z                    4     TrafficFlow             60.0                                      NaN                        NaN
# 4  20171030T05:00:40.007Z       nl               NLNDW          MeasurementSiteTable                       955             NDW01_MT    MeasurementSiteRecord                    1  PZH01_MST_0690_00      20171030T04:59:00Z                    5    TrafficSpeed              NaN                                      1.0                       38.0
# 5  20171030T05:00:40.007Z       nl               NLNDW          MeasurementSiteTable                       955             NDW01_MT    MeasurementSiteRecord                    1  PZH01_MST_0690_00      20171030T04:59:00Z                    6    TrafficSpeed              NaN                                      0.0                        1.0
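To see why the parent values repeat, it can help to write the same flattening by hand. The following is a rough standard-library sketch, not an exact equivalent of the stylesheet: it visits each pub:siteMeasurements, emits one row per pub:measuredValue, and copies the enclosing values into every row. The SAMPLE fragment is a hypothetical, trimmed-down document modelled on the columns above (wrapper names such as pub:vehicleFlow are assumptions), the function name flatten is made up, and only a subset of the columns is produced.

```python
import csv
import io
import xml.etree.ElementTree as ET

NS = {
    "pub": "http://datex2.eu/schema/2/2_0",
    "xsi": "http://www.w3.org/2001/XMLSchema-instance",
}
# ElementTree exposes namespaced attributes under the key "{uri}name".
XSI_TYPE = "{" + NS["xsi"] + "}type"

# Hypothetical, trimmed sample (the SOAP envelope is left out for brevity).
SAMPLE = """\
<pub:payloadPublication xmlns:pub="http://datex2.eu/schema/2/2_0"
                        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <pub:publicationTime>20171030T05:00:40.007Z</pub:publicationTime>
  <pub:siteMeasurements>
    <pub:measurementSiteReference targetClass="MeasurementSiteRecord" version="1" id="PZH01_MST_0690_00"/>
    <pub:measurementTimeDefault>20171030T04:59:00Z</pub:measurementTimeDefault>
    <pub:measuredValue index="1">
      <pub:measuredValue>
        <pub:basicData xsi:type="TrafficFlow">
          <pub:vehicleFlow><pub:vehicleFlowRate>60</pub:vehicleFlowRate></pub:vehicleFlow>
        </pub:basicData>
      </pub:measuredValue>
    </pub:measuredValue>
    <pub:measuredValue index="5">
      <pub:measuredValue>
        <pub:basicData xsi:type="TrafficSpeed">
          <pub:averageVehicleSpeed numberOfInputValuesUsed="1"><pub:speed>38</pub:speed></pub:averageVehicleSpeed>
        </pub:basicData>
      </pub:measuredValue>
    </pub:measuredValue>
  </pub:siteMeasurements>
</pub:payloadPublication>
"""


def flatten(xml_text):
    """Return one dict per measuredValue, repeating the parent-level values."""
    root = ET.fromstring(xml_text)
    pub_time = root.findtext("pub:publicationTime", default="", namespaces=NS)
    rows = []
    for site in root.findall("pub:siteMeasurements", NS):
        ref = site.find("pub:measurementSiteReference", NS)
        site_id = ref.get("id", "") if ref is not None else ""
        time_default = site.findtext("pub:measurementTimeDefault", default="", namespaces=NS)
        for mv in site.findall("pub:measuredValue", NS):
            basic = mv.find("pub:measuredValue/pub:basicData", NS)
            avg_speed = mv.find(".//pub:averageVehicleSpeed", NS)
            rows.append({
                "publicationTime": pub_time,
                "msmtSiteRef_id": site_id,
                "measurementTimeDefault": time_default,
                "measuredValue_index": mv.get("index", ""),
                "basicData_type": basic.get(XSI_TYPE, "") if basic is not None else "",
                # Missing measurements stay empty, which read_csv would load as NaN.
                "vehicleFlowRate": mv.findtext(".//pub:vehicleFlowRate", default="", namespaces=NS),
                "averageVehicleSpeed_numberOfInputValues":
                    avg_speed.get("numberOfInputValuesUsed", "") if avg_speed is not None else "",
                "averageVehicleSpeed_value": mv.findtext(".//pub:speed", default="", namespaces=NS),
            })
    return rows


rows = flatten(SAMPLE)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

For large feeds the XSLT route is likely the better fit, since the transformation runs inside libxslt rather than in a Python loop; the sketch is mainly useful for checking a handful of rows against the stylesheet's output.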