问题
I´m trying to analize meteorological data, importing directly to R from it´s native structure in XML. But it seems to be a very complicated XML format not corresponding to the commonly used standard of "one observation per row". The provider of the data has grouped the variables by every ten minutes intervals registered.
Here is a piece of the XML code:
<?xml version= "1.0" encoding="ISO-8859-1" ?>
<mes xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="C069_2018_1.xsd">
<dia Dia="2018-1-01">
<hora Hora="00:00">
<Meteoros>
<Dir.Med._a_1800cm>250.5</Dir.Med._a_1800cm>
<Humedad._a_170cm>43.94</Humedad._a_170cm>
<Irradia.._a_273cm>0.0</Irradia.._a_273cm>
<Precip.._a_144cm>0.0</Precip.._a_144cm>
<Sig.Dir._a_1800cm>17.82</Sig.Dir._a_1800cm>
<Sig.Vel._a_1800cm>2.78</Sig.Vel._a_1800cm>
<Tem.Aire._a_170cm>12.57</Tem.Aire._a_170cm>
<Vel.Max._a_1800cm>15.48</Vel.Max._a_1800cm>
<Vel.Med._a_1800cm>8.6</Vel.Med._a_1800cm>
</Meteoros>
</hora>
<hora Hora="00:10">
<Meteoros>
<Dir.Med._a_1800cm>249.3</Dir.Med._a_1800cm>
<Humedad._a_170cm>44.65</Humedad._a_170cm>
<Irradia.._a_273cm>0.0</Irradia.._a_273cm>
<Precip.._a_144cm>0.0</Precip.._a_144cm>
<Sig.Dir._a_1800cm>20.21</Sig.Dir._a_1800cm>
<Sig.Vel._a_1800cm>2.32</Sig.Vel._a_1800cm>
<Tem.Aire._a_170cm>12.55</Tem.Aire._a_170cm>
<Vel.Max._a_1800cm>14.5</Vel.Max._a_1800cm>
<Vel.Med._a_1800cm>7.8</Vel.Med._a_1800cm>
</Meteoros>
</hora>
<hora Hora="00:20">
<Meteoros>
<Dir.Med._a_1800cm>250.3</Dir.Med._a_1800cm>
<Humedad._a_170cm>46.17</Humedad._a_170cm>
<Irradia.._a_273cm>0.0</Irradia.._a_273cm>
<Precip.._a_144cm>0.0</Precip.._a_144cm>
<Sig.Dir._a_1800cm>23.02</Sig.Dir._a_1800cm>
<Sig.Vel._a_1800cm>2.25</Sig.Vel._a_1800cm>
<Tem.Aire._a_170cm>12.45</Tem.Aire._a_170cm>
<Vel.Max._a_1800cm>13.72</Vel.Max._a_1800cm>
<Vel.Med._a_1800cm>5.55</Vel.Med._a_1800cm>
</Meteoros>
</hora>
...
And here is the full XML for the data of january 2019 (>60 mb):
http://opendata.euskadi.eus/contenidos/ds_meteorologicos/met_stations_ds_2018/opendata/2018/C069/C069_2018_1.xml
When I used the function "xmlTreeParse" I got the error:
"Error: XML content does not seem to be XML"
It´s my first attempt with XML data structure, but I´ve been trying the recomendations of similar questions on this site as:
Transforming data from xml into R dataframe
xml to dataframe in r
R XML to Dataframe
But those seem to be simple XML structures that works just fine parsing directly or even converting directly to dataframes with the libraries "XML" and "methods"
I need to obtain a dataframe with similar structure to this:
dia hora Dir.Med._a_1800cm Humedad._a_170cm Irradia.._a_273cm Precip.._a_144cm Sig.Dir._a_1800cm Sig.Vel._a_1800cm Tem.Aire._a_170cm Vel.Max._a_1800cm Vel.Med._a_1800cm
01/01/2018 0:00 250.5 43.94 0.0 0.0 17.82 2.78 12.57 15.48 8.6
01/01/2018 0:10 249.3 44.65 0.0 0.0 20.21 2.32 12.55 14.5 7.8
01/01/2018 0:20 250.3 46.17 0.0 0.0 23.02 2.25 12.45 13.72 5.55
回答1:
It's quite some work but not impossible. This solution will also work with different number of observations per day.
dia <- xmlstr %>% read_xml() %>% xml_find_all("//dia")
dia.dat <- dia %>% map(xml_attrs) %>% map(~t(.) %>% as_tibble)
hora <- dia %>% map(xml_children)
hora.dat <- hora %>% map(xml_attrs) %>% map(~map_df(., ~t(.) %>% as_tibble))
hora.details <- hora %>%
map(~map(.,xml_children) %>%
map(xml_children) %>%
map(~setNames(xml_text(.), xml_name(.)) %>% t() %>% as_tibble)) %>% map(.,~do.call(rbind,.) %>% as_tibble)
pmap_df(list(dia.dat, hora.dat, hora.details),cbind)
I added some data to your xml example to test. (1 extra day and 2nd day an extra hour).
Result:
Dia Hora Dir.Med._a_1800cm Humedad._a_170cm Irradia.._a_273cm Precip.._a_144cm Sig.Dir._a_1800cm
1 2018-1-01 00:00 250.5 43.94 0.0 0.0 17.82
2 2018-1-01 00:10 249.3 44.65 0.0 0.0 20.21
3 2018-1-01 00:20 250.3 46.17 0.0 0.0 23.02
4 2018-1-02 00:00 250.5 43.94 0.0 0.0 17.82
5 2018-1-02 00:10 249.3 44.65 0.0 0.0 20.21
6 2018-1-02 00:20 250.3 46.17 0.0 0.0 23.02
7 2018-1-02 00:30 250.3 46.17 0.0 0.0 23.02
Sig.Vel._a_1800cm Tem.Aire._a_170cm Vel.Max._a_1800cm Vel.Med._a_1800cm
1 2.78 12.57 15.48 8.6
2 2.32 12.55 14.5 7.8
3 2.25 12.45 13.72 5.55
4 2.78 12.57 15.48 8.6
5 2.32 12.55 14.5 7.8
6 2.25 12.45 13.72 5.55
7 2.25 12.45 13.72 5.55
credits to answers of:
converting XML nodes to a dataframe
Getting all the children nodes of XML file to data.frame or data.table
回答2:
Consider XSLT, special-purpose language designed to transform XML files and sibling to XPath, to migrate the top level dia and hora nodes into Meteoros level regardless of number of nodes:
XSLT (save as .xsl, a special .xml file)
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="xml" indent="yes" />
<xsl:strip-space elements="*"/>
<!-- IDENTITY TRANFORM -->
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<!-- REPEAT dia and hora NODES -->
<xsl:template match="Meteoros">
<xsl:copy>
<dia><xsl:value-of select="ancestor::dia/@Dia"/></dia>
<hora><xsl:value-of select="ancestor::hora/@Hora"/></hora>
<xsl:apply-templates/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Online demo
R (no loops or mapping needed)
library(XML)
library(xslt)
doc <- read_xml("Import.xml", package = "xslt")
style <- read_xml("Script.xsl", package = "xslt")
new_xml <- xml_xslt(doc, style)
new_doc <- XML::xmlParse(new_xml)
meteoros_df <- XML::xmlToDataFrame(nodes=getNodeSet(new_doc, "//Meteoros"))
来源:https://stackoverflow.com/questions/57875654/meteorological-data-from-xml-to-dataframe-in-r