Converting very large files from xml to csv

≡放荡痞女 提交于 2019-12-23 03:29:21

问题


Currently I'm using the following code snippet to convert a .txt file with XML data to .CSV format. My question is this, currently this works perfectly with files that are around 100-200 mbs and the conversion time is very low (1-2 minutes max), However I now need this to work for much bigger files (1-2 GB's each file). Currently the program freezes the computer and the conversion takes about 30-40 minutes with this function. Not sure how I would proceed changing this function. Any help will be appreciated!

    string all_lines = File.ReadAllText(p);

    all_lines = "<Root>" + all_lines + "</Root>";
    XmlDocument doc_all = new XmlDocument();
    doc_all.LoadXml(all_lines);
    StreamWriter write_all = new StreamWriter(FILENAME1);
    XmlNodeList rows_all = doc_all.GetElementsByTagName("XML");

    foreach (XmlNode rowtemp in rows_all)
    {
        List<string> children_all = new List<string>();
        foreach (XmlNode childtemp in rowtemp.ChildNodes)
        {
            children_all.Add(Regex.Replace(childtemp.InnerText, "\\s+", " "));             
        }
        write_all.WriteLine(string.Join(",", children_all.ToArray()));
    }
    write_all.Flush();
    write_all.Close();

Sample Input::

 <XML><DSTATUS>1,4,7,,5</DSTATUS><EVENT> hello,there,my,name,is,jack,</EVENT>
     last,name,missing,above <ANOTHERTAG>3,6,7,,8,4</ANOTHERTAG> </XML>

 <XML><DSTATUS>1,5,7,,3</DSTATUS><EVENT>hello,there,my,name,is,mary,jane</EVENT>
     last,name,not,missing,above<ANOTHERTAG>3,6,7,,8,4</ANOTHERTAG></XML>

Sample Output::

1,4,7,,5,hello,there,my,name,is,jack,,last,name,missing,above,3,6,7,,8,4
1,5,7,,3,hello,there,my,name,is,mary,jane,last,name,not,missing,above,3,6,7,,8,4

回答1:


You need to take a streaming approach, as you're currently reading the entire 2Gb file into memory and then processing it. You should read a bit of XML, write a bit of CSV and keep doing that until you've processed it all.

A possible solution is below:

using (var writer = new StreamWriter(FILENAME1))
{
    foreach (var element in StreamElements(r, "XML"))
    {
        var values = element.DescendantNodes()
            .OfType<XText>()
            .Select(e => Regex.Replace(e.Value, "\\s+", " "));

        var line = string.Join(",", values);

        writer.WriteLine(line);
    }
}

Where StreamElements is inspired by Jon Skeet's streaming of XElements from an XmlReader in an answer to this question. I've made some changes to support your 'invalid' XML (as you have no root element):

private static IEnumerable<XElement> StreamElements(string fileName, string elementName)
{
    var settings = new XmlReaderSettings
    {
        ConformanceLevel = ConformanceLevel.Fragment
    };

    using (XmlReader reader = XmlReader.Create(fileName, settings))
    {
        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.Element)
            {
                if (reader.Name == elementName)
                {
                    var el = XNode.ReadFrom(reader) as XElement;
                    if (el != null)
                    {
                        yield return el;
                    }
                }
            }
        }
    }
}



回答2:


If you're prepared to consider a completely different way of doing it, download Saxon-EE 9.6, get an evaluation license, and run the following streaming XSLT 3.0 code:

<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template name="main">
  <xsl:stream href="input.xml">
    <xsl:for-each select="*/*">
       <xsl:value-of select="*!normalize-space()" separator=","/>
       <xsl:text>&#xa;</xsl:text>
    </xsl:for-each>
  </xsl:stream>
</xsl:template>

</xsl:stylesheet>



回答3:


It freezes because of File.ReadAllText(p);

Do not read the complete file into memory. (This will first start swapping, then halt your CPU because no more memory is available)

Use a chunking approach: Read line by line, convert line by line, write line by line.

Use some lower level XML Reader class, not XmlDocument




回答4:


There are two variants. First is to hide program-freeze, use BackgroundWorker for it. Second: read your text file string-by-string, use any Reader for it (Xml or any text\file). You can combine these variants.



来源:https://stackoverflow.com/questions/30916104/converting-very-large-files-from-xml-to-csv

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!