Import XML files to PostgreSQL

隐瞒了意图╮ 2020-12-01 04:33

I have a lot of XML files that I would like to import into the table xml_data:

create table xml_data(result xml);

To do this I hav

4 answers
  • 2020-12-01 05:24

    I used tr to replace all newlines with spaces. This produces an XML file with only one line, which I can easily import into a single row using \copy.

    Obviously, this is not a good idea if your XML contains multi-line values. Fortunately, mine does not.

    To import all XML files in a folder, you can use this shell script:

    #!/bin/sh
    # Flatten each XML file to one line, then load it as a single row.
    for f in /folder/with/xml/files/*.xml
    do
      tr '\n' ' ' < "$f" > temp.xml
      psql -d database -h localhost -U usr -c '\copy xml_data from temp.xml'
    done
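
Besides multi-line values, the tr approach has two more failure modes with \copy's default text format: a tab in the file is read as a column delimiter, and a backslash is read as an escape character. Below is a minimal sketch of a stricter flattening step, assuming the same single-column xml_data table. flatten_xml is a hypothetical helper name, not part of psql.

```shell
#!/bin/sh
# Flatten one XML document to a single line that COPY's text format accepts.
# Assumption: no value in the document depends on its newlines or tabs.
flatten_xml() {
  # Double every backslash first (COPY treats "\" as an escape character),
  # then turn newlines, carriage returns and tabs into spaces.
  sed 's/\\/\\\\/g' | tr '\n\r\t' '   '
}

# Demo on an inline document instead of a real file.
printf '<a>\n\t<b>C:\\dir</b>\n</a>\n' | flatten_xml
echo
```

In the loop above, `flatten_xml < "$f" > temp.xml` would take the place of the tr line.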
    
  • 2020-12-01 05:32

    Necromancing: For those that need a working example:

    DO $$
       DECLARE myxml xml;
    BEGIN
    
    myxml := XMLPARSE(DOCUMENT convert_from(pg_read_binary_file('MyData.xml'), 'UTF8'));
    
    DROP TABLE IF EXISTS mytable;
    CREATE TEMP TABLE mytable AS 
    
    SELECT 
         (xpath('//ID/text()', x))[1]::text AS id
        ,(xpath('//Name/text()', x))[1]::text AS Name 
        ,(xpath('//RFC/text()', x))[1]::text AS RFC
        ,(xpath('//Text/text()', x))[1]::text AS Text
        ,(xpath('//Desc/text()', x))[1]::text AS Desc
    FROM unnest(xpath('//record', myxml)) x
    ;
    
    END$$;
    
    
    SELECT * FROM mytable;
    

    Or, with less noise:

    SELECT 
         (xpath('//ID/text()', myTempTable.myXmlColumn))[1]::text AS id
        ,(xpath('//Name/text()', myTempTable.myXmlColumn))[1]::text AS Name 
        ,(xpath('//RFC/text()', myTempTable.myXmlColumn))[1]::text AS RFC
        ,(xpath('//Text/text()', myTempTable.myXmlColumn))[1]::text AS Text
        ,(xpath('//Desc/text()', myTempTable.myXmlColumn))[1]::text AS Desc
        ,myTempTable.myXmlColumn as myXmlElement
    FROM unnest(
        xpath
        (    '//record'
            ,XMLPARSE(DOCUMENT convert_from(pg_read_binary_file('MyData.xml'), 'UTF8'))
        )
    ) AS myTempTable(myXmlColumn)
    ;
    

    With this example XML file (MyData.xml):

    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <data-set>
        <record>
            <ID>1</ID>
            <Name>A</Name>
            <RFC>RFC 1035[1]</RFC>
            <Text>Address record</Text>
            <Desc>Returns a 32-bit IPv4 address, most commonly used to map hostnames to an IP address of the host, but it is also used for DNSBLs, storing subnet masks in RFC 1101, etc.</Desc>
        </record>
        <record>
            <ID>2</ID>
            <Name>NS</Name>
            <RFC>RFC 1035[1]</RFC>
            <Text>Name server record</Text>
            <Desc>Delegates a DNS zone to use the given authoritative name servers</Desc>
        </record>
    </data-set>
    

    Note:
    MyData.xml needs to be in the PostgreSQL data directory (the parent directory of the pg_stat directory), because relative paths passed to pg_read_binary_file are resolved against it,
    e.g. /var/lib/postgresql/9.3/main/MyData.xml
    This requires PostgreSQL 9.1+

    You can also achieve this without a file, like this:

    SELECT 
         (xpath('//ID/text()', myTempTable.myXmlColumn))[1]::text AS id
        ,(xpath('//Name/text()', myTempTable.myXmlColumn))[1]::text AS Name 
        ,(xpath('//RFC/text()', myTempTable.myXmlColumn))[1]::text AS RFC
        ,(xpath('//Text/text()', myTempTable.myXmlColumn))[1]::text AS Text
        ,(xpath('//Desc/text()', myTempTable.myXmlColumn))[1]::text AS Desc
        ,myTempTable.myXmlColumn as myXmlElement 
        -- Source: https://en.wikipedia.org/wiki/List_of_DNS_record_types
    FROM unnest(xpath('//record', 
     CAST('<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <data-set>
        <record>
            <ID>1</ID>
            <Name>A</Name>
            <RFC>RFC 1035[1]</RFC>
            <Text>Address record</Text>
            <Desc>Returns a 32-bit IPv4 address, most commonly used to map hostnames to an IP address of the host, but it is also used for DNSBLs, storing subnet masks in RFC 1101, etc.</Desc>
        </record>
        <record>
            <ID>2</ID>
            <Name>NS</Name>
            <RFC>RFC 1035[1]</RFC>
            <Text>Name server record</Text>
            <Desc>Delegates a DNS zone to use the given authoritative name servers</Desc>
        </record>
    </data-set>
    ' AS xml)   
    )) AS myTempTable(myXmlColumn)
    ;
    

    Note that, unlike in MS-SQL, xpath text() returns NULL for a NULL value, not an empty string.
    If for whatever reason you need to check explicitly for NULL, you can use [not(@xsi:nil="true")]. For this you need to pass an array of namespaces, because otherwise you get an error (you can, however, omit all namespaces except xsi).

    SELECT 
         (xpath('//xmlEncodeTest[1]/text()', myTempTable.myXmlColumn))[1]::text AS c1
    
        ,(
        xpath('//xmlEncodeTest[1][not(@xsi:nil="true")]/text()', myTempTable.myXmlColumn
        ,
        ARRAY[
            -- ARRAY['xmlns','http://www.w3.org/1999/xhtml'], -- defaultns
            ARRAY['xsi','http://www.w3.org/2001/XMLSchema-instance'],
            ARRAY['xsd','http://www.w3.org/2001/XMLSchema'],        
            ARRAY['svg','http://www.w3.org/2000/svg'],
            ARRAY['xsl','http://www.w3.org/1999/XSL/Transform']
        ]
        )
        )[1]::text AS c22
    
    
        ,(xpath('//nixda[1]/text()', myTempTable.myXmlColumn))[1]::text AS c2 
        --,myTempTable.myXmlColumn as myXmlElement
        ,xmlexists('//xmlEncodeTest[1]' PASSING BY REF myTempTable.myXmlColumn) AS c1e
        ,xmlexists('//nixda[1]' PASSING BY REF myTempTable.myXmlColumn) AS c2e
        ,xmlexists('//xmlEncodeTestAbc[1]' PASSING BY REF myTempTable.myXmlColumn) AS c1ea
    FROM unnest(xpath('//row', 
         CAST('<?xml version="1.0" encoding="utf-8"?>
        <table xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
          <row>
            <xmlEncodeTest xsi:nil="true" />
            <nixda>noob</nixda>
          </row>
        </table>
        ' AS xml)   
        )
    ) AS myTempTable(myXmlColumn)
    ;
    

    You can also check if a field is contained in an XML-text, by doing

     ,xmlexists('//xmlEncodeTest[1]' PASSING BY REF myTempTable.myXmlColumn) AS c1e
    

    for example when you pass an XML-value to a stored-procedure/function for CRUD. (see above)

    Also, note that the correct way to pass a null-value in XML is <elementName xsi:nil="true" /> and not <elementName /> or nothing. There is no correct way to pass NULL in attributes (you can only omit the attribute, but then it gets difficult/slow to infer the number of columns and their names in a large dataset).

    e.g.

    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <table>
        <row column1="a" column2="3" />
        <row column1="b" column2="4" column3="true" />
    </table>
    

    (This is more compact, but very bad if you need to import it, especially from XML files with multiple GB of data; see a wonderful example of that in the Stack Overflow data dump.)

    SELECT 
         myTempTable.myXmlColumn
        ,(xpath('//@column1', myTempTable.myXmlColumn))[1]::text AS c1
        ,(xpath('//@column2', myTempTable.myXmlColumn))[1]::text AS c2
        ,(xpath('//@column3', myTempTable.myXmlColumn))[1]::text AS c3
        ,xmlexists('//@column3' PASSING BY REF myTempTable.myXmlColumn) AS c3e
        ,case when (xpath('//@column3', myTempTable.myXmlColumn))[1]::text is null then 1 else 0 end AS is_null 
    FROM unnest(xpath('//row', '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <table>
        <row column1="a" column2="3" />
        <row column1="b" column2="4" column3="true" />
    </table>'
    ))  AS myTempTable(myXmlColumn) 
    
  • 2020-12-01 05:33

    Extending @stefan-steiger's excellent answer, here is an example that extracts XML elements from child nodes that contain multiple siblings (e.g., multiple <synonym> elements for a particular <synonyms> parent node).

    I encountered this issue with my data and searched quite a bit for a solution; his answer was the most helpful to me.

    Example data file, hmdb_metabolites_test.xml:

    <?xml version="1.0" encoding="UTF-8"?>
    <hmdb>
    <metabolite>
      <accession>HMDB0000001</accession>
      <name>1-Methylhistidine</name>
      <synonyms>
        <synonym>(2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoic acid</synonym>
        <synonym>1-Methylhistidine</synonym>
        <synonym>Pi-methylhistidine</synonym>
        <synonym>(2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoate</synonym>
      </synonyms>
    </metabolite>
    <metabolite>
      <accession>HMDB0000002</accession>
      <name>1,3-Diaminopropane</name>
      <synonyms>
        <synonym>1,3-Propanediamine</synonym>
        <synonym>1,3-Propylenediamine</synonym>
        <synonym>Propane-1,3-diamine</synonym>
        <synonym>1,3-diamino-N-Propane</synonym>
      </synonyms>
    </metabolite>
    <metabolite>
      <accession>HMDB0000005</accession>
      <name>2-Ketobutyric acid</name>
      <synonyms>
        <synonym>2-Ketobutanoic acid</synonym>
        <synonym>2-Oxobutyric acid</synonym>
        <synonym>3-Methyl pyruvic acid</synonym>
        <synonym>alpha-Ketobutyrate</synonym>
      </synonyms>
    </metabolite>
    </hmdb>
    

    Aside: the original XML file had a namespace URL (an xmlns declaration) in the document element

    <hmdb xmlns="http://www.hmdb.ca">
    

    that prevented xpath from matching the data, because unprefixed names in an XPath expression do not match elements in a default namespace. The script runs (without error messages), but the relation/table is empty:

    [hmdb_test]# \i /mnt/Vancouver/Programming/data/hmdb/sql/hmdb_test.sql
    DO
     accession | name | synonym 
    -----------+------+---------
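
An alternative that leaves the file untouched is to bind the namespace to a prefix through xpath's optional third argument, the same mechanism the xsi example in the previous answer uses. This is a sketch under assumptions: a reachable database named hmdb_test, an arbitrary prefix h, and a one-record inline document. It was not run against the full data set. The script only prints the query, so it can be inspected before it is piped to psql.

```shell
#!/bin/sh
# Print a query that maps the prefix "h" (an arbitrary choice) to the
# document's default namespace, so the path matches without editing the file.
ns_query() {
  cat <<'SQL'
SELECT unnest(xpath(
         '//h:metabolite/h:accession/text()',
         '<hmdb xmlns="http://www.hmdb.ca"><metabolite><accession>HMDB0000001</accession></metabolite></hmdb>'::xml,
         ARRAY[ARRAY['h', 'http://www.hmdb.ca']]
       ))::text AS accession;
SQL
}

ns_query
# To execute it (the database name is an assumption): ns_query | psql -d hmdb_test
```

Every element name in the path needs the prefix, since all elements below the document element inherit the default namespace.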
    

    Since the source file is 3.4GB, I decided to edit that line using sed:

    sed -i '2s/.*hmdb xmlns.*/<hmdb>/' hmdb_metabolites.xml
    

    [Adding the 2 instructs sed to edit only line 2; coincidentally, in this instance, it also doubled the sed command's execution speed.]
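
The effect of the line address can be checked on a small file before touching the 3.4GB original. This sketch assumes a throwaway file created with mktemp and illustrative contents. It writes to standard output instead of using -i, so the input is left unchanged.

```shell
#!/bin/sh
# Build a small stand-in for the real file (contents are illustrative).
sample=$(mktemp)
printf '%s\n' \
  '<?xml version="1.0" encoding="UTF-8"?>' \
  '<hmdb xmlns="http://www.hmdb.ca">' \
  '<metabolite/>' \
  '</hmdb>' > "$sample"

# "2s/.../.../" applies the substitution to line 2 only.
fixed=$(sed '2s/.*hmdb xmlns.*/<hmdb>/' "$sample")
printf '%s\n' "$fixed"
rm -f "$sample"
```

Only the second line is rewritten to <hmdb>; the XML declaration and the remaining lines pass through as they are.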


    My postgres data folder (PSQL: SHOW data_directory;) is

    /mnt/Vancouver/Programming/RDB/postgres/postgres/data
    

    so, as sudo, I needed to copy my XML data file there and chown it for use in PostgreSQL:

    sudo chown postgres:postgres /mnt/Vancouver/Programming/RDB/postgres/postgres/data/hmdb_metabolites_test.xml
    

    Script (hmdb_test.sql):

    DO $$DECLARE myxml xml;
    
    BEGIN
    
    myxml := XMLPARSE(DOCUMENT convert_from(pg_read_binary_file('hmdb_metabolites_test.xml'), 'UTF8'));
    
    DROP TABLE IF EXISTS mytable;
    
    -- CREATE TEMP TABLE mytable AS 
    CREATE TABLE mytable AS 
    SELECT 
        (xpath('//accession/text()', x))[1]::text AS accession
        ,(xpath('//name/text()', x))[1]::text AS name 
        -- The "synonym" child/subnode has many sibling elements, so we need to
        -- "unnest" them, otherwise we only retrieve the first synonym per record:
        ,unnest(xpath('//synonym/text()', x))::text AS synonym
    FROM unnest(xpath('//metabolite', myxml)) x
    ;
    
    END$$;
    
    -- select * from mytable limit 5;
    SELECT * FROM mytable;
    

    Execution, output (in PSQL):

    [hmdb_test]# \i /mnt/Vancouver/Programming/data/hmdb/hmdb_test.sql
    
      accession  |        name        |                         synonym                          
    -------------+--------------------+----------------------------------------------------------
     HMDB0000001 | 1-Methylhistidine  | (2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoic acid
     HMDB0000001 | 1-Methylhistidine  | 1-Methylhistidine
     HMDB0000001 | 1-Methylhistidine  | Pi-methylhistidine
     HMDB0000001 | 1-Methylhistidine  | (2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoate
     HMDB0000002 | 1,3-Diaminopropane | 1,3-Propanediamine
     HMDB0000002 | 1,3-Diaminopropane | 1,3-Propylenediamine
     HMDB0000002 | 1,3-Diaminopropane | Propane-1,3-diamine
     HMDB0000002 | 1,3-Diaminopropane | 1,3-diamino-N-Propane
     HMDB0000005 | 2-Ketobutyric acid | 2-Ketobutanoic acid
     HMDB0000005 | 2-Ketobutyric acid | 2-Oxobutyric acid
     HMDB0000005 | 2-Ketobutyric acid | 3-Methyl pyruvic acid
     HMDB0000005 | 2-Ketobutyric acid | alpha-Ketobutyrate
    
    [hmdb_test]#
    
  • 2020-12-01 05:36

    I would try a different approach: read the XML file directly into a variable inside a PL/pgSQL function and proceed from there. That should be a lot faster and a lot more robust.

    CREATE OR REPLACE FUNCTION f_sync_from_xml()
      RETURNS boolean AS
    $BODY$
    DECLARE
        myxml    xml;
        datafile text := 'path/to/my_file.xml';
    BEGIN
       myxml := pg_read_file(datafile, 0, 100000000);  -- arbitrary 100 MB max.
    
       CREATE TEMP TABLE tmp AS
       SELECT (xpath('//some_id/text()', x))[1]::text AS id
       FROM   unnest(xpath('/xml/path/to/datum', myxml)) x;
       ...
    

    You need superuser privileges, and the file must be local to the DB server, in an accessible directory.
    Complete code example with more explanation and links:

    • XML data to PostgreSQL database
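
Because the third argument of pg_read_file caps how many bytes are read, a file larger than the cap would be cut off and would most likely fail to parse as XML. Below is a pre-flight size check, as a sketch. fits_read_limit is a hypothetical helper, and the 100000000 default mirrors the arbitrary cap in the function above.

```shell
#!/bin/sh
# Succeed only if the file is no larger than the byte cap (default 100000000).
fits_read_limit() {
  file=$1
  max_bytes=${2:-100000000}
  # wc -c may pad its output with spaces; arithmetic expansion normalizes it.
  size=$(( $(wc -c < "$file") ))
  [ "$size" -le "$max_bytes" ]
}

# Demo with a throwaway 10-byte file and a 10-byte cap.
tmp=$(mktemp)
printf '0123456789' > "$tmp"
if fits_read_limit "$tmp" 10; then echo "fits"; else echo "too large"; fi
rm -f "$tmp"
```

Running the check on the server host before calling the function lets an oversized file be caught early, rather than surfacing as an XML parse error.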