Converting CSV to JSON in bash

梦毁少年i 2021-02-04 03:12

I'm trying to convert a CSV file into JSON.

Here are two sample lines:

-21.3214077;55.4851413;Ruizia cordata
-21.3213078;55.4849803;Cossinia pinnata


        
9 answers
  • 2021-02-04 03:28

    Here is an article on the subject: https://infiniteundo.com/post/99336704013/convert-csv-to-json-with-jq

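    It also uses jq, but takes a slightly different approach based on split() and map(). Note that .[1:] drops the first line as a header row, so remove it for headerless input such as the sample above.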

    jq --slurp --raw-input \
       'split("\n") | .[1:] | map(split(";")) |
          map({
             "position": [.[0], .[1]],
             "taxo": {
                 "espece": .[2]
              }
          })' \
      input.csv > output.json
    

    It doesn't handle separator escaping, though.

  • 2021-02-04 03:34

    For completeness' sake, Xidel together with some XQuery magic can do this too:

    xidel -s input.csv --xquery '
      {
        "occurrences":for $x in tokenize($raw,"\n") let $a:=tokenize($x,";") return {
          "position":[
            $a[1],
            $a[2]
          ],
          "taxo":{
            "espece":$a[3]
          }
        }
      }
    '
    
    {
      "occurrences": [
        {
          "position": ["-21.3214077", "55.4851413"],
          "taxo": {
            "espece": "Ruizia cordata"
          }
        },
        {
          "position": ["-21.3213078", "55.4849803"],
          "taxo": {
            "espece": "Cossinia pinnata"
          }
        }
      ]
    }
    
  • 2021-02-04 03:35

    The right tool for this job is jq.

    jq -Rsn '
      {"occurrences":
        [inputs
         | . / "\n"
         | (.[] | select(length > 0) | . / ";") as $input
         | {"position": [$input[0], $input[1]], "taxo": {"espece": $input[2]}}]}
    ' <se.csv
    

    emits, given your input:

    {
      "occurrences": [
        {
          "position": [
            "-21.3214077",
            "55.4851413"
          ],
          "taxo": {
            "espece": "Ruizia cordata"
          }
        },
        {
          "position": [
            "-21.3213078",
            "55.4849803"
          ],
          "taxo": {
            "espece": "Cossinia pinnata"
          }
        }
      ]
    }
    

    By the way, a less-buggy version of your original script might look like:

    #!/usr/bin/env bash
    
    items=( )
    while IFS=';' read -r lat long pos _; do
      printf -v item '{ "position": [%s, %s], "taxo": {"espece": "%s"}}' "$lat" "$long" "$pos"
      items+=( "$item" )
    done <se.csv
    
    IFS=','
    printf '{"occurrences": [%s]}\n' "${items[*]}"
    

    Note:

    • There's absolutely no point using cat to pipe into a loop (and good reasons not to); thus, we're using a redirection (<) to open the file directly as the loop's stdin.
    • read can be passed a list of destination variables; there's thus no need to read into an array (or first to read into a string, then generate a herestring, and read from that into an array). The _ at the end ensures that extra columns are discarded (by putting them into the dummy variable named _) rather than appended to pos.
    • "${array[*]}" generates a string by concatenating elements of array with the character in IFS; we can thus use this to ensure that commas are present in the output only when they're needed.
    • printf is used in preference to echo, as advised in the APPLICATION USAGE section of the specification for echo itself.
    • This is still inherently buggy since it's generating JSON via string concatenation. Don't use it.
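
    The points about read and "${array[*]}" can be tried in isolation. The following is a minimal sketch in plain bash; the function names split_fields and join_with_commas are hypothetical, chosen only for illustration, and a herestring is used here merely to feed a single sample line to read:

```shell
#!/usr/bin/env bash

# split_fields LINE: split a ';'-separated line into three named variables.
# Any columns beyond the third land in the dummy variable _ and are dropped.
split_fields() {
  local lat long pos
  IFS=';' read -r lat long pos _ <<<"$1"
  printf '%s|%s|%s\n' "$lat" "$long" "$pos"
}

# join_with_commas ITEM...: join the arguments with ',' via "${array[*]}".
# IFS is made local so the caller's word splitting is left untouched.
join_with_commas() {
  local IFS=','
  local items=( "$@" )
  printf '%s\n' "${items[*]}"
}

split_fields '-21.3214077;55.4851413;Ruizia cordata;extra;columns'
join_with_commas '{"a":1}' '{"b":2}'
```

    Because the separator is only placed between elements, a single item produces no comma at all, which is what keeps the generated list well-formed.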
  • 2021-02-04 03:35

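    John Kerl's Miller tool has this built in (for semicolon-separated input without a header row, such as the sample, add --ifs ';' and --implicit-csv-header):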

    mlr --c2j --jlistwrap cat INPUT.csv > OUTPUT.json
    
  • 2021-02-04 03:36

    If you want to go crazy, you can write a parser using jq. Here's my implementation, which can be thought of as the inverse of the @csv filter. Put this into your ~/.jq file.

    def do_if(pred; update):
        if pred then update else . end;
    def _parse_delimited($_delim; $_quot; $_nl; $_skip):
        [($_delim, $_quot, $_nl, $_skip)|explode[]] as [$delim, $quot, $nl, $skip] |
        [0,1,2,3,4,5] as [$s_start,$s_next_value,$s_read_value,$s_read_quoted,$s_escape,$s_final] |
        def _append($arr; $value):
            $arr + [$value];
        def _do_start($c):
            if $c == $nl then
                [$s_start, null, null, _append(.[3]; [""])]
            elif $c == $delim then
                [$s_next_value, null, [""], .[3]]
            elif $c == $quot then
                [$s_read_quoted, [], [], .[3]]
            else
                [$s_read_value, [$c], [], .[3]]
            end;
        def _do_next_value($c):
            if $c == $nl then
                [$s_start, null, null, _append(.[3]; _append(.[2]; ""))]
            elif $c == $delim then
                [$s_next_value, null, _append(.[2]; ""), .[3]]
            elif $c == $quot then
                [$s_read_quoted, [], .[2], .[3]]
            else
                [$s_read_value, [$c], .[2], .[3]]
            end;
        def _do_read_value($c):
            if $c == $nl then
                [$s_start, null, null, _append(.[3]; _append(.[2]; .[1]|implode))]
            elif $c == $delim then
                [$s_next_value, null, _append(.[2]; .[1]|implode), .[3]]
            else
                [$s_read_value, _append(.[1]; $c), .[2], .[3]]
            end;
        def _do_read_quoted($c):
            if $c == $quot then
                [$s_escape, .[1], .[2], .[3]]
            else
                [$s_read_quoted, _append(.[1]; $c), .[2], .[3]]
            end;
        def _do_escape($c):
            if $c == $nl then
                [$s_start, null, null, _append(.[3]; _append(.[2]; .[1]|implode))]
            elif $c == $delim then
                [$s_next_value, null, _append(.[2]; .[1]|implode), .[3]]
            else
                [$s_read_quoted, _append(.[1]; $c), .[2], .[3]]
            end;
        def _do_final($c):
            .;
        def _do_finalize:
            if .[0] == $s_start then
                [$s_final, null, null, .[3]]
            elif .[0] == $s_next_value then
                [$s_final, null, null, _append(.[3]; [""])]
            elif .[0] == $s_read_value then
                [$s_final, null, null, _append(.[3]; _append(.[2]; .[1]|implode))]
            elif .[0] == $s_read_quoted then
                [$s_final, null, null, _append(.[3]; _append(.[2]; .[1]|implode))]
            elif .[0] == $s_escape then
                [$s_final, null, null, _append(.[3]; _append(.[2]; .[1]|implode))]
            else # .[0] == $s_final
                .
            end;
        reduce explode[] as $c (
            [$s_start,null,null,[]];
            do_if($c != $skip;
                if .[0] == $s_start then
                    _do_start($c)
                elif .[0] == $s_next_value then
                    _do_next_value($c)
                elif .[0] == $s_read_value then
                    _do_read_value($c)
                elif .[0] == $s_read_quoted then
                    _do_read_quoted($c)
                elif .[0] == $s_escape then
                    _do_escape($c)
                else # .[0] == $s_final
                    _do_final($c)
                end
            )
        )
        | _do_finalize[3][];
    def parse_delimited($delim; $quot; $nl; $skip):
        _parse_delimited($delim; $quot; $nl; $skip);
    def parse_delimited($delim; $quot; $nl):
        parse_delimited($delim; $quot; $nl; "\r");
    def parse_delimited($delim; $quot):
        parse_delimited($delim; $quot; "\n");
    def parse_delimited($delim):
        parse_delimited($delim; "\"");
    def parse_csv:
        parse_delimited(",");
    

    For your data, you would want to change the delimiter to semicolons.

    $ cat se.csv
    -21.3214077;55.4851413;Ruizia cordata
    -21.3213078;55.4849803;Cossinia pinnata
    $ jq -R 'parse_delimited(";")' se.csv
    [
      "-21.3214077",
      "55.4851413",
      "Ruizia cordata"
    ]
    [
      "-21.3213078",
      "55.4849803",
      "Cossinia pinnata"
    ]
    

    Parsing one line at a time works fine for most inputs, but if your data contains literal newlines inside quoted fields, you will want to read the entire file as a single string (-s).

    $ cat input.csv
    Year,Make,Model,Description,Price
    1997,Ford,E350,"ac, abs, moon",3000.00
    1999,Chevy,"Venture ""Extended Edition""","",4900.00
    1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00
    1996,Jeep,Grand Cherokee,"MUST SELL!
    air, moon roof, loaded",4799.00
    $ jq -Rs 'parse_csv' input.csv
    [
      "Year",
      "Make",
      "Model",
      "Description",
      "Price"
    ]
    [
      "1997",
      "Ford",
      "E350",
      "ac, abs, moon",
      "3000.00"
    ]
    [
      "1999",
      "Chevy",
      "Venture \"Extended Edition\"",
      "",
      "4900.00"
    ]
    [
      "1999",
      "Chevy",
      "Venture \"Extended Edition, Very Large\"",
      "",
      "5000.00"
    ]
    [
      "1996",
      "Jeep",
      "Grand Cherokee",
      "MUST SELL!\nair, moon roof, loaded",
      "4799.00"
    ]
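
    The difference between the two invocations above comes down to how jq splits raw input. As a rough sketch (the helper name count_jq_inputs is hypothetical), the following counts how many separate inputs jq sees for the same record with and without -s:

```shell
#!/usr/bin/env bash

# One CSV record spanning two physical lines, because the quoted field
# contains a literal newline (taken from the sample input above).
record=$'1996,Jeep,Grand Cherokee,"MUST SELL!\nair, moon roof, loaded",4799.00\n'

# count_jq_inputs FLAGS: report how many inputs jq sees on stdin.
# The first jq emits one JSON string per input; the second slurps
# those strings into an array and prints its length.
count_jq_inputs() {
  jq "$1" -c '.' | jq -s 'length'
}

printf '%s' "$record" | count_jq_inputs -R    # one input per physical line
printf '%s' "$record" | count_jq_inputs -Rs   # the whole file as one string
```

    With -R alone the record is cut in two before any parser gets to see it, so only the slurped form gives a quote-aware parser a chance to keep the embedded newline.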
    
  • 2021-02-04 03:37

    The accepted answer uses jq to parse the input. This works, but splitting on the delimiter in jq doesn't handle quoting or escapes. For example, a CSV produced by Excel or similar tools may quote a field that contains the delimiter, like this:

    foo,"bar,baz",gaz
    

    This results in incorrect output, as jq will see 4 fields, not 3.

    One option is to use tab-separated values instead of comma (as long as your input data doesn't contain tabs!), along with the accepted answer.
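
    A minimal sketch of the tab-separated variant, assuming the data has already been exported with tabs as separators; the function name tsv_rows_to_json is hypothetical, and the row is the foo/bar,baz/gaz example from above:

```shell
#!/usr/bin/env bash

# tsv_rows_to_json: read tab-separated rows on stdin and print one JSON
# array of rows, each row being an array of field strings.
tsv_rows_to_json() {
  jq -Rnc '[inputs | select(length > 0) | split("\t")]'
}

# The comma inside the second field is no longer a separator,
# so the row is seen as 3 fields rather than 4.
printf 'foo\tbar,baz\tgaz\n' | tsv_rows_to_json
```

    The resulting arrays can then be reshaped with the same jq object construction used in the accepted answer.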

    Another option is to combine your tools, and use the best tool for each part: a CSV parser for reading the input and turning it into JSON, and jq for transforming the JSON into the target format.

    The Python-based csvkit will intelligently parse the CSV, and comes with a tool, csvjson, which does a much better job of turning the CSV into JSON. Its flat JSON output can then be piped through jq to convert it into the target form.

    With the data provided by the OP, for the desired output, this is as simple as:

    csvjson --no-header-row se.csv |
      jq '{occurrences: map({position: [.a, .b], taxo: {espece: .c}})}'
    

    Note that csvjson automatically detects ; as the delimiter and, without a header row in the input, assigns the JSON keys a, b, and c.

    The same applies to writing CSV files: csvkit can read a JSON array or newline-delimited JSON and intelligently output a CSV via in2csv.
