Using jq how can I split a very large JSON file into multiple files, each a specific quantity of objects?

半腔热情 提交于 2019-11-26 17:08:30

问题


I have a large JSON file with I'm guessing 4 million objects. Each top level has a few levels nested inside. I want to split that into multiple files of 10000 top level objects each (retaining the structure inside each). jq should be able to do that right? I'm not sure how.

So data like this:

[{
  "id": 1,
  "user": {
    "name": "Nichols Cockle",
    "email": "ncockle0@tmall.com",
    "address": {
      "city": "Turt",
      "state": "Thị Trấn Yên Phú"
    }
  },
  "product": {
    "name": "Lychee - Canned",
    "code": "36987-1526"
  }
}, {
  "id": 2,
  "user": {
    "name": "Isacco Scrancher",
    "email": "iscrancher1@aol.com",
    "address": {
      "city": "Likwatang Timur",
      "state": "Biharamulo"
    }
  },
  "product": {
    "name": "Beer - Original Organic Lager",
    "code": "47993-200"
  }
}, {
  "id": 3,
  "user": {
    "name": "Elga Sikora",
    "email": "esikora2@statcounter.com",
    "address": {
      "city": "Wenheng",
      "state": "Piedra del Águila"
    }
  },
  "product": {
    "name": "Parsley - Dried",
    "code": "36987-1632"
  }
}, {
  "id": 4,
  "user": {
    "name": "Andria Keatch",
    "email": "akeatch3@salon.com",
    "address": {
      "city": "Arras",
      "state": "Iracemápolis"
    }
  },
  "product": {
    "name": "Wine - Segura Viudas Aria Brut",
    "code": "51079-385"
  }
}, {
  "id": 5,
  "user": {
    "name": "Dara Sprowle",
    "email": "dsprowle4@slate.com",
    "address": {
      "city": "Huatai",
      "state": "Kaduna"
    }
  },
  "product": {
    "name": "Pork - Hock And Feet Attached",
    "code": "0054-8648"
  }
}]

Where this is a single complete object:

{
  "id": 1,
  "user": {
    "name": "Nichols Cockle",
    "email": "ncockle0@tmall.com",
    "address": {
      "city": "Turt",
      "state": "Thị Trấn Yên Phú"
    }
  },
  "product": {
    "name": "Lychee - Canned",
    "code": "36987-1526"
  }
}

And each file would be a specified number of objects like that.


回答1:


[EDIT: This answer has been revised in accordance with the revision to the question.]

The key to using jq to solve the problem is the -c command-line option, which produces output in JSON-Lines format (i.e., in the present case, one object per line). You can then use a tool such as awk or split to distribute those lines amongst several files.

If the file is not too big, then the simplest would be to start the pipeline with:

jq -c '.[]' INPUTFILE

If the file is too big to fit comfortably in memory, then you could use jq's streaming parser, like so:

jq -cn --stream 'fromstream(1|truncate_stream(inputs))'

For further discussion about the streaming parser, see e.g. the relevant section in the jq FAQ: https://github.com/stedolan/jq/wiki/FAQ#streaming-json-parser

Partitioning

For different approaches to partitioning the output produced in the first step, see for example How to split a large text file into smaller files with equal number of lines?

If it is required that each of the output files be an array of objects, then I'd probably use awk to perform both the partitioning and the re-constitution in one step, but there are many other reasonable approaches.

If the input is a sequence of JSON objects

For reference, if the original file consists of a stream or sequence of JSON objects, then the appropriate invocation would be:

jq -n -c inputs INPUTFILE

Using inputs in this manner allows arbitrarily many objects to be processed efficiently.



来源:https://stackoverflow.com/questions/49808581/using-jq-how-can-i-split-a-very-large-json-file-into-multiple-files-each-a-spec

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!