Importing wrongly concatenated JSONs in Python

I have a text document that contains several thousand JSON strings in the form "{...}{...}{...}". This is not valid JSON itself, but each {...}

3 Answers

醉梦人生

2021-01-07 07:50
You can use the jq command-line utility to convert your input into valid JSON. Let's say you have the following input:

input.txt:
```
{"name":"Bob Dylan", "tags":"{Artist}{Singer}"}{"name": "Michael Jackson"}
```
You can use jq -s (slurp mode), which reads multiple JSON documents from the input and combines them into a single output array:
```
jq -s . input.txt
```
Gives you:
```
[
  {
    "name": "Bob Dylan",
    "tags": "{Artist}{Singer}"
  },
  {
    "name": "Michael Jackson"
  }
]
```
I've just realized that there are Python bindings for libjq, meaning you don't need the command line; you can use jq directly from Python.

https://github.com/mwilliamson/jq.py

However, I've not tried it so far. Let me give it a try :) ...

Update: The above library is nice, but it does not support the slurp mode so far.
暖寄归人

2021-01-07 07:51
Use the raw_decode method of json.JSONDecoder:
```
>>> import json
>>> d = json.JSONDecoder()
>>> x='{"name":"Bob Dylan", "tags":"{Artist}{Singer}"}{"name": "Michael Jackson"}'
>>> d.raw_decode(x)
({'tags': '{Artist}{Singer}', 'name': 'Bob Dylan'}, 47)
>>> x=x[47:]
>>> d.raw_decode(x)
({'name': 'Michael Jackson'}, 27)
```
raw_decode returns a 2-tuple: the first element is the decoded JSON value, and the second is the index in the string of the first character after the end of that JSON document (a character index, not a byte offset).

To loop until the end or until an invalid JSON element is encountered:
```
>>> while True:
...   try:
...     j,n = d.raw_decode(x)
...   except ValueError:
...     break
...   print(j)
...   x=x[n:]
... 
{'name': 'Bob Dylan', 'tags': '{Artist}{Singer}'}
{'name': 'Michael Jackson'}
```
When the loop breaks, inspecting x reveals whether the whole string was processed (x is empty) or a JSON syntax error was encountered (x holds the unparsed remainder).
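Slicing x after every document copies the rest of the string each time. As a minimal sketch of the same loop without the copying, raw_decode also accepts a start index as its second argument. The function name iter_concatenated_json is only illustrative, and unlike the loop above this version lets a syntax error propagate instead of stopping silently:

```python
import json

def iter_concatenated_json(text):
    """Yield each JSON document found back to back in `text`."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos == end:
            break
        # The second argument is the index to start decoding at; the
        # returned index points just past the decoded document.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

text = '{"name":"Bob Dylan", "tags":"{Artist}{Singer}"}{"name": "Michael Jackson"}'
for doc in iter_concatenated_json(text):
    print(doc)
```

Because the index is tracked instead of re-slicing, the cost stays roughly linear in the length of the input.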

With a very long file of short elements you might read a chunk into a buffer and apply the above loop, concatenating anything that's left over with the next chunk after the loop breaks.
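The chunked approach could look roughly like the sketch below. It assumes every top-level value is an object or array (a bare number split across two chunks could be decoded too early), and the name iter_json_stream and the default chunk size are made up for illustration:

```python
import io
import json

def iter_json_stream(fileobj, chunk_size=65536):
    """Yield JSON documents from a text-mode file read chunk by chunk."""
    decoder = json.JSONDecoder()
    buffer = ""
    while True:
        chunk = fileobj.read(chunk_size)
        buffer += chunk
        pos = 0
        while True:
            # Skip whitespace between documents.
            while pos < len(buffer) and buffer[pos].isspace():
                pos += 1
            if pos == len(buffer):
                break
            try:
                obj, pos = decoder.raw_decode(buffer, pos)
            except ValueError:
                # Either the document is incomplete (more data needed)
                # or it is invalid; the end-of-file check below decides.
                break
            yield obj
        # Keep only the unparsed tail for the next round.
        buffer = buffer[pos:]
        if not chunk:
            if buffer:
                raise ValueError("unparseable trailing data: %r" % buffer[:50])
            return

# Usage with an in-memory file and a deliberately tiny chunk size.
sample = io.StringIO('{"name": "Bob Dylan"}{"name": "Michael Jackson"}')
for doc in iter_json_stream(sample, chunk_size=8):
    print(doc)
```

A partially read document is simply retried once more data arrives, which is cheap when the elements are short, as assumed here; one very large document would be re-parsed on every chunk.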
日久生厌

2021-01-07 08:07
You need to write a parser; I don't think a regex can help you with this:
```
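import json

def get_dicts(file_text):
    data = ""      # text of the object currently being collected
    curlies = []   # stack of unmatched opening braces
    for letter in file_text:
        data += letter
        if letter == "{":
            curlies.append(letter)
        elif letter == "}":
            curlies.pop()  # remove the matching "{"
            if not curlies:
                # Back at depth zero: a complete top-level object
                yield json.loads(data)
                data = ""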
```
Note that this does not solve the problem that {name:"bob"} is not valid JSON, whereas {"name":"bob"} is.

This will also break if a string contains unbalanced braces; for example, {"name":"{{}}}"} would break it.
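One way to make the brace-counting idea tolerate braces inside strings is to track whether the scanner is currently inside a string literal, including backslash escapes. This is only a sketch under the assumption that each individual {...} is valid JSON; the function name split_top_level_objects is hypothetical:

```python
import json

def split_top_level_objects(text):
    """Yield each top-level {...} in `text`, ignoring braces in strings."""
    depth = 0          # current brace nesting level
    in_string = False  # True while inside a "..." literal
    escaped = False    # True right after a backslash inside a string
    start = None       # index where the current top-level object began
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}":
            if depth == 0:
                raise ValueError("unexpected closing brace at index %d" % i)
            depth -= 1
            if depth == 0:
                yield json.loads(text[start:i + 1])

for obj in split_top_level_objects('{"name":"{{}}}"}{"name": "bob"}'):
    print(obj)
```

It still does nothing for unquoted keys such as {name:"bob"}; json.loads will reject those.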

Really, judging by your example, your JSON is so broken that your best bet is probably to edit it by hand and fix the code that generates it. If that is not feasible, you may need to write a more complex parser using pylex or some other grammar library (effectively writing your own language parser).