Question
I have a JSON Lines (jsonl) file in which each line contains both a sentence and the tokens found in that sentence. I want to extract the tokens from every line of the file, but my loop only returns the tokens from the last line.
This is the input:
{"text":"This is the first sentence.","_input_hash":2083129218,"_task_hash":-536378640,"spans":[],"meta":{"score":0.5,"pattern":65},"answer":"accept","tokens":[
{"text":"This","id":0},
{"text":"is","id":1},
{"text":"the","id":2},
{"text":"first","id":3},
{"text":"sentence","id":4},
{"text":".","id":5}]}
{"text":"This is the second sentence.","_input_hash":2083129218,"_task_hash":-536378640,"spans":[],"meta":{"score":0.5,"pattern":65},"answer":"accept","tokens":[
{"text":"This","id":0},
{"text":"is","id":1},
{"text":"the","id":2},
{"text":"second","id":3},
{"text":"sentence","id":4},
{"text":".","id":5}]}
I have tried running the following code:
import jsonlines

with jsonlines.open('path/to/file') as reader:
    for obj in reader:
        data = obj['tokens']  # just extract the tokens
        data = [(i['text'], i['id']) for i in data]  # elements from the tokens
data
The actual result:
[('This', 0), ('is', 1), ('the', 2), ('second', 3), ('sentence', 4), ('.', 5)]
What I want instead is the tokens from every line, not only the last one.
Additional question
Some tokens contain a "label" instead of an "id". How could I incorporate that into the code? An example would be:
{"text":"This is the first sentence.","_input_hash":2083129218,"_task_hash":-536378640,"spans":[],"meta":{"score":0.5,"pattern":65},"answer":"accept","tokens":[
{"text":"This","id":0},
{"text":"is","id":1},
{"text":"the","id":2},
{"text":"first","id":3},
{"text":"sentence","id":4},
{"text":".","id":5}]}
{"text":"This is coded in python.","_input_hash":2083129218,"_task_hash":-536378640,"spans":[],"meta":{"score":0.5,"pattern":65},"answer":"accept","tokens":[
{"text":"This","id":0},
{"text":"is","id":1},
{"text":"coded","id":2},
{"text":"in","id":3},
{"text":"python","label":"Programming"},
{"text":".","id":5}]}
Answer 1:
Some issues/changes in the code:
You are reassigning the variable data on every iteration of the loop, so you only see the result for the last JSON line. Instead, you want to extend a list on each iteration.
You want to use enumerate on the reader iterator to get the line index as the first item of each tuple.
The code then changes to
import jsonlines

data = []
# Open the JSON lines file
with jsonlines.open('file.txt') as reader:
    # Iterate over each line of the reader via enumerate
    for idx, obj in enumerate(reader):
        # Append this line's tokens to the result
        data.extend([(idx+1, i['text'], i['id']+1) for i in obj['tokens']])  # elements from the tokens
print(data)
Or, more compactly, with a double for-loop inside the list comprehension itself:
import jsonlines

# Open the file, iterate over the tokens and make the tuples
with jsonlines.open('file.txt') as reader:
    result = [(idx+1, i['text'], i['id']+1) for idx, obj in enumerate(reader) for i in obj['tokens']]
print(result)
The output will be
[
(1, 'This', 1),
(1, 'is', 2),
(1, 'the', 3),
(1, 'first', 4),
(1, 'sentence', 5),
(1, '.', 6),
(2, 'This', 1),
(2, 'is', 2),
(2, 'the', 3),
(2, 'second', 4),
(2, 'sentence', 5),
(2, '.', 6)
]
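For the additional question, note that i['id'] raises a KeyError on tokens that only carry a "label". One possible way around that is sketched below, assuming you want the label string to take the place of the numeric ID. The helper name token_rows and the in-memory records list are illustrative stand-ins, not part of the original question; the records are parsed with the standard json module so the sketch runs without the input file.

```python
import json

def token_rows(records):
    """Build (sentence_no, text, id_or_label) tuples from parsed JSON lines.

    Hypothetical helper: tokens with an "id" get a 1-based ID, tokens
    without one fall back to their "label" (or None if neither exists).
    """
    rows = []
    for sentence_no, obj in enumerate(records, start=1):
        for token in obj["tokens"]:
            if "id" in token:
                value = token["id"] + 1
            else:
                value = token.get("label")
            rows.append((sentence_no, token["text"], value))
    return rows

# Stand-in for the objects yielded by jsonlines.open(...)
records = [
    json.loads('{"text": "This is coded in python.", "tokens": ['
               '{"text": "in", "id": 3}, '
               '{"text": "python", "label": "Programming"}]}'),
]
print(token_rows(records))
```

With the real file, the reader itself can be passed in place of records, for example data = token_rows(reader) inside the with jsonlines.open('file.txt') as reader: block.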
Answer 2:
import jsonlines

# Write one row per token: sentence number, word, 1-based token ID
with open('data.csv', 'w') as f, jsonlines.open('path/to/file') as reader:
    print('Sentence', 'Word', 'ID', file=f)
    for sentence_no, obj in enumerate(reader):
        for i in obj['tokens']:
            print(sentence_no+1, i['text'], i['id']+1, file=f)
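The loop above writes space-separated fields, so a token that contains a space or a comma would be ambiguous in data.csv. A hedged alternative is the standard csv module, which quotes such fields. The sketch below writes to an in-memory buffer; the helper name write_token_csv and the sample records are assumptions standing in for the jsonlines reader.

```python
import csv
import io

def write_token_csv(records, out):
    """Write one CSV row per token: sentence number, word, 1-based ID."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Sentence", "Word", "ID"])
    for sentence_no, obj in enumerate(records, start=1):
        for token in obj["tokens"]:
            writer.writerow([sentence_no, token["text"], token["id"] + 1])

# Sample records standing in for the objects read from the JSON lines file
records = [
    {"tokens": [{"text": "This", "id": 0}, {"text": ",", "id": 1}]},
    {"tokens": [{"text": "is", "id": 0}]},
]
buffer = io.StringIO()
write_token_csv(records, buffer)
print(buffer.getvalue())
```

To write a real file, open it with open('data.csv', 'w', newline='') and pass the file object as out.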
Source: https://stackoverflow.com/questions/56314313/how-to-extract-elements-from-each-line-in-a-jsonline-file