How to remove not useful elements from a dataset

Question

I have a dataset that looks like the following:

 {0: {"address": 0,
         "ctag": "TOP",
         "deps": defaultdict(<class "list">, {"ROOT": [6, 51]}),
         "feats": "",
         "head": "",
         "lemma": "",
         "rel": "",
         "tag": "TOP",
         "word": ""},
     1: {"address": 1,
         "ctag": "Ne",
         "deps": defaultdict(<class "list">, {"NPOSTMOD": [2]}),
         "feats": "_",
         "head": 6,
         "lemma": "اشرف",
         "rel": "SBJ",
         "tag": "Ne",
         "word": "اشرف"},

I want to remove the "deps": ... entries from this dataset. I tried the code below, but it does not work, because the value of "deps" differs in each element of the dict.

import re
import simplejson as simplejson

with open("../data/cleaned.txt", 'r') as fp:
    lines = fp.readlines()
    k = str(lines)
    a = re.sub(r'\d:', '', k) # this is for removing numbers like `1:{..`
    json_data = simplejson.dumps(a)
    #print(json_data)
    n = eval(k.replace('defaultdict(<class "list">', 'list'))
    print(n)

Answer 1:

The correct way would be to fix the code that produced the text file. The defaultdict(<class "list">, {"ROOT": [6, 51]}) in it is a hint that the file was written with a plain repr when a proper serialization format was required.
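As a rough illustration of what fixing the producer could look like, here is a minimal sketch that drops "deps" and writes real JSON instead of a repr. The function name to_json_text and the shortened sample nodes are hypothetical; the code that actually produced the file is not shown in the question.

```python
import json
from collections import defaultdict


def to_json_text(nodes, drop=("deps",)):
    """Serialize the node dict as JSON, leaving out the keys listed in `drop`."""
    cleaned = {
        address: {k: v for k, v in node.items() if k not in drop}
        for address, node in nodes.items()
    }
    # json converts the integer keys 0, 1, ... to the strings "0", "1", ...
    return json.dumps(cleaned, ensure_ascii=False, indent=2)


# Hypothetical, shortened stand-in for the real data.
nodes = {
    0: {"address": 0, "deps": defaultdict(list, {"ROOT": [6, 51]}), "tag": "TOP"},
    1: {"address": 1, "deps": defaultdict(list, {"NPOSTMOD": [2]}), "head": 6, "word": "اشرف"},
}

text = to_json_text(nodes)
restored = json.loads(text)
```

A file written this way can be read back with json.load directly, without any regular expressions.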

The following is just a poor man's workaround if the real fix is not possible.

Getting rid of "deps": ... is easy: it is enough to read the file one line at a time and discard any line starting with "deps" (ignoring leading whitespace). But that alone is not enough, because the file contains numeric keys, while JSON requires keys to be strings. So the numeric keys must be identified and quoted.

This should allow the file to be loaded:

import json
import re

with open("../data/cleaned.txt", 'r') as fp:
    # drop the "deps" lines and quote the numeric keys (digits followed by a colon)
    k = ''.join(re.sub(r'(?<!\w)(\d+)(?=\s*:)', r'"\1"', line)
        for line in fp if not line.strip().startswith('"deps"'))

# remove a trailing comma, if any
k = re.sub(r',\s*$', '', k)

# uncomment if the file does not contain the last }
# k += '}'

js = json.loads(k)
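To try the workaround without the file on disk, the same steps can be wrapped in a function and run against an in-memory sample. This is only a sketch: load_without_deps and SAMPLE are hypothetical names, the pattern quotes only digits that are followed by a colon (an assumption about how the keys appear), and the brace count used to decide whether to append the final } is a heuristic that would break if string values contained braces.

```python
import io
import json
import re

# Shortened, hypothetical sample in the same shape as the question's file.
SAMPLE = '''{0: {"address": 0,
     "deps": defaultdict(<class "list">, {"ROOT": [6, 51]}),
     "head": "",
     "word": ""},
 1: {"address": 1,
     "deps": defaultdict(<class "list">, {"NPOSTMOD": [2]}),
     "head": 6,
     "word": "اشرف"},
'''


def load_without_deps(fp):
    """Parse the repr-like text from fp, skipping every "deps" line."""
    kept = (line for line in fp if not line.strip().startswith('"deps"'))
    # Quote numeric keys only: digits followed by a colon.
    text = ''.join(re.sub(r'(?<!\w)(\d+)(?=\s*:)', r'"\1"', line) for line in kept)
    # Remove a trailing comma, if any.
    text = re.sub(r',\s*$', '', text)
    # Heuristic: close the outer dict if its final } is missing.
    if text.count('{') > text.count('}'):
        text += '}'
    return json.loads(text)


data = load_without_deps(io.StringIO(SAMPLE))
```

Passing an open file object instead of io.StringIO works the same way, since the function only iterates over lines.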

Answer 2:

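Try this (it assumes the file is valid JSON, which the sample in the question is not, because of its numeric keys and the defaultdict repr):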

import json
with open("../data/cleaned.txt", 'r') as fp:
    data = json.load(fp)
    for key, value in data.items():
        value.pop("deps", None)

Now data no longer contains deps. If you want to dump the records to a new file:

# json.dump expects an open file object, not a file name
with open("output.json", 'w') as out:
    json.dump(data, out)

Answer 3:

How about

#!/usr/bin/env python
# -*- coding: utf-8 -*-

data = {0: {"address": 0,
            "ctag": "TOP",
            "deps": 'something',
            "feats": "",
            "head": "",
            "lemma": "",
            "rel": "",
            "tag": "TOP",
            "word": ""},
        1: {"address": 1,
            "ctag": "Ne",
            "deps": 'something',
            "feats": "_",
            "head": 6,
            "lemma": "اشرف",
            "rel": "SBJ",
            "tag": "Ne",
            "word": "اشرف"}}

for value in data.values():
    if 'deps' in value:
        del value['deps']
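If the original dictionary should stay untouched, a variant of the same idea builds a new dictionary instead of deleting in place. This is a minimal sketch; without_key and the shortened sample are hypothetical names, and the copy is shallow, so the remaining values are shared with the original.

```python
def without_key(records, key="deps"):
    """Return a copy of records whose inner dicts lack `key`."""
    return {
        outer: {k: v for k, v in inner.items() if k != key}
        for outer, inner in records.items()
    }


# Shortened, hypothetical sample in the same shape as `data` above.
sample = {
    0: {"address": 0, "deps": "something", "tag": "TOP"},
    1: {"address": 1, "deps": "something", "head": 6},
}

cleaned = without_key(sample)
```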

Source: https://stackoverflow.com/questions/55042037/how-to-remove-not-useful-elements-from-a-dataset

Tags

python

json

preprocessor