How to remove not useful elements from a dataset

Submitted by 自作多情 on 2020-01-15 12:17:12

Question


I have a dataset that looks like the following:

 {0: {"address": 0,
         "ctag": "TOP",
         "deps": defaultdict(<class "list">, {"ROOT": [6, 51]}),
         "feats": "",
         "head": "",
         "lemma": "",
         "rel": "",
         "tag": "TOP",
         "word": ""},
     1: {"address": 1,
         "ctag": "Ne",
         "deps": defaultdict(<class "list">, {"NPOSTMOD": [2]}),
         "feats": "_",
         "head": 6,
         "lemma": "اشرف",
         "rel": "SBJ",
         "tag": "Ne",
         "word": "اشرف"},

I want to remove the "deps": ... entries from this dataset. I tried the code below, but it does not work, because the value of "deps" differs in each element of the dict.

import re
import simplejson as simplejson

with open("../data/cleaned.txt", 'r') as fp:
    lines = fp.readlines()
    k = str(lines)
    a = re.sub(r'\d:', '', k) # this is for removing numbers like `1:{..`
    json_data = simplejson.dumps(a)
    #print(json_data)
    n = eval(k.replace('defaultdict(<class "list">', 'list'))
    print(n)

Answer 1:


The correct way would be to fix the code that produced the text file. This defaultdict(<class "list">, {"ROOT": [6, 51]}) is a hint that it used a simple repr when a smarter format was required.
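
As a minimal sketch of what such a fix could look like, assuming the producing code holds the structure in memory as a dict of dicts (the `nodes` variable below is a hypothetical stand-in, not the original code), writing it with `json.dumps` instead of printing its repr would give a file that loads back cleanly:

```python
import json
from collections import defaultdict

# Hypothetical stand-in for the structure that was originally written with repr()
nodes = {
    0: {"address": 0, "tag": "TOP", "deps": defaultdict(list, {"ROOT": [6, 51]})},
    1: {"address": 1, "tag": "Ne", "deps": defaultdict(list, {"NPOSTMOD": [2]})},
}

# json.dumps treats a defaultdict like a plain dict and converts int keys to strings
text = json.dumps(nodes, ensure_ascii=False, indent=1)

# Loading the text back needs no regex work; "deps" can then simply be popped
restored = json.loads(text)
for node in restored.values():
    node.pop("deps", None)
```

Note that the keys come back as strings ("0", "1") because JSON object keys are always text.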

The following is just a poor man's workaround if the real fix is not possible.

Getting rid of "deps": ... is easy: it is enough to read the file one line at a time and discard any line starting with "deps" (ignoring leading whitespace). But that is not enough, because the file contains numeric keys, while JSON insists on keys being text only. So the numeric keys must be identified and quoted.

This should make it possible to load the file:

import json
import re

with open("../data/cleaned.txt", 'r') as fp:
    # skip the "deps" lines and quote bare numeric keys (digits followed by a colon)
    k = ''.join(re.sub(r'(?<!\w)(\d+)(?=\s*:)', r'"\1"', line)
        for line in fp if not line.strip().startswith('"deps"'))

# remove a possible trailing comma
k = re.sub(r',\s*$', '', k)

# uncomment if the file does not contain the last }
# k += '}'

js = json.loads(k)
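
To check the workaround without touching the real file, here is a small self-contained sketch run against an in-memory sample. The `sample` text and the helper name `load_without_deps` are illustrative assumptions, and the brace-balancing step assumes no braces appear inside string values:

```python
import io
import json
import re

# Hypothetical sample mimicking the file contents shown in the question
sample = '''{0: {"address": 0,
     "deps": defaultdict(<class "list">, {"ROOT": [6, 51]}),
     "head": "",
     "word": ""},
 1: {"address": 1,
     "deps": defaultdict(<class "list">, {"NPOSTMOD": [2]}),
     "head": 6,
     "word": "اشرف"},
'''

def load_without_deps(fp):
    """Drop the "deps" lines, quote numeric keys, and parse the rest as JSON."""
    kept = []
    for line in fp:
        if line.strip().startswith('"deps"'):
            continue  # discard the line that holds the defaultdict repr
        # quote digits that are used as keys, i.e. directly followed by a colon
        kept.append(re.sub(r'(?<!\w)(\d+)(?=\s*:)', r'"\1"', line))
    text = re.sub(r',\s*$', '', ''.join(kept))  # strip a trailing comma
    # close any braces left open (assumes no braces inside string values)
    text += '}' * (text.count('{') - text.count('}'))
    return json.loads(text)

result = load_without_deps(io.StringIO(sample))
```

Because only digits followed by a colon are quoted, numeric values such as "head": 6 stay integers after loading.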



Answer 2:


Try this (it assumes the file is valid JSON, which the sample shown in the question is not):

import json
with open("../data/cleaned.txt", 'r') as fp:
    data = json.load(fp)
    for key, value in data.items():
        value.pop("deps", None)

Now data no longer contains the "deps" entries. If you want to dump the records to a new file:

with open("output.json", 'w') as out:
    json.dump(data, out)



Answer 3:


How about

#!/usr/bin/env python
# -*- coding: utf-8 -*-

data = {0: {"address": 0,
            "ctag": "TOP",
            "deps": 'something',
            "feats": "",
            "head": "",
            "lemma": "",
            "rel": "",
            "tag": "TOP",
            "word": ""},
        1: {"address": 1,
            "ctag": "Ne",
            "deps": 'something',
            "feats": "_",
            "head": 6,
            "lemma": "اشرف",
            "rel": "SBJ",
            "tag": "Ne",
            "word": "اشرف"}}

for value in data.values():
    if 'deps' in value:
        del value['deps']
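
If the original dict should stay untouched, a non-mutating variant is also possible. This is only a sketch with a shortened, hypothetical `data` sample; it builds a new dict with nested comprehensions instead of deleting in place:

```python
# Shortened, hypothetical sample in the same shape as the data above
data = {
    0: {"address": 0, "deps": "something", "tag": "TOP"},
    1: {"address": 1, "deps": "something", "tag": "Ne"},
}

# Build a copy of every inner dict without the "deps" field
cleaned = {
    key: {field: val for field, val in value.items() if field != "deps"}
    for key, value in data.items()
}
```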


Source: https://stackoverflow.com/questions/55042037/how-to-remove-not-useful-elements-from-a-dataset
