Getting duplicate keys in YAML using Python

别来无恙 提交于 2019-11-30 22:22:36

PyYAML will just silently overwrite the first entry, ruamel.yaml¹ will give a DuplicateKeyFutureWarning if used with the legacy API, and raise a DuplicateKeyError with the new API.

If you don't want to create a full Constructor for all types, overwriting the mapping constructor in SafeConstructor should do the job:

import sys
from ruamel.yaml import YAML
from ruamel.yaml.constructor import SafeConstructor

yaml_str = """\
build:
  step: 'step1'

build:
  step: 'step2'
"""


def construct_yaml_map(self, node):
    # test if there are duplicate node keys
    data = []
    yield data
    for key_node, value_node in node.value:
        key = self.construct_object(key_node, deep=True)
        val = self.construct_object(value_node, deep=True)
        data.append((key, val))


SafeConstructor.add_constructor(u'tag:yaml.org,2002:map', construct_yaml_map)
yaml = YAML(typ='safe')
data = yaml.load(yaml_str)
print(data)

which gives:

[('build', [('step', 'step1')]), ('build', [('step', 'step2')])]

However it doesn't seem necessary to make step: 'step1' into a list. The following will only create the list if there are duplicate items (could be optimised if necessary, by caching the result of the self.construct_object(key_node, deep=True)):

def construct_yaml_map(self, node):
    # test if there are duplicate node keys
    keys = set()
    for key_node, value_node in node.value:
        key = self.construct_object(key_node, deep=True)
        if key in keys:
            break
        keys.add(key)
    else:
        data = {}  # type: Dict[Any, Any]
        yield data
        value = self.construct_mapping(node)
        data.update(value)
        return
    data = []
    yield data
    for key_node, value_node in node.value:
        key = self.construct_object(key_node, deep=True)
        val = self.construct_object(value_node, deep=True)
        data.append((key, val))

which gives:

[('build', {'step': 'step1'}), ('build', {'step': 'step2'})]

Some points:

  • Probably needless to say, this will not work with YAML merge keys (<<: *xyz)
  • If you need ruamel.yaml's round-trip capabilities (yaml = YAML()) , that will require a more complex construct_yaml_map.
  • If you want to dump the output, you should instantiate a new YAML() instance for that, instead of re-using the "patched" one used for loading (it might work, this is just to be sure):

    yaml_out = YAML(typ='safe')
    yaml_out.dump(data, sys.stdout)
    

    which gives (with the first construct_yaml_map):

    - - build
      - - [step, step1]
    - - build
      - - [step, step2]
    
  • What doesn't work in PyYAML nor ruamel.yaml is yaml.load('file.yml'). If you don't want to open() the file yourself you can do:

    from pathlib import Path  # or: from ruamel.std.pathlib import Path
    yaml = YAML(typ='safe')
    yaml.load(Path('file.yml')
    

¹ Disclaimer: I am the author of that package.

If you can modify the input data very slightly, you should be able to do this by converting the single yaml-like file into multiple yaml documents. yaml documents can be in the same file if they're separated by --- on a line by itself, and you handily appear to have entries separated by two newlines next to each other:

with open('file.yml', 'r') as f:
    data = f.read()
    data = data.replace('\n\n', '\n---\n')

    for document in yaml.load_all(data):
        print(document)

Output:

{'build': {'step': 'step1'}}
{'build': {'step': 'step2'}}

You can override how pyyaml loads keys. For example, you could use a defaultdict with lists of values for each keys:

from collections import defaultdict
import yaml


def parse_preserving_duplicates(src):
    # We deliberately define a fresh class inside the function,
    # because add_constructor is a class method and we don't want to
    # mutate pyyaml classes.
    class PreserveDuplicatesLoader(yaml.loader.Loader):
        pass

    def map_constructor(loader, node, deep=False):
        """Walk the mapping, recording any duplicate keys.

        """
        mapping = defaultdict(list)
        for key_node, value_node in node.value:
            key = loader.construct_object(key_node, deep=deep)
            value = loader.construct_object(value_node, deep=deep)

            mapping[key].append(value)

        return mapping

    PreserveDuplicatesLoader.add_constructor(yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG, map_constructor)
    return yaml.load(src, PreserveDuplicatesLoader)
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!