Can I speed up YAML?

耶瑟儿~ 2020-12-29 05:22

I made a little test case to compare YAML and JSON speed:

import json
import yaml
from datetime import datetime
from random import randint

NB_ROW=1024

pri         


        
3 Answers
  • 2020-12-29 05:46

    For reference, I compared a couple of human-readable formats, and Python's yaml reader is indeed by far the slowest. (Note the log scaling in the plot.) If you're looking for speed, you want one of the JSON loaders, e.g., orjson.


    Code to reproduce the plot:

    import numpy
    import perfplot
    
    import json
    import ujson
    import orjson
    import toml
    import yaml
    from yaml import Loader, CLoader
    import pandas
    
    
    def setup(n):
        numpy.random.seed(0)
        data = numpy.random.rand(n, 3)
    
        with open("out.yml", "w") as f:
            yaml.dump(data.tolist(), f)
    
        with open("out.json", "w") as f:
            json.dump(data.tolist(), f, indent=4)
    
        with open("out.dat", "w") as f:
            numpy.savetxt(f, data)
    
        with open("out.toml", "w") as f:
            toml.dump({"data": data.tolist()}, f)
    
    
    def yaml_python(arr):
        with open("out.yml", "r") as f:
            out = yaml.load(f, Loader=Loader)
        return out
    
    
    def yaml_c(arr):
        with open("out.yml", "r") as f:
            out = yaml.load(f, Loader=CLoader)
        return out
    
    
    def json_load(arr):
        with open("out.json", "r") as f:
            out = json.load(f)
        return out
    
    
    def ujson_load(arr):
        with open("out.json", "r") as f:
            out = ujson.load(f)
        return out
    
    
    def orjson_load(arr):
        with open("out.json", "rb") as f:
            out = orjson.loads(f.read())
        return out
    
    
    def loadtxt(arr):
        with open("out.dat", "r") as f:
            out = numpy.loadtxt(f)
        return out
    
    
    def pandas_read(arr):
        out = pandas.read_csv("out.dat", header=None, sep=" ")
        return out.values
    
    
    def toml_load(arr):
        with open("out.toml", "r") as f:
            out = toml.load(f)
        return out["data"]
    
    
    perfplot.save(
        "out.png",
        setup=setup,
        kernels=[
            yaml_python,
            yaml_c,
            json_load,
            loadtxt,
            pandas_read,
            toml_load,
            ujson_load,
            orjson_load,
        ],
        n_range=[2 ** k for k in range(18)],
    )
    
  • 2020-12-29 05:48

    Yes, I also noticed that JSON is way faster, so a reasonable approach would be to convert YAML to JSON first. If you don't mind Ruby, you can get a big speedup and ditch the yaml install altogether:

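    import json
    import subprocess

    def load_yaml_file(fn):
        # Let Ruby parse the YAML and print it as JSON; the file name is passed
        # as a script argument (ARGV[0]) so it needs no shell quoting.
        ruby = "puts YAML.load_file(ARGV[0]).to_json"
        out = subprocess.check_output(["ruby", "-ryaml", "-rjson", "-e", ruby, fn])
        return json.loads(out)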
    

    Here is a comparison for 100K records:

    load_yaml_file: 0.95 s
    yaml.load: 7.53 s
    

    And for 1M records:

    load_yaml_file: 11.55 s
    yaml.load: 77.08 s
    

    If you insist on using yaml.load anyway, remember to put it in a virtualenv to avoid conflicts with other software.
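
    If Ruby is not an option, the same convert-to-JSON idea can be sketched in pure Python as a one-time cache. This is only an illustration: load_yaml_cached and the ".json" sidecar naming are hypothetical, and it assumes the data contains only JSON-representable types (string keys, no dates or custom tags). The first call still pays the full yaml cost; later calls read the JSON copy until the YAML file changes.

```python
import json
import os

import yaml  # PyYAML


def load_yaml_cached(path):
    """Load a YAML file, reusing a JSON sidecar copy when it is up to date."""
    cache = path + ".json"  # hypothetical naming convention for the cache file

    # Reuse the cache only if it is at least as new as the YAML source.
    if os.path.exists(cache) and os.path.getmtime(cache) >= os.path.getmtime(path):
        with open(cache, "r") as f:
            return json.load(f)

    # Slow path: parse the YAML once, then store the result as JSON.
    with open(path, "r") as f:
        data = yaml.safe_load(f)
    with open(cache, "w") as f:
        json.dump(data, f)
    return data
```

    Whether this pays off depends on how often the file is read relative to how often it changes.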

  • 2020-12-29 05:59

    You've probably noticed that Python's syntax for data structures is very similar to JSON's syntax.

    What's happening is that Python's json library encodes Python's built-in data types directly into text chunks, replacing ' with " and dropping a , here and there (to oversimplify a bit).

    On the other hand, pyyaml has to construct a whole representation graph before serialising it into a string.

    The same kind of stuff has to happen backwards when loading.

    The only way to speed up yaml.load() would be to write a new Loader, but I doubt it would be a huge leap in performance, unless you're willing to write your own single-purpose sort-of YAML parser, taking the following comment into consideration:

    YAML builds a graph because it is a general-purpose serialisation format that is able to represent multiple references to the same object. If you know no object is repeated and only basic types appear, you can use a json serialiser; the output will still be valid YAML.
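
    Both halves of that comment can be checked in a few lines. The sketch below uses hypothetical sample data. It first shows PyYAML carrying a shared reference through an anchor/alias pair, which JSON cannot do, and then shows that JSON text for plain data is accepted by a YAML parser.

```python
import json

import yaml  # PyYAML

# 1. A shared reference: the same list object appears under two keys.
shared = [1, 2]
doc = {"a": shared, "b": shared}

as_yaml = yaml.safe_dump(doc)  # emits an anchor (&id001) and an alias (*id001)
from_yaml = yaml.safe_load(as_yaml)
same_after_yaml = from_yaml["a"] is from_yaml["b"]  # identity survives the round trip

from_json = json.loads(json.dumps(doc))
same_after_json = from_json["a"] is from_json["b"]  # JSON wrote two separate copies

# 2. Plain data with no repeated objects: the JSON text is also valid YAML.
records = [{"id": i, "name": f"row-{i}", "score": i / 4} for i in range(3)]
as_json = json.dumps(records)
json_text_parsed_as_yaml = yaml.safe_load(as_json)

print(same_after_yaml, same_after_json, json_text_parsed_as_yaml == records)
```

    Parsers can still differ on edge cases, so a round-trip check on your own data is advisable before switching serialisers.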

    -- UPDATE

    What I said before remains true, but if you're running Linux there's a way to speed up YAML parsing. By default, Python's yaml module uses its pure-Python parser. You have to tell it explicitly that you want to use PyYAML's C parser.

    You can do it this way:

    import yaml
    from yaml import CLoader as Loader, CDumper as Dumper
    
    dump = yaml.dump(dummy_data, fh, encoding='utf-8', default_flow_style=False, Dumper=Dumper)
    data = yaml.load(fh, Loader=Loader)
    

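    In order to do so, you need the LibYAML C library and its development headers installed. On Debian or Ubuntu these are in the libyaml-dev package (not yaml-cpp-dev, later renamed libyaml-cpp-dev, which is an unrelated C++ library); install it for instance with apt-get:

    $ apt-get install libyaml-dev
    

    PyYAML must be built with its LibYAML bindings as well, but based on your output that's already the case.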
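
    Since the C classes only exist when PyYAML was built against LibYAML, a common pattern is to import them with a fallback to the pure-Python ones. The sketch below assumes PyYAML is installed and uses the safe variants of the loader and dumper; dummy_data here is hypothetical sample data.

```python
import yaml

# Prefer the LibYAML-backed classes; fall back to the pure-Python ones
# if this PyYAML install was built without the C extension.
try:
    from yaml import CSafeLoader as SafeLoader, CSafeDumper as SafeDumper
except ImportError:
    from yaml import SafeLoader, SafeDumper

# Hypothetical sample data, for illustration only.
dummy_data = {"rows": [[1, 2.5, "a"], [2, 3.5, "b"]]}

# Without a stream argument, yaml.dump returns the document as a string.
text = yaml.dump(dummy_data, Dumper=SafeDumper, default_flow_style=False)
data = yaml.load(text, Loader=SafeLoader)

# PyYAML records whether the C bindings were importable.
print("LibYAML bindings available:", yaml.__with_libyaml__)
```

    With this pattern the same script still runs on machines without LibYAML, only more slowly.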

    I can't test it right now because I'm running OS X and brew has some trouble installing yaml-cpp-dev, but the PyYAML documentation is pretty clear that performance will be much better.
