Can I speedup YAML?

前端未结

关注

 3  1993

I made a little test case to compare YAML and JSON speed :

import json
import yaml
from datetime import datetime
from random import randint

NB_ROW=1024

pri


                      
              相关标签:


      
      
        
          3条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  滥情空心        
                
              
                            
                2020-12-29 05:46
              
            
            
                                                                       
For reference, I compared a couple of human-readable formats and indeed Python's yaml reader is by far the slowest. (Note the log-scaling in the below plot.) If you're looking for speed, you want one of the JSON loaders, e.g., orjson:


Code to reproduce the plot:
import numpy
import perfplot

import json
import ujson
import orjson
import toml
import yaml
from yaml import Loader, CLoader
import pandas


def setup(n):
    numpy.random.seed(0)
    data = numpy.random.rand(n, 3)

    with open("out.yml", "w") as f:
        yaml.dump(data.tolist(), f)

    with open("out.json", "w") as f:
        json.dump(data.tolist(), f, indent=4)

    with open("out.dat", "w") as f:
        numpy.savetxt(f, data)

    with open("out.toml", "w") as f:
        toml.dump({"data": data.tolist()}, f)


def yaml_python(arr):
    with open("out.yml", "r") as f:
        out = yaml.load(f, Loader=Loader)
    return out


def yaml_c(arr):
    with open("out.yml", "r") as f:
        out = yaml.load(f, Loader=CLoader)
    return out


def json_load(arr):
    with open("out.json", "r") as f:
        out = json.load(f)
    return out


def ujson_load(arr):
    with open("out.json", "r") as f:
        out = ujson.load(f)
    return out


def orjson_load(arr):
    with open("out.json", "rb") as f:
        out = orjson.loads(f.read())
    return out


def loadtxt(arr):
    with open("out.dat", "r") as f:
        out = numpy.loadtxt(f)
    return out


def pandas_read(arr):
    out = pandas.read_csv("out.dat", header=None, sep=" ")
    return out.values


def toml_load(arr):
    with open("out.toml", "r") as f:
        out = toml.load(f)
    return out["data"]


perfplot.save(
    "out.png",
    setup=setup,
    kernels=[
        yaml_python,
        yaml_c,
        json_load,
        loadtxt,
        pandas_read,
        toml_load,
        ujson_load,
        orjson_load,
    ],
    n_range=[2 ** k for k in range(18)],
)

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  执念已碎        
                
              
                            
                2020-12-29 05:48
              
            
            
                                                                       
Yes, I also noticed that JSON is way faster. So a reasonable approach would be to convert YAML to JSON first. If you don't mind ruby, then you can get a big speedup and ditch the yaml install altogether:

import commands, json
def load_yaml_file(fn):
    ruby = "puts YAML.load_file('%s').to_json" % fn
    j = commands.getstatusoutput('ruby -ryaml -rjson -e "%s"' % ruby)
    return json.loads(j[1])


Here is a comparison for 100K records:

load_yaml_file: 0.95 s
yaml.load: 7.53 s


And for 1M records:

load_yaml_file: 11.55 s
yaml.load: 77.08 s


If you insist on using yaml.load anyway, remember to put it in a virtualenv to avoid conflicts with other software.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  囚心锁ツ        
                
              
                            
                2020-12-29 05:59
              
            
            
                                                                       
You've probably noticed that Python's syntax for data structures is very similar to JSON's syntax.

What's happening is Python's json library encodes Python's builtin datatypes directly into text chunks, replacing ' into " and deleting , here and there (to oversimplify a bit).

On the other hand, pyyaml has to construct a whole representation graph before serialising it into a string.

The same kind of stuff has to happen backwards when loading.

The only way to speedup yaml.load() would be to write a new Loader, but I doubt it could be a huge leap in performance, except if you're willing to write your own single-purpose sort-of YAML parser, taking the following comment in consideration:


  YAML builds a graph because it is a general-purpose serialisation
  format that is able to represent multiple references to the same
  object. If you know no object is repeated and only basic types appear,
  you can use a json serialiser, it will still be valid YAML.


-- UPDATE

What I said before remains true, but if you're running Linux there's a way to speed up Yaml parsing. By default, Python's yaml uses the Python parser. You have to tell it that you want to use PyYaml C parser.

You can do it this way:

import yaml
from yaml import CLoader as Loader, CDumper as Dumper

dump = yaml.dump(dummy_data, fh, encoding='utf-8', default_flow_style=False, Dumper=Dumper)
data = yaml.load(fh, Loader=Loader)


In order to do so, you need yaml-cpp-dev (package later renamed to libyaml-cpp-dev) installed, for instance with apt-get:

$ apt-get install yaml-cpp-dev


And PyYaml with LibYaml as well. But that's already the case based on your output.

I can't test it right now because I'm running OS X and brew has some trouble installing yaml-cpp-dev but if you follow PyYaml documentation, they are pretty clear that performance will be much better.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复