Converting string to int is too slow

Asked by 萌比男神i on 2021-01-12 00:28

I've got a program that reads in three strings per line for 50,000 lines. It then does other things. The part that reads the file and converts the strings to integers is taking 80% of the total runtime.

3 Answers
  • 2021-01-12 01:00

    If the file is in the OS cache, then parsing it takes milliseconds on my machine:

    name                                 time ratio comment
    read_read                        145 usec  1.00 big.txt
    read_readtxt                    2.07 msec 14.29 big.txt
    read_readlines                  4.94 msec 34.11 big.txt
    read_james_otigo                29.3 msec 201.88 big.txt
    read_james_otigo_with_int_float 82.9 msec 571.70 big.txt
    read_map_local                  93.1 msec 642.23 big.txt
    read_map                        95.6 msec 659.57 big.txt
    read_numpy_loadtxt               321 msec 2213.66 big.txt
    

    Where the read_*() functions are:

    def read_read(filename):
        with open(filename, 'rb') as file:
            data = file.read()
    
    def read_readtxt(filename):
        with open(filename, 'r') as file:  # universal-newline mode is the default in Python 3
            text = file.read()
    
    def read_readlines(filename):
        with open(filename, 'r') as file:  # universal-newline mode is the default in Python 3
            lines = file.readlines()
    
    def read_james_otigo(filename):
        lines = open(filename).readlines()  # 'lines' avoids shadowing Python 2's builtin file
        for line in lines[1:]:
            label1, label2, edge = line.strip().split()
    
    def read_james_otigo_with_int_float(filename):
        lines = open(filename).readlines()
        for line in lines[1:]:
            label1, label2, edge = line.strip().split()
            label1 = int(label1); label2 = int(label2); edge = float(edge)
    
    def read_map(filename):
        with open(filename) as file:
            # the single-item inner loop binds the three split fields
            # to names inside the comprehension
            L = [(int(l1), int(l2), float(edge))
                 for line in file
                 for l1, l2, edge in [line.split()] if line.strip()]
    
    def read_map_local(filename, _i=int, _f=float):
        # default arguments make int/float local names, which are
        # slightly faster to look up than globals
        with open(filename) as file:
            L = [(_i(l1), _i(l2), _f(edge))
                 for line in file
                 for l1, l2, edge in [line.split()] if line.strip()]
    
    import numpy as np
    
    def read_numpy_loadtxt(filename):
        a = np.loadtxt(filename, dtype=[('label1', 'i'),
                                        ('label2', 'i'),
                                        ('edge', 'f')])
    

    And big.txt is generated using:

    #!/usr/bin/env python
    import numpy as np
    
    n = 50000
    a = np.random.randint(low=0, high=(1 << 10) + 1, size=2*n).reshape(-1, 2)  # high is exclusive
    np.savetxt('big.txt', np.c_[a, np.random.rand(n)], fmt='%i %i %s')
    

    It produces 50000 lines:

    150 952 0.355243621018
    582 98 0.227592557278
    478 409 0.546382780254
    46 879 0.177980983303
    ...
    

    To reproduce results, download the code and run:

    # write big.txt
    python generate-file.py
    # run benchmark
    python read-array.py
    
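    For reference, a harness along the following lines reproduces such a table. This is only a sketch of the idea, not the actual read-array.py; the timeit usage and output formatting here are my assumptions:

    import timeit

    def benchmark(funcs, filename='big.txt', repeat=5, number=1):
        # Take the minimum of several repeats to reduce noise from
        # other processes and cold caches.
        results = []
        for func in funcs:
            t = min(timeit.repeat(lambda: func(filename),
                                  repeat=repeat, number=number)) / number
            results.append((t, func.__name__))
        results.sort()
        base = results[0][0]
        print('%-33s %10s %8s %s' % ('name', 'time', 'ratio', 'comment'))
        for t, name in results:
            print('%-33s %9.3g s %8.2f %s' % (name, t, t / base, filename))

    benchmark([read_read, read_readtxt, read_readlines,
               read_james_otigo, read_james_otigo_with_int_float,
               read_map, read_map_local, read_numpy_loadtxt])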
  • 2021-01-12 01:02

    I can't reproduce this at all.

    I have generated a file of 50000 lines, containing three random numbers (two ints, one float) on each line, separated by spaces.

    I then ran your script on that file. The original script finishes in 0.05 seconds on my three-year-old PC; with the conversion line uncommented it takes 0.15 seconds. Of course string-to-int/float conversion takes longer, but certainly not on the scale of several seconds. Unless your target machine is a toaster running embedded Windows CE.
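
    If you want to check this without my exact script, a minimal self-contained test (a file generator plus a timed parse; the names here are mine, not the original poster's code) looks like:

    import random, time

    # Write 50,000 lines of "int int float", as described above.
    with open('big.txt', 'w') as f:
        for _ in range(50000):
            f.write('%d %d %s\n' % (random.randint(0, 1024),
                                    random.randint(0, 1024),
                                    random.random()))

    def parse(convert):
        with open('big.txt') as f:
            for line in f:
                a, b, c = line.split()
                if convert:
                    a, b, c = int(a), int(b), float(c)

    # Compare plain splitting against splitting plus int/float conversion.
    for convert in (False, True):
        start = time.perf_counter()
        parse(convert)
        print('convert=%s: %.3f s' % (convert, time.perf_counter() - start))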

  • 2021-01-12 01:14

    I'm able to get almost the same timings as yours. I think the problem was with the code that was doing my timings:

    name                                 time ratio comment
    read_read                        488 usec  1.00 big.txt
    read_readtxt                    4.36 msec  8.95 big.txt
    read_readlines                  9.24 msec 18.95 big.txt
    read_james_otigo                  40 msec 82.13 big.txt
    read_james_otigo_with_int_float  116 msec 238.64 big.txt
    read_map_local                   131 msec 268.05 big.txt
    read_map                         134 msec 274.87 big.txt
    read_numpy_loadtxt               400 msec 819.42 big.txt

    A second run gives essentially the same numbers:

    name                                 time ratio comment
    read_read                        487 usec  1.00 big.txt
    read_readtxt                    4.37 msec  8.96 big.txt
    read_readlines                  9.21 msec 18.90 big.txt
    read_james_otigo                39.4 msec 80.81 big.txt
    read_james_otigo_with_int_float  116 msec 238.51 big.txt
    read_map_local                   131 msec 268.84 big.txt
    read_map                         134 msec 275.11 big.txt
    read_numpy_loadtxt               398 msec 816.71 big.txt
    
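
    The takeaway is to let timeit do the bookkeeping instead of a hand-rolled timing loop; the exact bug in my original timing code isn't shown here, but the safe pattern is something like:

    import timeit

    # timeit disables the garbage collector during runs and repeats the
    # call, so one-off effects (imports, cold caches) don't skew a reading.
    best = min(timeit.repeat('read_map("big.txt")',
                             setup='from __main__ import read_map',
                             repeat=5, number=10)) / 10
    print('read_map: %.1f msec' % (best * 1000))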