Read the last N lines of a CSV file in Python with numpy / pandas

北荒 2021-01-13 18:05

Is there a quick way to read the last N lines of a CSV file in Python, using numpy or pandas?

  1. I cannot do skip_header

3 answers
  • 2021-01-13 18:48

    With a small 10-line test file I tried two approaches: parse the whole file and select the last N rows, versus load all lines but parse only the last N:

    In [1025]: timeit np.genfromtxt('stack38704949.txt',delimiter=',')[-5:]
    1000 loops, best of 3: 741 µs per loop
    
    In [1026]: %%timeit 
          ...: with open('stack38704949.txt','rb') as f:
          ...:      lines = f.readlines()
          ...: np.genfromtxt(lines[-5:],delimiter=',')
    
    1000 loops, best of 3: 378 µs per loop
    

    This was tagged as a duplicate of Efficiently Read last 'n' rows of CSV into DataFrame. The accepted answer there used

    from collections import deque
    

    and collected the last N lines in that structure. It also used StringIO to feed the lines to the parser, which is an unnecessary complication. genfromtxt takes input from anything that gives it lines, so a list of lines is just fine.

    In [1031]: %%timeit 
          ...: with open('stack38704949.txt','rb') as f:
          ...:      lines = deque(f,5)
          ...: np.genfromtxt(lines,delimiter=',') 
    
    1000 loops, best of 3: 382 µs per loop
    

    Basically the same time as readlines and slice.
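
    The deque variant timed above can be wrapped in a small function. This is only a sketch: the name tail_genfromtxt and the generated test data are made up for illustration, and it assumes a purely numeric file with no header line.

```python
from collections import deque
import os
import tempfile

import numpy as np


def tail_genfromtxt(path, n, delimiter=","):
    """Parse only the last n lines of a delimited numeric file."""
    with open(path, "r") as f:
        # A deque with maxlen=n drops older lines as newer ones arrive,
        # so at most n lines are held in memory at any time.
        last_lines = deque(f, maxlen=n)
    # genfromtxt accepts a list of lines, so no StringIO is needed.
    return np.genfromtxt(list(last_lines), delimiter=delimiter)


# Build a 10-line numeric test file (hypothetical data).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    for i in range(10):
        tmp.write(f"{i},{i * 2},{i * 3}\n")

tail = tail_genfromtxt(tmp.name, 5)
os.remove(tmp.name)
print(tail)
```

    Every line is still read once; only the parsing is limited to the last n lines.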

    deque may have an advantage when the file is very large, and it gets costly to hang onto all the lines. I don't think it saves any file reading time. Lines still have to be read one by one.
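
    If the reading time itself matters for a very large file, one alternative that was not timed above is to seek to the end of the file and read backwards in blocks until enough newlines have been seen. The sketch below is an assumption-laden illustration, not part of the measurements: tail_lines and block_size are hypothetical names, and it assumes newline-terminated records with no line breaks inside quoted fields.

```python
import os
import tempfile


def tail_lines(path, n, block_size=4096):
    """Return the last n lines of a file as bytes, reading backwards in blocks."""
    if n <= 0:
        return []
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        position = f.tell()
        data = b""
        # Keep reading until more than n newlines are buffered, so that
        # the first of the n kept lines is guaranteed to be complete.
        while position > 0 and data.count(b"\n") <= n:
            step = min(block_size, position)
            position -= step
            f.seek(position)
            data = f.read(step) + data
    return data.splitlines()[-n:]


# Hypothetical 10-line test file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    for i in range(10):
        tmp.write(f"{i},{i * 2}\n")

last_two = tail_lines(tmp.name, 2, block_size=8)
os.remove(tmp.name)
print(last_two)
```

    The returned list of lines can then be handed to np.genfromtxt in the same way as the readlines slice above.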

    Timings for the approach that counts rows and then uses skip_header are slower, because it reads the file twice. skip_header still has to read the lines it skips.

    In [1046]: %%timeit 
          ...: with open('stack38704949.txt',"r") as f:
          ...:     reader = csv.reader(f,delimiter = ",")
          ...:     data = list(reader)
          ...:     row_count = len(data)
          ...: np.genfromtxt('stack38704949.txt',skip_header=row_count-5,delimiter=',')
    
    The slowest run took 5.96 times longer than the fastest. This could mean that an intermediate result is being cached.
    1000 loops, best of 3: 760 µs per loop
    

    For purposes of counting lines we don't need to use csv.reader, though it doesn't appear to cost much extra time.

    In [1048]: %%timeit 
          ...: with open('stack38704949.txt',"r") as f:
          ...:    lines=f.readlines()
          ...:    row_count = len(lines)
          ...: np.genfromtxt('stack38704949.txt',skip_header=row_count-5,delimiter=',')
    
    1000 loops, best of 3: 736 µs per loop
    
  • 2021-01-13 18:53

    Option 1

    You can read the entire file with numpy.genfromtxt, get it as a numpy array, and take the last N rows:

    a = np.genfromtxt('filename', delimiter=',')
    lastN = a[-N:]
    

    Option 2

    You can do a similar thing with the usual file reading:

    with open('filename') as f:
        lastN = list(f)[-N:]
    

    but this time you will get the list of last N lines, as strings.
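
    If a DataFrame is wanted instead of strings, the lines can be joined and passed to pandas. A minimal sketch, assuming the file starts with one header line (the file contents below are invented for the example):

```python
import io
import os
import tempfile

import pandas as pd

# Hypothetical file: a header line followed by 10 data rows.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("a,b\n")
    for i in range(10):
        tmp.write(f"{i},{i * 10}\n")

N = 3
with open(tmp.name) as f:
    header = f.readline()   # keep the column names
    last_n = list(f)[-N:]   # Option 2: the last N lines as strings
os.remove(tmp.name)

# read_csv expects a path or a file-like object, so wrap the text in StringIO.
df = pd.read_csv(io.StringIO(header + "".join(last_n)))
print(df)
```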

    Option 3 - without reading the entire file to memory

    We keep a list of at most N items that, at each iteration, holds the last N lines read so far:

    lines = []
    N = 10
    with open('csv01.txt') as f:
        for line in f:
            lines.append(line)
            if len(lines) > N:
                lines.pop(0)
    

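    A real csv requires a minor change: iterate over csv.reader, which yields each row as a list of fields instead of a raw string:

    import csv

    lines = []
    N = 10
    with open('csv01.txt', newline='') as f:
        for row in csv.reader(f):  # each row is a list of field strings
            lines.append(row)
            if len(lines) > N:
                lines.pop(0)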
    
  • 2021-01-13 19:00

    Use the skiprows parameter of pandas read_csv(). The harder part is finding the number of lines in the csv. Here is a possible solution:

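    import csv
    import pandas as pd

    with open('filename', "r") as f:
        reader = csv.reader(f, delimiter=",")
        data = list(reader)
        row_count = len(data)

    # With an integer skiprows, the first remaining line is used as the header
    # unless header=None is passed.
    df = pd.read_csv('filename', skiprows=row_count - N)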
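
    When the file has a header line that should be kept, skiprows can be given as a range that starts at 1 instead of a plain count. This is a sketch on invented data; it counts physical lines, so it assumes no quoted field contains a line break.

```python
import os
import tempfile

import pandas as pd

# Hypothetical file: a header line followed by 10 data rows.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("x,y\n")
    for i in range(10):
        tmp.write(f"{i},{i * i}\n")

N = 4
with open(tmp.name) as f:
    row_count = sum(1 for _ in f)  # physical lines, header included

# Skip lines 1 .. row_count-N-1; line 0 (the header) is not skipped.
df = pd.read_csv(tmp.name, skiprows=range(1, row_count - N))
os.remove(tmp.name)
print(df)
```

    The file is still read twice, once to count lines and once to parse, so the timing caveat for the skip_header approach applies here as well.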
    