Minimise search time in Python for a large CSV file

挽巷 2021-01-16 21:55

I have a CSV file with about 700 rows and 8 columns. The last column, however, holds a very large block of text (enough for several long paragraphs in each row).

1 answer
  •  小蘑菇 (original poster)
    2021-01-16 22:25

    You could load your CSV file into an SQLite database and use SQLite's full-text search capabilities to do the search for you.

    This example code shows how it could be done. There are a few things to be aware of:

    • It assumes that the CSV file has a header row, and that the header values are legal column names in SQLite. If this isn't the case, you'll need to quote them (or just use generic names like "col1", "col2", etc.).
    • It searches all columns in the CSV file; if that's undesirable, filter out the unwanted columns (and their header values) before creating the SQL statements.
    • If you want to be able to match the results to rows in the CSV file, you'll need to create a column that contains the line number.

    import csv
    import sqlite3
    import sys
    
    
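    def create_table(conn, headers, name='mytable'):
        """Create an FTS5 virtual table with one column per csv header."""
        # Headers are interpolated as-is, so they must be legal column names.
        cols = ', '.join([x.strip() for x in headers])
        stmt = f"""CREATE VIRTUAL TABLE {name} USING fts5({cols})"""
        with conn:
            conn.execute(stmt)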
    
    
    def populate_table(conn, reader, ncols, name='mytable'):
        """Insert every remaining row from the csv reader into the table."""
        placeholders = ', '.join(['?'] * ncols)
        stmt = f"""INSERT INTO {name}
        VALUES ({placeholders})
        """
        with conn:
            conn.executemany(stmt, reader)
    
    
    def search(conn, term, headers, name='mytable'):
        cols = ', '.join([x.strip() for x in headers])
        stmt = f"""SELECT {cols}
        FROM {name}
        WHERE {name} MATCH ?
        """
        with conn:
            cursor = conn.cursor()
            cursor.execute(stmt, (term,))
            result = cursor.fetchall()
        return result
    
    
    def main(path, term):
        # Create an in-memory database. Connect before the try block so that
        # conn is always defined when the finally clause runs.
        conn = sqlite3.connect(':memory:')
        try:
            # newline='' lets the csv module handle newlines inside quoted fields.
            with open(path, 'r', newline='') as f:
                reader = csv.reader(f)
                # Assume headers are in the first row
                headers = next(reader)
                create_table(conn, headers)
                ncols = len(headers)
                populate_table(conn, reader, ncols)
            return search(conn, term, headers)
        finally:
            conn.close()
    
    
    if __name__ == '__main__':
        print(main(*sys.argv[1:]))
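
    Two of the caveats listed before the code (headers that aren't legal column names, and matching results back to rows in the csv file) could be handled along the lines of the sketch below. It is only an illustration under some assumptions: quote_identifier, load_csv and the row_num column are made-up names, the sample data is invented, and row_num counts csv records (with the header as row 1) rather than physical lines, which can differ when a field contains embedded newlines.

```python
import csv
import io
import sqlite3


def quote_identifier(name):
    """Wrap a header in double quotes so SQLite accepts it as a column name."""
    return '"' + name.strip().replace('"', '""') + '"'


def load_csv(conn, reader, name='mytable'):
    """Create an FTS5 table from a csv reader, adding a row_num column."""
    headers = next(reader)
    cols = ', '.join(quote_identifier(h) for h in headers)
    # row_num is stored but not indexed, so MATCH never searches it.
    conn.execute(
        f'CREATE VIRTUAL TABLE {name} USING fts5(row_num UNINDEXED, {cols})'
    )
    placeholders = ', '.join(['?'] * (len(headers) + 1))
    # The header is row 1, so the first data row is row 2.
    rows = ([n] + row for n, row in enumerate(reader, start=2))
    with conn:
        conn.executemany(f'INSERT INTO {name} VALUES ({placeholders})', rows)


def search(conn, term, name='mytable'):
    """Return (row_num, *fields) for every row matching the FTS5 query."""
    stmt = f'SELECT * FROM {name} WHERE {name} MATCH ?'
    return conn.execute(stmt, (term,)).fetchall()


# Invented sample data; note the header containing a space.
sample = io.StringIO('id,body text\n1,alpha beta\n2,gamma needle delta\n')
conn = sqlite3.connect(':memory:')
load_csv(conn, csv.reader(sample))
hits = search(conn, 'needle')
conn.close()
print(hits)
```

    Each hit starts with the row number, so it can be traced back to the csv file. This relies on the SQLite build bundled with Python having FTS5 enabled, as the original code does.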
    
