Loading 5 million rows into Pandas from MySQL

栀梦 2021-02-06 08:36

I have 5 million rows in a MySQL DB sitting over the (local) network (so quick connection, not on the internet).

The connection to the DB works fine, but if I try to read all of the rows into a DataFrame in a single query, it runs for a very long time with no indication of progress.
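Roughly speaking, the kind of single-shot read I mean looks like the sketch below (the connection string and table name are placeholders, not my real schema):

    import pandas as pd
    from sqlalchemy import create_engine

    # Hypothetical connection details and table name, for illustration only.
    engine = create_engine("mysql+pymysql://user:password@db-host/mydb")

    # Reading all ~5 million rows in one call: this is the step that runs
    # for a very long time with no feedback.
    df = pd.read_sql("SELECT * FROM big_table", engine)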

3 Answers
  •  我在风中等你
    2021-02-06 09:12

    I had a similar issue while working with an Oracle DB (it turned out it was simply taking a long time to retrieve all the data, during which I had no idea how far along it was or whether anything had gone wrong). My solution was to stream the results of my query into a set of CSV files and then load those into Pandas.

    I'm sure there are faster ways of doing this, but this worked surprisingly well for datasets of around 8 million rows.

    You can see the code I used on my GitHub page in easy_query.py, but the core function looked like this:

    import cx_Oracle as ora
    import pandas as pd

    def SQLCurtoCSV(sqlstring, connstring, filename, chunksize):
        # Stream the query results into numbered CSV files of up to
        # `chunksize` rows each; '%%' in `filename` is replaced by the
        # chunk number.
        connection = ora.connect(connstring)
        cursor = connection.cursor()
        cursor.arraysize = 256          # rows fetched per round trip
        cursor.execute(sqlstring)
        colnames = [rec[0] for rec in cursor.description]
        rows = []                       # buffer for the current chunk
        i = 0                           # chunk counter
        for row in cursor:
            rows.append(row)
            if len(rows) >= chunksize:
                i += 1
                df = pd.DataFrame.from_records(rows, columns=colnames)
                df.to_csv(filename.replace('%%', str(i)), sep='|')
                rows = []
        if rows:                        # flush any remaining partial chunk
            i += 1
            df = pd.DataFrame.from_records(rows, columns=colnames)
            df.to_csv(filename.replace('%%', str(i)), sep='|')
        cursor.close()
        connection.close()
    
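    For example, with a hypothetical query and connection string, the '%%' token in the filename is replaced by the chunk number:

        # Placeholder query and connection string, for illustration only.
        SQLCurtoCSV("SELECT * FROM big_table",
                    "user/password@dbhost/service",
                    "extract_%%.csv",
                    chunksize=500000)
        # Produces extract_1.csv, extract_2.csv, ... of up to 500,000 rows each.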

    The surrounding module imports cx_Oracle to provide the database API calls, but I'd expect equivalent functionality to be available from one of the MySQL client libraries (see the sketch below).
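    I haven't tested this against MySQL myself, but a rough equivalent using PyMySQL's unbuffered (server-side) cursor might look like this; the connection details below are placeholders:

        import pymysql
        import pymysql.cursors
        import pandas as pd

        def mysql_query_to_csv(sqlstring, filename, chunksize):
            # SSCursor streams rows from the server instead of buffering the
            # whole result set in memory (hypothetical connection details).
            connection = pymysql.connect(host="db-host", user="user",
                                         password="password", database="mydb",
                                         cursorclass=pymysql.cursors.SSCursor)
            try:
                with connection.cursor() as cursor:
                    cursor.execute(sqlstring)
                    colnames = [rec[0] for rec in cursor.description]
                    rows, i = [], 0
                    for row in cursor:
                        rows.append(row)
                        if len(rows) >= chunksize:
                            i += 1
                            df = pd.DataFrame.from_records(rows, columns=colnames)
                            df.to_csv(filename.replace('%%', str(i)), sep='|')
                            rows = []
                    if rows:  # flush the final partial chunk
                        i += 1
                        df = pd.DataFrame.from_records(rows, columns=colnames)
                        df.to_csv(filename.replace('%%', str(i)), sep='|')
            finally:
                connection.close()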

    What's nice is that you can see the files building up in your chosen directory, so you get some kind of feedback as to whether your extract is working, and how many results per second/minute/hour you can expect to receive.

    It also means you can work on the initial files whilst the rest are being fetched.

    Once all the data has been saved to the individual files, it can be loaded into a single Pandas DataFrame with multiple pandas.read_csv calls and a pandas.concat, for example:
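        import glob
        import pandas as pd

        # Gather the '|'-separated chunk files written by the extract and
        # stitch them together (filename pattern is a placeholder).
        paths = sorted(glob.glob("extract_*.csv"))
        frames = [pd.read_csv(p, sep='|', index_col=0) for p in paths]
        data = pd.concat(frames, ignore_index=True)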
