How to commit model instances and remove them from working memory a few at a time


Question


I have a Pyramid view that is used for loading data from a large file into a database. For each line in the file it does a little processing, then creates some model instances and adds them to the session. This works fine except when the files are big: for large files the view slowly eats up all my RAM until everything effectively grinds to a halt.

So my idea is to process each line individually with a function that creates a session, creates the necessary model instances and adds them to the current session, then commits.

def commit_line(lTitles,lLine,oStartDate,oEndDate,iDS,dSettings):
    from sqlalchemy.orm import (
            scoped_session,
            sessionmaker,
    )
    from sqlalchemy import engine_from_config
    from pyramidapp.models import Base, DataEntry
    from zope.sqlalchemy import ZopeTransactionExtension
    import transaction

    # build a fresh session and engine just for this one line
    oCurrentDBSession = scoped_session(sessionmaker(extension=ZopeTransactionExtension()))
    engine = engine_from_config(dSettings, 'sqlalchemy.')
    oCurrentDBSession.configure(bind=engine)
    Base.metadata.bind = engine

    # create the model instances for this line and add them to the session
    oEntry = DataEntry()
    oCurrentDBSession.add(oEntry)
    ...
    transaction.commit()

My requirements for this function are as follows:

  1. create a session (check)
  2. make a bunch of model instances (check)
  3. add those instances to the session (check)
  4. commit those models to the database
  5. get rid of the session (so that it and the objects created in 2 are garbage collected)

I've made sure that the newly created session is passed as an argument wherever necessary, to avoid errors about objects belonging to multiple sessions and so on. But alas! I can't get the database connections to go away, and stuff isn't being committed.

I tried separating the function out into a Celery task so the view executes to completion and does what it needs to, but I'm getting an error in Celery about having too many MySQL connections no matter what I try in terms of committing, closing and disposing, and I'm not sure why. And yes, I restart the Celery worker when I make changes.

Surely there is a simple way to do this? All I want to do is make a session, commit it, and then have it go away and leave me alone.


Answer 1:


Creating a new session for each line of your large file is going to be quite slow, I would imagine.

What I would try is to commit the session and expunge all objects from it every 1000 rows or so:

counter = 0

for line in mymegafile:
    entry = process_line(line)
    session.add(entry)
    if counter > 1000:
        counter = 0
        transaction.commit()  # if you insist on using ZopeTransactionExtension, otherwise session.commit()
        session.expunge_all() # this may not be required actually, see https://groups.google.com/forum/#!topic/sqlalchemy/We4XGX2CYX8
    else:
        counter += 1

If there are no references to the DataEntry instances from anywhere, they should be garbage collected by the Python interpreter at some point.

However, if all you're doing in that view is inserting new records into the database, it may be much more efficient to use SQLAlchemy Core constructs or literal SQL to bulk-insert the data. This would also get rid of the problem of your ORM instances eating up your RAM. See "I'm inserting 400,000 rows with the ORM and it's really slow!" in the SQLAlchemy FAQ for details.
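
For illustration, a rough sketch of such a Core-style bulk insert (the 1000-row batch size and the 'title'/'value' column names are assumptions, not part of the original question; DataEntry.__table__ is the table mapped by the declarative model):

from sqlalchemy import create_engine
from pyramidapp.models import DataEntry

engine = create_engine(dSettings['sqlalchemy.url'])

rows = []
for line in mymegafile:
    # build plain dicts instead of ORM instances; column names are made up
    rows.append({'title': line.strip(), 'value': 42})
    if len(rows) >= 1000:
        with engine.begin() as conn:   # commits the batch, rolls back on error
            conn.execute(DataEntry.__table__.insert(), rows)
        rows = []

if rows:
    # flush the final, possibly short, batch
    with engine.begin() as conn:
        conn.execute(DataEntry.__table__.insert(), rows)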




Answer 2:


So I tried a bunch of things, and although it was probably possible to solve this using SQLAlchemy's built-in functionality, I could not find any way of pulling that off.

So here's an outline of what I did:

  1. separate the lines to be processed into batches
  2. for each batch of lines, queue up a celery task to deal with those lines
  3. in the celery task, a separate process is launched that does the necessary stuff with the lines.

Reasoning:

  1. The batching part is obvious
  2. Celery was used because it took a heck of a long time to process an entire file, so queuing just made sense
  3. the task launches a separate process because otherwise I had the same problem that I had with the Pyramid application

Some code:

Celery task:

def commit_lines(lLineData,dSettings,cwd):
    """
    writes the line data to a file then calls a process that reads the file and creates
    the necessary data entries. Then deletes the file
    """
    import lockfile
    sFileName = "/home/sheena/tmp/cid_line_buffer"
    lock = lockfile.FileLock("{0}_lock".format(sFileName))
    with lock:
        f = open(sFileName,'a') #in case the process was at any point interrupted...
        for d in lLineData:
            f.write('{0}\n'.format(d))
        f.close()

    #now call the external process
    import subprocess
    import os
    sConnectionString = dSettings.get('sqlalchemy.url')
    lArgs = [
                'python',os.path.join(cwd,'commit_line_file.py'),
                '-c',sConnectionString,
                '-f',sFileName
        ]
    #open the subprocess. wait for it to complete before continuing with stuff. if errors: raise
    subprocess.check_call(lArgs,shell=False)
    #and clear the file
    lock = lockfile.FileLock("{0}_lock".format(sFileName))
    with lock:
        f = open(sFileName,'w')
        f.close()

External process:

"""
this script goes through all lines in a file and creates data entries from the lines
"""
def main():
    from optparse import OptionParser
    from sqlalchemy import create_engine
    from pyramidapp.models import Base,DBSession

    import ast
    import transaction

    #get options

    oParser = OptionParser()
    oParser.add_option('-c','--connection_string',dest='connection_string')
    oParser.add_option('-f','--input_file',dest='input_file')
    (oOptions, lArgs) = oParser.parse_args()

    #set up connection

    #engine = engine_from_config(dSettings, 'sqlalchemy.')
    engine = create_engine(
        oOptions.connection_string,
        echo=False)
    DBSession.configure(bind=engine)
    Base.metadata.bind = engine

    #commit stuffs
    import lockfile
    lock = lockfile.FileLock("{0}_lock".format(oOptions.input_file))
    with lock:
        for sLine in open(oOptions.input_file,'r'):
            dLine = ast.literal_eval(sLine)
            create_entry(**dLine)

    transaction.commit()

def create_entry(iDS,oStartDate,oEndDate,lTitles,lValues):
    #import stuff
    oEntry = DataEntry()
    #do some other stuff, make more model instances...
    DBSession.add(oEntry)


if __name__ == "__main__":
    main()

in the view:

lLineData = []
for line in big_giant_csv_file_handler:
    lLineData.append({'stuff':'lots'})

if lLineData:
    lLineSets = [lLineData[i:i+iBatchSize] for i in range(0,len(lLineData),iBatchSize)]
    for l in lLineSets:
        commit_lines.delay(l,dSettings,sCWD)  #queue it for celery



Answer 3:


You are just doing it wrong. Period.

Quoted from SQLAlchemy docs

The advanced developer will try to keep the details of session, transaction and exception management as far as possible from the details of the program doing its work.

Quoted from Pyramid docs

We made the decision to use SQLAlchemy to talk to our database. We also, though, installed pyramid_tm and zope.sqlalchemy.

Why?

Pyramid has a strong orientation towards support for transactions. Specifically, you can install a transaction manager into your app, either as middleware or a Pyramid "tween". Then, just before you return the response, all transaction-aware parts of your application are executed. This means Pyramid view code usually doesn't manage transactions.

My answer today is not code, but a recommendation to follow best practices recommended by the authors of the packages/frameworks you are working with.
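
For orientation, a minimal sketch of the pattern those documents describe, assuming pyramid_tm and zope.sqlalchemy are wired up as in the Pyramid SQLAlchemy tutorial (the route name and upload field name here are made up):

from pyramid.view import view_config
from .models import DBSession, DataEntry

@view_config(route_name='import_csv', renderer='json')
def import_view(request):
    # no commit() here: pyramid_tm commits the transaction when the request
    # succeeds and rolls it back on an exception; zope.sqlalchemy joins
    # DBSession to that managed transaction
    for line in request.POST['csv_file'].file:
        DBSession.add(DataEntry())  # build the instance from the line as needed
    return {'status': 'ok'}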

References

  • Big picture - Using Thread-Local Scope with Web Applications
  • Typical error message when doing it wrong
  • Databases using SQLAlchemy
  • How to use scoped_session



Answer 4:


Encapsulate the CSV reading and the creation of SQLAlchemy model instances into something that supports the iterator protocol. I called it BatchingModelReader. It returns a collection of DataEntry instances; the collection size depends on the batch size. If the model changes over time, you do not need to change the Celery task. The task only puts a batch of models into a session and commits the transaction. By controlling the batch size you control memory consumption. Neither BatchingModelReader nor the Celery task holds on to huge amounts of intermediate data. This example also shows that using Celery is only an option. I added links to code samples of a Pyramid application I am currently refactoring in a GitHub fork.

BatchingModelReader:

  • encapsulates csv.reader and uses the existing models from your Pyramid application
  • get inspired by the source code of csv.DictReader
  • could be run as a Celery task - use the appropriate task decorator
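
A minimal sketch of what such a reader might look like (hypothetical: it assumes one DataEntry per CSV row and made-up column names):

import csv
from pyramidapp.models import DataEntry

class BatchingModelReader(object):
    """Iterate over a CSV file, yielding lists of at most batchsize model instances."""

    def __init__(self, path_to_csv, batchsize):
        self.path_to_csv = path_to_csv
        self.batchsize = batchsize

    def __iter__(self):
        batch = []
        with open(self.path_to_csv) as f:
            for row in csv.DictReader(f):
                # mapping CSV columns to model attributes is an assumption
                batch.append(DataEntry(title=row['title'], value=row['value']))
                if len(batch) >= self.batchsize:
                    yield batch
                    batch = []
        if batch:
            # last, possibly short, batch
            yield batch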

from .models import DBSession
import transaction

def import_from_csv(path_to_csv, batchsize):
    """given a CSV file and batchsize iterate over batches of model instances and import them to database"""
    for batch in BatchingModelReader(path_to_csv, batchsize):
        with transaction.manager:
            DBSession.add_all(batch)

Pyramid view - just save the big giant CSV file, start the task, return a response immediately

@view_config(...)
def view(request):
    """gets file from request, save it to filesystem and start celery task"""
    with open(path_to_csv, 'w') as f:
        f.write(big_giant_csv_file)

    #start task with parameters
    import_from_csv.delay(path_to_csv, 1000)

Code samples

  • ToDoPyramid - commit transaction from commandline
  • ToDoPyramid - commit transaction from request

Pyramid using SQLAlchemy

  • Databases using SQLAlchemy

SQLAlchemy internals

  • Big picture - Using Thread-Local Scope with Web Applications
  • How to use scoped_session


Source: https://stackoverflow.com/questions/21425879/how-to-commit-model-instances-and-remove-them-from-working-memory-a-few-at-a-tim
