Minimizing the performance issues of loading a many to many relationship

不思量自难忘° 2021-01-18 11:53

I've been tokenizing an extremely large corpus. Each Unigram can occur in multiple Comments multiple times. I'm storing the Comment.ids in a list that is attached to the U

1 answer
  • 2021-01-18 12:15

    The answer to your specific question (I think): http://docs.sqlalchemy.org/en/rel_0_7/orm/collections.html#dynamic-relationship-loaders

    The default behavior of relationship() is to fully load the collection of items in ... A key feature to enable management of a large collection is the so-called “dynamic” relationship. This is an optional form of relationship() which returns a Query object in place of a collection when accessed.

    It looks like SQLAlchemy does indeed support not having to read a collection to modify it. So lazy='dynamic' is correct. It is possible that the problem is that you have it only on the backref. Try these two variants:

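    # Variant 1: dynamic loading on the forward side only.
    occurs_in = db.relationship('Comment', secondary=comments,
        lazy='dynamic', backref=db.backref('unigrams'))

    # Variant 2: dynamic loading on both the forward side and the backref.
    occurs_in = db.relationship('Comment', secondary=comments,
        lazy='dynamic', backref=db.backref('unigrams', lazy='dynamic'))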

    Also, you might try lazy='noload' instead, which never loads the collection. Since you are only writing to the tables during indexing, it should work the same for your purposes.

    Now, for the broader question: why do this at all? Doing it this way will be frustrating, even after you figure out this little problem. Some ideas...

    Use the right tool for the job: Sphinx, ElasticSearch, Lucene, Solr, or Xapian. Any one of these will handle text indexing thoroughly, and much better than you can without a specialized tool. Sphinx is especially fast: the indexing speed is hundreds of megabytes per second, and a query for how many documents contain a word usually takes a millisecond or two (regardless of corpus size).

    If you are doing a one-off script or test code rather than setting up a production system, and for some reason don't want to use the right tool, then do it all in memory and don't use SQL. Use plain dictionaries in Python and save them as pickle files on a ramdisk between runs. Buy more memory; it's cheaper than your time. This is not a bad way to test statistical ideas on a text corpus.
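
    As a rough illustration of the in-memory approach above, here is a minimal sketch using only the standard library. The function names, the whitespace tokenizer, and the sample comments are all hypothetical, and a temporary directory stands in for the ramdisk path, which will depend on your system.

```python
import pickle
import tempfile
from collections import defaultdict
from pathlib import Path


def build_index(comments):
    """Map each unigram to the list of comment ids it occurs in (repeats kept)."""
    index = defaultdict(list)
    for comment_id, text in comments.items():
        # Naive whitespace tokenizer: an assumption, swap in your own.
        for token in text.lower().split():
            index[token].append(comment_id)
    return dict(index)


def save_index(index, path):
    """Pickle the index so the next run can skip re-tokenizing."""
    with open(path, "wb") as fh:
        pickle.dump(index, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_index(path):
    """Load a previously pickled index."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Hypothetical sample data: comment id -> comment text.
sample_comments = {1: "the cat sat", 2: "the dog sat on the mat"}
index = build_index(sample_comments)

# A temporary directory stands in for the ramdisk mount point.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "unigram_index.pkl"
    save_index(index, path)
    restored = load_index(path)
```

    Keeping repeated comment ids in each list preserves the "multiple times per comment" counts; use a set or a Counter instead if only membership or totals matter.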

    If you really MUST put a text index in a SQL database for some reason (why?), then save yourself a lot of pain and don't use an object-relational mapper like SQLAlchemy. The best way to do this is to prepare a data dump in a suitable format (as a text file) and load it into the database in one shot (using something like LOAD DATA INFILE in MySQL, or the equivalent in your database). This is several orders of magnitude faster: it can easily be 1000x the speed of running individual INSERT queries for every unigram. You can still access the data later through SQLAlchemy, provided that you organized your tables in the right way, but while you are indexing your text you want to bypass it.
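
    The dump-then-load idea can be sketched as follows. This is only an approximation: LOAD DATA INFILE is specific to MySQL, so the sketch substitutes Python's built-in sqlite3 module and loads a tab-separated dump with a single executemany call inside one transaction. The table name, column names, and sample rows are hypothetical.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def write_dump(rows, path):
    """Write (unigram, comment_id) pairs as a tab-separated text file."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        csv.writer(fh, delimiter="\t").writerows(rows)


def bulk_load(conn, path):
    """Load the whole dump in one transaction rather than one commit per row."""
    with open(path, newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh, delimiter="\t")
        with conn:  # commits once at the end, rolls back on error
            conn.executemany(
                "INSERT INTO unigram_comment (unigram, comment_id) VALUES (?, ?)",
                reader,
            )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE unigram_comment ("
    "unigram TEXT NOT NULL, comment_id INTEGER NOT NULL)"
)

# Hypothetical tokenizer output: one row per occurrence.
rows = [("the", 1), ("cat", 1), ("the", 2), ("the", 2)]

with tempfile.TemporaryDirectory() as tmp:
    dump = Path(tmp) / "unigram_comment.tsv"
    write_dump(rows, dump)
    bulk_load(conn, dump)

loaded = conn.execute("SELECT COUNT(*) FROM unigram_comment").fetchone()[0]
docs_with_the = conn.execute(
    "SELECT COUNT(DISTINCT comment_id) FROM unigram_comment WHERE unigram = ?",
    ("the",),
).fetchone()[0]
```

    On MySQL the bulk_load step would presumably be replaced by a single LOAD DATA INFILE statement pointed at the same dump file; the tokenizing and dump-writing side stays the same.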
