I noticed that SQLAlchemy was slow fetching (and ORMing) some data that was quite fast to fetch using bare-bones SQL. First off, I created a database with a million records.
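The exact setup script isn't shown here; a minimal sketch that would produce a matching `foo` table (column names taken from the Table definition below, values arbitrary) might look like this:

import MySQLdb

conn = MySQLdb.connect(user='scott', passwd='tiger', db='test')
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS foo (
        id INT AUTO_INCREMENT PRIMARY KEY,
        a INT NOT NULL,
        b INT NOT NULL,
        c INT NOT NULL
    )
""")
# insert a million rows; chunking the executemany() would be kinder to memory
cur.executemany(
    "INSERT INTO foo (a, b, c) VALUES (%s, %s, %s)",
    [(i % 100, i % 1000, i) for i in range(1000000)]
)
conn.commit()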
Here is the SQLAlchemy version of your MySQL script that performs in four seconds, compared to three for MySQLdb:
from sqlalchemy import Integer, Column, create_engine, MetaData, Table
import datetime

metadata = MetaData()

foo = Table(
    'foo', metadata,
    Column('id', Integer, primary_key=True),
    Column('a', Integer(), nullable=False),
    Column('b', Integer(), nullable=False),
    Column('c', Integer(), nullable=False),
)


class Foo(object):
    def __init__(self, a, b, c):
        self.a = a
        self.b = b
        self.c = c

engine = create_engine('mysql+mysqldb://scott:tiger@localhost/test', echo=True)

start = datetime.datetime.now()

with engine.connect() as conn:
    foos = [
        Foo(row['a'], row['b'], row['c'])
        for row in
        conn.execute(foo.select().limit(1000000)).fetchall()
    ]

print "total time: ", datetime.datetime.now() - start
runtime:
total time: 0:00:04.706010
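For reference, the raw MySQLdb baseline behind the three-second figure would look roughly like this (a sketch, since the original script isn't shown; it reuses the plain Foo class from above):

import datetime
import MySQLdb

conn = MySQLdb.connect(user='scott', passwd='tiger', db='test')
cursor = conn.cursor()

start = datetime.datetime.now()
cursor.execute("SELECT a, b, c FROM foo LIMIT 1000000")
# build plain Python objects straight from the DB-API rows
foos = [Foo(a, b, c) for a, b, c in cursor.fetchall()]
print "total time: ", datetime.datetime.now() - start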
Here is a script that uses the ORM to load object rows fully; by using yield_per() to avoid building a fixed list of all 1M objects at once, this runs in about 13 seconds with SQLAlchemy master (18 seconds with release 0.9):
import time
from sqlalchemy import Integer, Column, create_engine, Table
from sqlalchemy.orm import Session
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class Foo(Base):
    __table__ = Table(
        'foo', Base.metadata,
        Column('id', Integer, primary_key=True),
        Column('a', Integer(), nullable=False),
        Column('b', Integer(), nullable=False),
        Column('c', Integer(), nullable=False),
    )

engine = create_engine('mysql+mysqldb://scott:tiger@localhost/test', echo=True)

sess = Session(engine)

now = time.time()

# avoid using all() so that we don't have the overhead of building
# a large list of full objects in memory
for obj in sess.query(Foo).yield_per(100).limit(1000000):
    pass

print("Total time: %d" % (time.time() - now))
We can then split the difference between these two approaches, and load just individual columns with the ORM:
for obj in sess.query(Foo.id, Foo.a, Foo.b, Foo.c).yield_per(100).limit(1000000):
    pass
The above again runs in 4 seconds.
SQLAlchemy Core is the more apt comparison to a raw MySQLdb cursor. If you use the ORM but query for individual columns, the most recent versions take about four seconds.
At the ORM level, the speed issues are because creating objects in Python is slow, and the SQLAlchemy ORM applies a large amount of bookkeeping to these objects as it fetches them, which is necessary in order for it to fulfill its usage contract, including unit of work, identity map, eager loading, collections, etc.
To speed up the query dramatically, fetch individual columns instead of full objects. See the techniques at http://docs.sqlalchemy.org/en/latest/faq/performance.html#result-fetching-slowness-orm which describe this.
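One such technique from that FAQ, beyond selecting plain columns as above, is the ORM Bundle, which groups columns under one name and returns lightweight tuple-like rows instead of full Foo instances. A rough sketch:

from sqlalchemy.orm import Bundle

# rows come back as keyed tuples under the bundle name, no full Foo objects
abc = Bundle('abc', Foo.a, Foo.b, Foo.c)
for row in sess.query(abc).yield_per(100).limit(1000000):
    a, b, c = row.abc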
As for your comparison to Peewee: Peewee is a much simpler system with far fewer features, and among other things it does nothing with identity maps. Even Peewee, about as simple an ORM as is feasible, still takes 15 seconds, which is evidence that CPython is really, really slow compared to the raw MySQLdb fetch, which runs in straight C.
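For reference, a Peewee version of that loop would look roughly like this (a sketch with assumed model definitions, not the exact script behind the 15-second figure):

from peewee import MySQLDatabase, Model, IntegerField

db = MySQLDatabase('test', user='scott', password='tiger')


class Foo(Model):
    a = IntegerField()
    b = IntegerField()
    c = IntegerField()

    class Meta:
        database = db  # table name defaults to 'foo'

# .iterator() avoids caching each row, similar in spirit to yield_per()
for foo in Foo.select().limit(1000000).iterator():
    pass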
For the comparison to Java, the Java VM is way, way faster than CPython. Hibernate is ridiculously complicated, yet thanks to the JIT the Java VM is extremely fast, and even all that complexity ends up running faster. If you want to compare Python to Java, use PyPy.
This is not an answer to my question, but it may help the general public with speed issues on large data sets. I found that selecting a million records can typically be done in about 3 seconds; however, JOINs may slow the process down. In this case there are approximately 150k Foos with a 1-to-many relation to 1M Bars, and selecting those using a JOIN may be slow, since each Foo is returned approximately 6.5 times.

I found that selecting both tables separately and joining them using dicts in Python is approximately 3 times faster than SQLAlchemy (approx. 25 sec) and 2 times faster than 'bare' Python code using joins (approx. 17 sec); the code below took 8 seconds in my use case. Selecting 1M records without relations, like the Bar example above, took 3 seconds. I used this code:
#!/usr/bin/python
# -*- coding: utf-8 -*-

import MySQLdb
import sys
import time
import datetime
import inspect
from operator import itemgetter, attrgetter


# fetch all objects of class Class, where the fields are determined as the
# arguments of the __init__ constructor (not flexible, but fairly simple ;))
def fetch(Class, cursor, tablename, ids=["id"], where=None):
    arguments = inspect.getargspec(Class.__init__).args
    del arguments[0]  # drop 'self'
    fields = ", ".join(["`" + tablename + "`.`" + column + "`" for column in arguments])
    sql = "SELECT " + fields + " FROM `" + tablename + "`"
    if where is not None:
        sql = sql + " WHERE " + where
    sql = sql + ";"
    getId = itemgetter(*[arguments.index(x) for x in ids])
    elements = dict()
    cursor.execute(sql)
    for record in cursor:
        elements[getId(record)] = Class(*record)
    return elements


# attach the objects in dict2 to dict1, given a 1-many relation between both
def merge(dict1, fieldname, dict2, ids):
    idExtractor = attrgetter(*ids)
    for d in dict1:
        setattr(dict1[d], fieldname, list())
    for d in dict2:
        dd = dict2[d]
        getattr(dict1[idExtractor(dd)], fieldname).append(dd)


# attach dict2 objects to dict1 objects, given a 1-1 relation
def attach(dict1, fieldname, dict2, ids):
    idExtractor = attrgetter(*ids)
    for d in dict1:
        dd = dict1[d]
        setattr(dd, fieldname, dict2[idExtractor(dd)])
It helped me speed up my querying; however, I am more than happy to hear from the experts about possible improvements to this approach.
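For illustration, hypothetical usage with the Foo/Bar example mentioned above (these class definitions are assumed, not from the original post; constructor arguments double as the selected column names, per the fetch() convention):

class Foo(object):
    def __init__(self, id, name):
        self.id = id
        self.name = name


class Bar(object):
    def __init__(self, id, foo_id, value):
        self.id = id
        self.foo_id = foo_id
        self.value = value


conn = MySQLdb.connect(user='scott', passwd='tiger', db='test')
cursor = conn.cursor()

foos = fetch(Foo, cursor, "foo")        # ~150k Foo objects keyed by id
bars = fetch(Bar, cursor, "bar")        # ~1M Bar objects keyed by id
merge(foos, "bars", bars, ["foo_id"])   # each foo.bars gets its list of Bars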
SQLAlchemy is complicated. It has to deal with converting types that the underlying database does not support natively into Python, tables with inheritance, JOINs, caching objects, maintaining consistency, translating rows, partial results, and whatnot. Check out sqlalchemy/orm/loading.py:instance_processor -- it's insane.
The solution would be to piece together and compile Python code to process the results of a specific query, like Jinja2 does for templates. So far, nobody has done this work, possibly because the common case is a couple of rows (where this kind of optimization would be pessimal) and people who need to process bulk data do that by hand, like you did.
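As a rough sketch of that idea (purely illustrative, not anything SQLAlchemy actually does), one could generate and exec a row-processing function tailored to one query's columns, reusing the plain Foo class and a DB-API cursor from the earlier examples:

def make_processor(cls, columns):
    # build source for a function that turns one result row into an
    # instance of `cls`, bypassing __init__ and any per-row bookkeeping
    lines = ["def process(row):",
             "    obj = cls.__new__(cls)"]
    for i, name in enumerate(columns):
        lines.append("    obj.%s = row[%d]" % (name, i))
    lines.append("    return obj")
    namespace = {"cls": cls}
    exec("\n".join(lines), namespace)
    return namespace["process"]

process = make_processor(Foo, ["a", "b", "c"])
cursor.execute("SELECT a, b, c FROM foo LIMIT 1000000")
foos = [process(row) for row in cursor.fetchall()]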