I'm trying to use an SQLAlchemy expression with dask's read_sql_table in order to bring down a dataset that is created by joining and filtering a few different tables. The do
For anyone else who runs across this question: read_sql_table does not seem to support this use case (at this time). If you pass in an SQLAlchemy Select object, it ends up wrapped in another SQLAlchemy Select without an alias, which is bad SQL (at least for PostgreSQL).
Looking at read_sql_table in the dask source, table is the Select object that was passed to read_sql_table and, as seen here, it gets wrapped in another select:
q = sql.select(columns).where(
    sql.and_(index >= lower, cond)
).select_from(table)
The good news is that the read_sql_table function is relatively straightforward, and the magic is really only a couple of lines that create a dataframe from delayed objects. You just need to write your own logic to break the query into chunks:
# inside your own reader function
parts = []
for query_chunk in queries:
    # one delayed pandas read per chunk query
    parts.append(delayed(_read_sql_chunk)(query_chunk, uri, meta, **kwargs))
return from_delayed(parts, meta, divisions=divisions)

def _read_sql_chunk(q, uri, meta, **kwargs):
    df = pd.read_sql(q, uri, **kwargs)
    if df.empty:
        # keep the expected columns and dtypes for an empty chunk
        return meta
    else:
        return df.astype(meta.dtypes.to_dict(), copy=False)
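For the "break the query into chunks" part, here is a minimal sketch of one way to do it. It assumes an integer index column and plain SQL strings rather than SQLAlchemy expressions; chunk_queries, the users table, and the sub alias are hypothetical names made up for illustration, and sqlite3 is only used so the example runs on its own.

```python
import sqlite3


def chunk_queries(base_query, index_col, lower, upper, npartitions):
    """Build one query per contiguous range of an integer index column.

    Hypothetical helper: returns at most `npartitions` queries that together
    cover the closed range [lower, upper] without overlap.
    """
    span = upper - lower + 1
    size = -(-span // npartitions)  # ceiling division
    queries = []
    for start in range(lower, upper + 1, size):
        stop = min(start + size, upper + 1)
        # The subquery gets an alias, which PostgreSQL requires.
        # The bounds are trusted integers here, so plain formatting is used.
        queries.append(
            f"SELECT * FROM ({base_query}) AS sub "
            f"WHERE {index_col} >= {start} AND {index_col} < {stop}"
        )
    return queries


# Small in-memory table to check that the chunks cover every row once.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(i, f"user{i}") for i in range(1, 11)],
)

queries = chunk_queries("SELECT id, name FROM users", "id", 1, 10, 3)
rows_per_chunk = [len(conn.execute(q).fetchall()) for q in queries]
print(rows_per_chunk)
```

Each string in queries could then be handed to a delayed pandas read, with the range boundaries doubling as the divisions if the index is sorted.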
The query sent on that line is auto-generated by SQLAlchemy, so the syntax ought to be correct. However, I notice that your original query includes a .limit() modifier. The purpose of the head = line is to get the first few rows, to infer types. If the original query already has a limit clause, I can see that the two might conflict. Please try using a query without .limit().
As Chris said in a different answer, Dask wraps your query in something of the form SELECT columns FROM (yourquery), which is invalid syntax for PostgreSQL, because it expects an alias for that parenthesised expression. Without reimplementing the whole read_sql_table method, the expression can be aliased simply by adding .alias('somename') to your select, i.e.
select([t]).limit(5).alias('foo')
That expression, when wrapped by Dask, generates correct syntax for Postgres:
SELECT columns FROM (yourquery) AS foo
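If you want to check the generated SQL without touching a database, something like the following should work. It is only a sketch: it assumes SQLAlchemy 1.4 or later (where select() takes its columns positionally, unlike the older select([t]) form), the table t and its columns are made up, and the outer select only approximates what Dask builds.

```python
import sqlalchemy as sa
from sqlalchemy.dialects import postgresql

# Hypothetical table, used only for illustration.
metadata = sa.MetaData()
t = sa.Table(
    "t",
    metadata,
    sa.Column("id", sa.Integer, primary_key=True),
    sa.Column("name", sa.String),
)

# Aliasing the select turns it into a named subquery.
inner = sa.select(t).limit(5).alias("foo")

# Roughly what Dask does: select columns from the object it was given
# and add a filter on the index column.
outer = (
    sa.select(inner.c.id, inner.c.name)
    .where(inner.c.id >= 0)
    .select_from(inner)
)

# Render the SQL text for PostgreSQL; no database connection is needed.
compiled_sql = str(outer.compile(dialect=postgresql.dialect()))
print(compiled_sql)
```

The printed statement should show the inner select in parentheses followed by AS foo, which is the alias PostgreSQL is asking for.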