Dask read_sql_table errors out when using an SQLAlchemy expression

没有蜡笔的小新 2021-01-23 12:55

I'm trying to use an SQLAlchemy expression with dask's read_sql_table in order to bring down a dataset that is created by joining and filtering a few different tables. The do…

3 Answers
  • 2021-01-23 13:02

    For any others that run across this question: read_sql_table does not seem to support this use case (at this time). If you pass in an SQLAlchemy Select object, it ends up wrapped in another SQLAlchemy Select without an alias, which is invalid SQL (at least for PostgreSQL).

    Looking at read_sql_table in the dask source, table is the Select object passed to read_sql_table, and as seen below, it gets wrapped in another select:

    q = sql.select(columns).where(sql.and_(index >= lower, cond)
                                  ).select_from(table)
    

    The good news is that the read_sql_table function is relatively straightforward, and the magic is really only a couple of lines that create a dataframe from delayed objects. You just need to write your own logic to break the query into chunks; one way to build those chunked queries is sketched after the snippet below.

    import pandas as pd
    from dask import delayed
    from dask.dataframe import from_delayed

    def _read_sql_chunk(q, uri, meta, **kwargs):
        # Read one chunked query with pandas; return the empty meta frame
        # for empty chunks so every partition carries the same dtypes.
        df = pd.read_sql(q, uri, **kwargs)
        if df.empty:
            return meta
        else:
            return df.astype(meta.dtypes.to_dict(), copy=False)

    # One delayed partition per chunked query, assembled into a dask dataframe.
    parts = []
    for query_chunk in queries:
        parts.append(delayed(_read_sql_chunk)(query_chunk, uri, meta, **kwargs))

    return from_delayed(parts, meta, divisions=divisions)
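
    The snippet above assumes you already have a list queries of per-partition SELECTs. A minimal sketch of one way to build that list, using made-up table names, column names, and index bounds (none of this comes from the dask source), could look like:

    from sqlalchemy import sql

    # Hypothetical tables/columns standing in for your own joined/filtered expression.
    a = sql.table("orders", sql.column("id"), sql.column("total"))
    b = sql.table("customers", sql.column("id"), sql.column("name"))
    base = (
        sql.select([a.c.id, a.c.total, b.c.name])
        .select_from(a.join(b, a.c.id == b.c.id))
        .alias("base")
    )

    idx = base.c.id
    lower, upper, npartitions = 0, 1_000_000, 10   # made-up index bounds for the sketch
    step = (upper - lower) // npartitions + 1

    # One bounded SELECT per partition; these become the `queries` iterated above.
    queries = [
        sql.select([base]).where(sql.and_(idx >= lo, idx < lo + step))
        for lo in range(lower, upper + 1, step)
    ]

    Aliasing the inner expression here also sidesteps the missing-alias issue described in the other answers.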
    
  • 2021-01-23 13:09

    The query sent on that line is auto-generated by SQLAlchemy, so the syntax ought to be correct. However, I notice that your original query includes a .limit() modifier. The purpose of the head = line is to fetch the first few rows in order to infer dtypes. If the original query already has a LIMIT clause, I can see how the two might conflict. Please try a query without .limit().
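
    For context, the dtype-inference step inside read_sql_table looks roughly like the following (a paraphrase of the dask source, not a verbatim quote; columns, head_rows, table and engine are names internal to that function):

    # dask layers its own LIMIT on top of the selectable you passed in as `table`,
    # which is where a user-supplied .limit() can clash.
    q = sql.select(columns).limit(head_rows).select_from(table)
    head = pd.read_sql(q, engine, **kwargs)
    meta = head.iloc[:0]  # empty frame that carries the inferred dtypes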

  • 2021-01-23 13:23

    As Chris said in a different answer, Dask wraps your query in something of the form SELECT columns FROM (yourquery), which is invalid syntax for PostgreSQL because it expects an alias for that parenthesised subquery. Without reimplementing the whole read_sql_table function, the expression can be aliased simply by adding .alias('somename') to your select, i.e.

    select([t]).limit(5).alias('foo')
    

    That expression, when wrapped by Dask, generates valid syntax for Postgres (a fuller end-to-end sketch follows below):

    SELECT columns FROM (yourquery) AS foo
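
    Putting it together, a minimal end-to-end sketch of the fix, with the read_sql_table API as it existed at the time of this answer (the connection string, table and column names here are made up, not from the question):

    import dask.dataframe as dd
    from sqlalchemy import sql

    uri = "postgresql://user:pass@host/dbname"   # hypothetical connection string

    # Lightweight (non-reflected) table definition with made-up names.
    t = sql.table("my_table", sql.column("id"), sql.column("value"))

    # Alias the selectable so dask's generated SELECT ... FROM (...) gets an AS clause.
    query = sql.select([t.c.id, t.c.value]).where(t.c.value > 0).alias("q")

    ddf = dd.read_sql_table(query, uri, index_col="id", npartitions=4)

    The index_col should be a column that exists in the aliased expression, since dask uses it to derive the partition boundaries.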
    