Why are Pandas and GeoPandas able to read a database table using a psycopg2 connection but have to rely on SQLAlchemy to write one?

孤街醉人 提交于 2020-12-25 01:30:53

问题


Context

I just get into trouble while trying to do some I/O operations on some databases from a Python3 script.

When I want to connect to a database, I habitually use psycopg2 in order to handle the connections and cursors.

My data are usually stored as Pandas DataFrames and/or GeoPandas's equivalent GeoDataFrames.

Difficulties

In order to read data from a database table;

Using Pandas:

I can rely on its .read_sql() methods which takes as a parameter con, as stated in the doc:

con : SQLAlchemy connectable (engine/connection) or database str URI
        or DBAPI2 connection (fallback mode)'
        Using SQLAlchemy makes it possible to use any DB supported by that
        library. If a DBAPI2 object, only sqlite3 is supported. The user is responsible
        for engine disposal and connection closure for the SQLAlchemy connectable. See
        `here <https://docs.sqlalchemy.org/en/13/core/connections.html>`_

Using GeoPandas:

I can rely on its .read_postigs() methods which takes as a parameter con, as stated in the doc:

con : DB connection object or SQLAlchemy engine
        Active connection to the database to query.

In order to write data to a database table;

Using Pandas:

I can rely on the .to_sql() methods which takes as a parameter con, as stated in the doc:

con : sqlalchemy.engine.Engine or sqlite3.Connection
        Using SQLAlchemy makes it possible to use any DB supported by that
        library. Legacy support is provided for sqlite3.Connection objects. The user
        is responsible for engine disposal and connection closure for the SQLAlchemy
        connectable See `here                 <https://docs.sqlalchemy.org/en/13/core/connections.html>`_

Using GeoPandas:

I can rely on the .to_sql() methods (which directly relies on the Pandas .to_sql()) which takes as a parameter con, as stated in the doc:

con : sqlalchemy.engine.Engine or sqlite3.Connection
        Using SQLAlchemy makes it possible to use any DB supported by that
        library. Legacy support is provided for sqlite3.Connection objects. The user
        is responsible for engine disposal and connection closure for the SQLAlchemy
        connectable See `here                 <https://docs.sqlalchemy.org/en/13/core/connections.html>`_

From here, I easily understand that GeoPandas is built on Pandas especially for its GeoDataFrame object, which is, shortly, a special DataFrame that can handle geographic data.

But I'm wondering why do GeoPandas has the ability to directly takes a psycopg2 connection as an argument and not Pandas and if it is planned for the latter?

And why is it neither the case for one nor the other when it comes to writing data?
I would like (as probably many of others1,2) to directly give them a psycopg2 connections argument instead of relying on SQLAlchemy engine.
Because even is this tool is really great, it makes me use two different frameworks to connect to my database and thus handle two different connection strings (and I personally prefer the way psycopg2 handles the parameters expansion from a dictionary to build a connection string properly such as; psycopg2.connect(**dict_params) vs URL injection as explained here for example: Is it possible to pass a dictionary into create_engine function in SQLAlchemy?).

Workaround

  1. I was first creating my connection string with psycopg2 from a dictionary of parameters this way:

    connParams = ("user={}", "password={}", "host={}", "port={}", "dbname={}")
    conn = ' '.join(connParams).format(*dict_params.values())
    
  2. Then I figured out it was better and more pythonic this way:

    conn = psycopg2.connect(**dict_params)
    
  3. Which I finally replaced by this, so that I can interchangeably use it to build either a psycopg2 connections, or a SQLAlchemy engine:

    def connector():
        return psycopg2.connect(**dict_params)
    

    a) Initialize a psycopg2 connection is now done by:

    conn = connector()
    curs = conn.cursor()
    

    b) And initialize a SQLAlchemy engine by:

    engine = create_engine('postgresql+psycopg2://', creator=connector)
    

(or with any of your flavored db+driver)

This is well documented here:
https://docs.sqlalchemy.org/en/13/core/engines.html#custom-dbapi-args
and here:
https://docs.sqlalchemy.org/en/13/core/engines.html#sqlalchemy.create_engine


[1] Dataframe to sql without Sql Alchemy engine
[2] How to write data frame to Postgres table without using SQLAlchemy engine?


回答1:


Probably the main reason why to_sql needs a SQLAlchemy Connectable (Engine or Connection) object is that to_sql needs to be able to create the database table if it does not exist or if it needs to be replaced. Early versions of pandas worked exclusively with DBAPI connections, but I suspect that when they were adding new features to to_sql they found themselves writing a lot of database-specific code to work around the quirks of the various DDL implementations.

On realizing that they were duplicating a lot of logic that was already in SQLAlchemy they likely decided to "outsource' all of that complexity to SQLAlchemy itself by simply accepting an Engine/Connection object and using SQLAlchemy's (database-independent) SQL Expression language to create the table.



来源:https://stackoverflow.com/questions/60148599/why-are-pandas-and-geopandas-able-to-read-a-database-table-using-a-psycopg2-conn

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!