Mapping lots of similar tables in SQLAlchemy

心在旅途 2021-01-12 02:47

I have many (~2000) locations with time series data. Each time series has millions of rows. I would like to store these in a Postgres database. My current approach is to hav…

3 Answers
  • 2021-01-12 03:06

    Alternative-1: Table Partitioning

    Partitioning immediately comes to mind as soon as I read "exactly the same table structure." I am not a DBA and do not have much production experience with it (even less so on PostgreSQL), but please read the PostgreSQL - Partitioning documentation. Table partitioning seeks to solve exactly the problem you have, although over 1K tables/partitions sounds challenging; therefore please do more research on forums/SO for scalability-related questions on this topic.

    Given that the datetime component is central to both of your most frequently used search criteria, there must be a solid indexing strategy on it. If you decide to go down the partitioning route, the obvious partitioning strategy would be based on date ranges. This would allow you to partition older data into different chunks than the most recent data, especially since old data is (almost) never updated, so the physical layouts would stay dense and efficient, while you could employ another strategy for more "recent" data.
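
    As an illustration only, here is a minimal sketch of what such a date-range layout could look like, assuming PostgreSQL 11+ declarative partitioning (the measurements table, its columns, and the connection string are hypothetical names, not taken from your schema):

    from sqlalchemy import create_engine

    # Hypothetical DDL for one partitioned table replacing the ~2000 per-location tables.
    DDL = """
    CREATE TABLE measurements (
        location_id INTEGER          NOT NULL,
        datetime    TIMESTAMP        NOT NULL,
        value       DOUBLE PRECISION,
        PRIMARY KEY (location_id, datetime)
    ) PARTITION BY RANGE (datetime);

    -- one partition per year: old partitions stay dense and are (almost) never written to
    CREATE TABLE measurements_2001 PARTITION OF measurements
        FOR VALUES FROM ('2001-01-01') TO ('2002-01-01');
    CREATE TABLE measurements_2002 PARTITION OF measurements
        FOR VALUES FROM ('2002-01-01') TO ('2003-01-01');

    -- the datetime index supports both of the date-based queries
    CREATE INDEX ix_measurements_datetime ON measurements (datetime);
    """

    engine = create_engine('postgresql+psycopg2://user:password@localhost/mydb')  # hypothetical DSN
    with engine.begin() as conn:
        conn.exec_driver_sql(DDL)  # send the whole script through the DBAPI (SQLAlchemy 1.4+)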

    Alternative-2: trick SQLAlchemy

    This basically makes your sample code work by tricking SA into assuming that all those TimeSeries tables are children of one entity, using Concrete Table Inheritance. The code below is self-contained and creates 50 tables with minimal data in them. If you already have a database, it should let you check the performance rather quickly, so that you can decide whether this is even a realistic option.

    from datetime import date, datetime
    
    from sqlalchemy import create_engine, Column, String, Integer, DateTime, Float, ForeignKey, func
    from sqlalchemy.orm import sessionmaker, relationship, configure_mappers, joinedload
    from sqlalchemy.ext.declarative import declarative_base, declared_attr
    from sqlalchemy.ext.declarative import AbstractConcreteBase
    
    
    engine = create_engine('sqlite:///:memory:', echo=True)
    Session = sessionmaker(bind=engine)
    session = Session()
    Base = declarative_base()  # unbound Base; the engine is passed to create_all() below
    
    
    # MODEL
    class Location(Base):
        __tablename__ = 'locations'
        id = Column(Integer, primary_key=True)
        table_name = Column(String(50), unique=True)
        lon = Column(Float)
        lat = Column(Float)
    
    
    class TSBase(AbstractConcreteBase, Base):
        @declared_attr
        def table_name(cls):
            return Column(String(50), ForeignKey('locations.table_name'))
    
    
    def make_timeseries(name):
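        # build one concrete class per physical table; they all share the TSBase mapping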
        class TimeSeries(TSBase):
            __tablename__ = name
            __mapper_args__ = {'polymorphic_identity': name, 'concrete': True}
    
            datetime = Column(DateTime, primary_key=True)
            value = Column(Float)
    
            def __init__(self, datetime, value, table_name=name ):
                self.table_name = table_name
                self.datetime = datetime
                self.value = value
    
        return TimeSeries
    
    
    def _test_model():
        _NUM = 50
        # 0. generate classes for all tables
        TS_list = [make_timeseries('ts{}'.format(1+i)) for i in range(_NUM)]
        TS1, TS2, TS3 = TS_list[:3] # just to have some named ones
        Base.metadata.create_all(engine)
        print('-'*80)
    
        # 1. configure mappers
        configure_mappers()
    
        # 2. define relationship
        Location.timeseries = relationship(TSBase, lazy="dynamic")
        print('-'*80)
    
        # 3. add some test data
        session.add_all([Location(table_name='ts{}'.format(1+i), lat=5+i, lon=1+i*2)
            for i in range(_NUM)])
        session.commit()
        print('-'*80)
    
        session.add(TS1(datetime(2001,1,1,3), 999))
        session.add(TS1(datetime(2001,1,2,2), 1))
        session.add(TS2(datetime(2001,1,2,8), 33))
        session.add(TS2(datetime(2002,1,2,18,50), -555))
        session.add(TS3(datetime(2005,1,3,3,33), 8))
        session.commit()
    
    
        # Query-1: get all timeseries of one Location
        #qs = session.query(Location).first()
        qs = session.query(Location).filter(Location.table_name == "ts1").first()
        print(qs)
        print(qs.timeseries.all())
        assert 2 == len(qs.timeseries.all())
        print('-'*80)
    
    
        # Query-2: select all location with data between date-A and date-B
        dateA, dateB = date(2001,1,1), date(2003,12,31)
        qs = (session.query(Location)
                .join(TSBase, Location.timeseries)
                .filter(TSBase.datetime >= dateA)
                .filter(TSBase.datetime <= dateB)
                ).all()
        print(qs)
        assert 2 == len(qs)
        print('-'*80)
    
    
        # Query-3: select all data (including coordinates) for date A
        dateA = date(2001,1,1)
        qs = (session.query(Location.lat, Location.lon, TSBase.datetime, TSBase.value)
                .join(TSBase, Location.timeseries)
                .filter(func.date(TSBase.datetime) == dateA)
                ).all()
        print(qs)
        # @note: qs is list of tuples; easy export to CSV
        assert 1 == len(qs)
        print('-'*80)
    
    
    if __name__ == '__main__':
        _test_model()
    

    Alternative-3: à la BigData

    If you do run into performance problems using the database, I would probably try the following:

    • still keep the data in separate tables/databases/schemas like you do right now
    • bulk-import data using "native" solutions provided by your database engine
    • use MapReduce-like analysis.
      • Here I would stay with Python and SQLAlchemy and implement my own distributed query and aggregation (or find something existing); see the sketch after this list. This, obviously, only works if you are not required to produce those results directly on the database.
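
    A tiny sketch of that idea, keeping the "map" step in SQL per table and the "reduce" step in plain Python; the connection string and the chosen aggregates are placeholders, and the table names are the ts1..tsN ones from the code above:

    from concurrent.futures import ThreadPoolExecutor

    from sqlalchemy import create_engine, text

    engine = create_engine('postgresql+psycopg2://user:password@localhost/mydb')  # hypothetical DSN

    def map_one(table_name, date_a, date_b):
        # "map" step: let the database aggregate one per-location table
        query = text(
            f"SELECT count(*) AS n, avg(value) AS mean "
            f"FROM {table_name} WHERE datetime BETWEEN :a AND :b"
        )
        with engine.connect() as conn:
            return table_name, conn.execute(query, {"a": date_a, "b": date_b}).one()

    def reduce_all(table_names, date_a, date_b):
        # "reduce" step: combine the per-table aggregates client-side
        with ThreadPoolExecutor(max_workers=8) as pool:
            results = list(pool.map(lambda t: map_one(t, date_a, date_b), table_names))
        total_rows = sum(row.n for _, row in results)
        return total_rows, results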

    edit-1: Alternative-4: TimeSeries databases

    I have no experience using these at scale, but they are definitely an option worth considering.


    It would be fantastic if you could later share your findings and the whole decision-making process.

  • 2021-01-12 03:16

    Two parts:

    Only use two tables

    There's no need to have dozens or hundreds of identical tables. Just have a table for location and one for location_data, where every entry has a foreign key onto location. Also create an index on location_data for the location_id, so you have efficient searching.
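
    A minimal sketch of that layout (SQLAlchemy 1.4+ imports; the column and index names are mine, purely for illustration):

    from sqlalchemy import Column, DateTime, Float, ForeignKey, Index, Integer
    from sqlalchemy.orm import declarative_base, relationship

    Base = declarative_base()

    class Location(Base):
        __tablename__ = 'location'
        id = Column(Integer, primary_key=True)
        lon = Column(Float)
        lat = Column(Float)
        elevation = Column(Float)
        data = relationship('LocationData', backref='location')

    class LocationData(Base):
        __tablename__ = 'location_data'
        id = Column(Integer, primary_key=True)
        location_id = Column(Integer, ForeignKey('location.id'), nullable=False)
        datetime = Column(DateTime, nullable=False)
        value = Column(Float)

    # composite index so per-location (and per-location date range) lookups stay efficient
    Index('ix_location_data_location_id_datetime',
          LocationData.location_id, LocationData.datetime)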

    Don't use SQLAlchemy to create this

    I love SQLAlchemy. I use it every day. It's great for managing your database and adding some rows, but you don't want to use it for an initial setup that has millions of rows. You want to generate a file that is compatible with Postgres' COPY statement [ http://www.postgresql.org/docs/9.2/static/sql-copy.html ]. COPY will let you pull in a ton of data fast; it's what is used during dump/restore operations.

    SQLAlchemy will be great for querying this and adding rows as they come in; for bulk operations, you should use COPY (see the sketch below).
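
    A rough sketch of that bulk path, assuming psycopg2 as the driver and a location_data table like the one described above (the DSN and column names are illustrative):

    import csv
    import io

    import psycopg2  # assumed DBAPI driver; anything exposing COPY works

    def bulk_load(rows, dsn='dbname=mydb user=postgres'):
        # rows: iterable of (location_id, datetime, value) tuples
        buf = io.StringIO()
        csv.writer(buf).writerows(rows)
        buf.seek(0)

        # COPY streams the whole file in one round trip instead of issuing millions of INSERTs
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.copy_expert(
                "COPY location_data (location_id, datetime, value) "
                "FROM STDIN WITH (FORMAT csv)",
                buf,
            )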

  • 2021-01-12 03:26

    I would avoid the database design you mention above. I don't know enough about the data you are working with, but it sounds like you should have two tables: one table for location, and a child table for location_data. The location table would store the data you mention above, such as coordinates and elevation. The location_data table would store the location_id from the location table as well as the time series data you want to track.

    This would eliminate the database structure and code changes needed every time you add another location, and would still allow the kinds of queries you are looking to do.
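
    For example, the date-range query then becomes a single join instead of one query per table. A sketch, assuming Location and LocationData ORM models for those two tables live in a hypothetical models module:

    from datetime import date

    from sqlalchemy import create_engine
    from sqlalchemy.orm import Session

    from models import Location, LocationData  # hypothetical module with the two mapped classes

    engine = create_engine('postgresql+psycopg2://user:password@localhost/mydb')  # hypothetical DSN

    with Session(engine) as session:
        date_a, date_b = date(2001, 1, 1), date(2003, 12, 31)
        # all locations that have any data between date_a and date_b
        locations = (
            session.query(Location)
            .join(LocationData)
            .filter(LocationData.datetime >= date_a)
            .filter(LocationData.datetime <= date_b)
            .distinct()
            .all()
        )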
