How can I insert common data into a temp table from disparate schemas?

前端 未结 2 1439
囚心锁ツ
囚心锁ツ 2021-01-25 21:22

I am not sure how to solve this problem:

We import order information from a variety of online vendors ( Amazon, Newegg etc ). Each vendor has their own specific termino

相关标签:
2条回答
  • 2021-01-25 21:48

    Sounds like you need to create an abstract class that will define the basics of interacting with the data, then derive a class per database schema you need to access. This will allow the core code to operate on a single object type, and each implementation can then specify the queries in a form specific to that database schema.

    Something like:

    public class Order
    {
        private String orderNumber;
        private BigDecimal orderTotal;
        ... etc ...
    }
    
    public abstract class AbstractOrderInformation
    {
      public abstract ArrayList<Order> getOrders();
      ...
    }
    

    with a Newegg class:

    public class NeweggOrderInformation extends AbstractOrderInformation
    {
       public ArrayList<Order> getOrders() {
          ... do the work of getting the newegg order
       }
     ...
    }
    

    Then you can have an arbitrarily large number of formats and when you need information, you can just iterate over all the implementations and get the Orders from each.

    0 讨论(0)
  • 2021-01-25 21:49

    First, you can create a VIEW to provide this functionality:

    CREATE VIEW orders AS
    SELECT '1'::int            AS source -- or any other tag to identify source
          ,"OrderNumber"::text AS order_nr
          ,"InvoiceNumber"     AS tansaction_id -- no cast .. is int already
          ,"OrderDate" AT TIME ZONE 'UTC' AS purchase_date -- !! see explanation
    FROM   tbl_newegg
    
    UNION  ALL  -- not UNION!
    SELECT 2
           "amazonOrderId"
          ,"merchant-order-id"
          ,"purchase-date"
    FROM   tbl_amazon;
    

    You can query this view like any other table:

    SELECT * FROM orders WHERE order_nr = 123 AND source = 2;
    
    • The source is necessary if the order_nr is not unique. How else would you guarantee unique order-numbers over different sources?

    • A timestamp without time zone is an ambiguous in a global context. It's only good in connection with its time zone. If you mix timestamp and timestamptz, you need to place the timestamp at a certain time zone with the AT TIME ZONE construct to make this work. For more explanation read this related answer.

      I use UTC as time zone, you might want to provide a different one. A simple cast "OrderDate"::timestamptz would assume your current time zone. AT TIME ZONE applied to a timestamp results in timestamptz. That's why I did not add another cast.

    • While you can, I advise not to use camel-case identifiers in PostgreSQL ever. Avoids many kinds of possible confusion. Note the lower case identifiers (without the now unnecessary double-quotes) I supplied.

    • Don't use varchar(25) as type for the order_nr. Just use text without arbitrary length modifier if it has to be a string. If all order numbers consist of digits exclusively, integer or bigint would be faster.

    Performance

    One way to make this fast would be to materialize the view. I.e., write the result into a (temporary) table:

    CREATE TEMP TABLE tmp_orders AS
    SELECT * FROM orders;
    
    ANALYZE tmp_orders; -- temp tables are not auto-analyzed!
    
    ALTER TABLE tmp_orders
    ADD constraint orders_pk PRIMARY KEY (order_nr, source);
    

    You need an index. In my example, the primary key constraint provides the index automatically.

    If your tables are big, make sure you have enough temporary buffers to handle this in RAM before you create the temp table. Else it will actually slow you down.

    SET temp_buffers = 1000MB;
    

    Has to be the first call to temp objects in your session. Don't set it high globally, just for your session. A temp table is dropped automatically at the end of your session anyway.

    To get an estimate how much RAM you need, create the table once and measure:

    SELECT pg_size_pretty(pg_total_relation_size('tmp_orders'));
    

    More on object sizes under this related question on dba.SE.

    All the overhead only pays if you have to process a number of queries within one session. For other use cases there are other solutions. If you know the source table at the time of the query, it would be much faster to direct your query to the source table instead. If you don't, I would question the uniqueness of your order_nr once more. If it is, in fact, guaranteed to be unique you can drop the column source I introduced.

    For only one or a few queries, it might be faster to use the view instead of the materialized view.

    I would also consider a plpgsql function that queries one table after the other until the record is found. Might be cheaper for a couple of queries, considering the overhead. Indexes for every table needed of course.

    Also, if you stick to text or varchar for your order_nr, consider COLLATE "C" for it.

    0 讨论(0)
提交回复
热议问题