Writing Efficient Queries in SAS Using PROC SQL with Teradata

[愿得一人] 2021-02-06 11:58

EDIT: Here is a more complete set of code that shows exactly what's going on per the answer below.

libname output '/data/files/jeff';
%let DateStart = '01Jan

5 Answers
  • 2021-02-06 12:21

    If the id is unique, you might add a UNIQUE PRIMARY INDEX(id) to that table; otherwise it defaults to a non-unique PI. Knowing about uniqueness helps the optimizer produce a better plan.

    Without more info, like an EXPLAIN (just put EXPLAIN in front of the SELECT), it's hard to tell how this can be improved.
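
    For illustration, here is a hypothetical sketch (mydb and the columns are placeholders, not from the question): the PI is declared when the table is created, and EXPLAIN is simply prefixed to the query:

    CREATE TABLE mydb.bigTable (
       id        INTEGER NOT NULL,
       somevalue VARCHAR(20)
    )
    UNIQUE PRIMARY INDEX (id);

    /* prefix any SELECT with EXPLAIN to see the optimizer's plan */
    EXPLAIN
    SELECT id, somevalue
    FROM   mydb.bigTable
    WHERE  id = 12345;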

  • 2021-02-06 12:26

    The most critical thing to understand when using SAS to access data in Teradata (or any other external database, for that matter) is that the SAS software prepares SQL and submits it to the database. The idea is to relieve you (the user) of all the database-specific details. SAS does this using a concept called "implicit pass-through", which just means that SAS translates SAS code into DBMS code. Among the many things that occur is data type conversion: SAS has two (and only two) data types, numeric and character.
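
    For example, here is a minimal sketch of implicit pass-through (the server, credentials, and table names are placeholders): you connect once with a LIBNAME statement, write ordinary SAS SQL, and SAS generates the Teradata SQL behind the scenes:

    libname td teradata server=tdprod user=userid password=password;

    proc sql;
       create table work.west_customers as
       select customer_id, customer_name
       from td.customers
       where region = 'WEST';  /* SAS translates this into Teradata SQL */
    quit;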

    SAS deals with translating things for you, but it can be confusing. For example, I've seen "lazy" database tables defined with VARCHAR(400) columns holding values that never come close to that length (like a column for a person's name). In the database this isn't much of a problem, but since SAS does not have a VARCHAR data type, it creates a variable 400 characters wide for every row. Even with data set compression, this can make the resulting SAS dataset unnecessarily large.
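
    One workaround worth knowing is the SAS/ACCESS DBSASTYPE= data set option (a sketch; it assumes name is the oversized column), which tells SAS to read the column at a realistic width:

    data work.customers;
       /* read the VARCHAR(400) column as a 50-byte SAS character variable */
       set td.customers(dbsastype=(name='CHAR(50)'));
    run;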

    The alternative is to use "explicit pass-through", where you write native queries using the actual syntax of the DBMS in question. These queries execute entirely on the DBMS and return results back to SAS (which still does the data type conversion for you). For example, here is a "pass-through" query that joins two tables and creates a SAS dataset as a result:

    proc sql;
       connect to teradata (user=userid password=password mode=teradata);
       create table mydata as
       select * from connection to teradata (
          select a.customer_id
               , a.customer_name
               , b.last_payment_date
               , b.last_payment_amt
          from base.customers a
          join base.invoices b
            on a.customer_id = b.customer_id
          where b.bill_month = date '2013-07-01'
            and b.paid_flag = 'N'
          );
       disconnect from teradata;
    quit;
    

    Notice that everything inside the pair of parentheses is native Teradata SQL and that the join operation itself is running inside the database.

    The example code you have shown in your question is NOT a complete, working example of a SAS/Teradata program. To better assist, you need to show the real program, including any library references. For example, suppose your real program looks like this:

    proc sql;
       CREATE TABLE subset_data AS
       SELECT bigTable.id,
              SUM(bigTable.value) AS total
       FROM   TDATA.bigTable bigTable
       JOIN   TDATA.subset subset
       ON     subset.id = bigTable.id
       WHERE  bigTable.date BETWEEN a AND b
       GROUP BY bigTable.id
       ;
    quit;
    

    That would indicate a previously assigned LIBNAME statement through which SAS was connecting to Teradata. The syntax of that WHERE clause would be very relevant to whether SAS is even able to pass the complete query to Teradata. (Your example doesn't show what "a" and "b" refer to.) It is very possible that the only way SAS can perform the join is to drag both tables back into a local work session and perform the join on your SAS server.
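
    An easy way to check is to turn on SQL tracing before running the query; the SAS log will then show the SQL that SAS generated and whether it was passed to Teradata (these are standard SAS/ACCESS tracing options, though the exact log text varies by version):

    options sastrace=',,,d' sastraceloc=saslog nostsuffix;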

    One thing I can strongly suggest is that you try to convince your Teradata administrators to allow you to create "driver" tables in some utility database. The idea is that you would create a relatively small table inside Teradata containing the IDs you want to extract, then use that table to perform explicit joins. I'm sure you would need a bit more formal database training to do that (like how to define a proper index and how to "collect statistics"), but with that knowledge and ability, your work will just fly.
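
    As a sketch of that idea (utildb is a hypothetical utility database your DBA would grant you; the table and column names are assumptions), the driver table is built, indexed, and analyzed entirely inside Teradata with explicit pass-through:

    proc sql;
       connect to teradata (user=userid password=password mode=teradata);
       /* build a small driver table of the wanted ids inside Teradata */
       execute (
          create table utildb.driver_ids as (
             select distinct id
             from   base.subset
          ) with data
          unique primary index (id)
       ) by teradata;
       /* give the optimizer statistics on the join key */
       execute (collect statistics on utildb.driver_ids column (id)) by teradata;
       disconnect from teradata;
    quit;

    A later pass-through query can then join the big table to utildb.driver_ids inside the database, so only the aggregated result travels back to SAS.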

    I could go on and on but I'll stop here. I use SAS with Teradata extensively every day against what I'm told is one of the largest Teradata environments on the planet. I enjoy programming in both.

  • 2021-02-06 12:27

    One alternate solution is to use SAS procedures. I don't know what your actual SQL is doing, but if you're just doing frequencies (or something else that can be done in a PROC), you could do:

    proc sql;
    create view blah as select ... (your join);
    quit;

    proc freq data=blah;
    tables id / out=summary(keep=id count rename=(count=total));
    run;
    

    Or any number of other options (PROC MEANS, PROC TABULATE, etc.). That may be faster than doing the sum in SQL, depending on details such as how your data is organized, what you're actually doing, and how much memory you have available. It has the added benefit that SAS might choose to run the step in-database, if you create the view in the database, which might be faster. (In fact, it might be even faster to run the FREQ against the base table alone and then join the results to the smaller table.)
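
    As a sketch of the in-database variant (assuming your site licenses SAS In-Database processing for Teradata; the SQLGENERATION system option controls it, and td.blah stands for the view created through a Teradata libref):

    options sqlgeneration=dbms;  /* let supported procs run inside the DBMS */

    proc freq data=td.blah;
    tables id / out=summary(keep=id count rename=(count=total));
    run;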

  • 2021-02-06 12:32

    You seem to be assuming that the 90k records in your first query are all unique ids. Is that definite?

    I ask because your second query implies that they're not unique: one id can have multiple rows over time, with different somevalues.

    If the ids are not unique in the first dataset, you need to GROUP BY id or use DISTINCT in the first query.

    Imagine that the 90k rows consist of 30k unique ids, an average of 3 rows per id.

    And then imagine those 30k unique ids actually have 9 records in your time window, including rows where somevalue <> x.

    You will then get 3 × 9 = 27 records back per id.

    And as those two numbers grow, the number of records in your second query grows with their product.


    Alternative Query

    If that's not the problem, an alternative query (which is not ideal, but possible) would be...

    SELECT
      bigTable.id,
      SUM(bigTable.value) AS total
    FROM
      bigTable
    WHERE
      bigTable.date BETWEEN a AND b
    GROUP BY
      bigTable.id
    HAVING
      /* keep only ids that have at least one row with somevalue = x */
      MAX(CASE WHEN bigTable.somevalue = x THEN 1 ELSE 0 END) = 1
    
  • 2021-02-06 12:37

    If ID is unique and a single value, then you can try constructing a format.

    Create a dataset that looks like this:

    fmtname, start, label

    where fmtname is the same for all records and is a legal format name (begins and ends with a letter, contains alphanumerics or underscores); start is the ID value; and label is 1. Then add one row with the same fmtname, a blank start, a label of 0, and another variable, hlo='o' (for 'other'). Then import the dataset into PROC FORMAT using the CNTLIN= option, and you now have a 1/0 lookup format.

    Here's a brief example using SASHELP.CLASS. ID here is name, but it can be numeric or character - whichever is right for your use.

    data for_fmt;
    set sashelp.class;
    retain fmtname '$IDF'; *Format name is up to you.  Should have $ if ID is character, no $ if numeric;
    start=name; *this would be your ID variable - the look up;
    label='1';
    output;
    if _n_ = 1 then do;
      hlo='o';
      call missing(start);
      label='0';
      output;
    end;
    run;
    proc format cntlin=for_fmt;
    quit;
    

    Now, instead of doing a join, you can run your query 'normally' but with an additional WHERE condition: and put(id, $IDF.) = '1'. This won't be optimized with an index or anything, but it may be faster than the join. (It may also not be faster; it depends on how the SQL optimizer is working.)
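
    For example, a sketch following the question's earlier query (the local table and date literals are assumptions, and the $ format presumes a character id; a PUT() in the WHERE clause generally keeps the query from being passed through to Teradata, so the table is read locally):

    proc sql;
       create table subset_data as
       select id,
              sum(value) as total
       from   work.bigTable               /* a local copy of the big table */
       where  date between '01Jul2013'd and '31Jul2013'd
         and  put(id, $IDF.) = '1'        /* format lookup replaces the join */
       group by id;
    quit;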
