Compute differences between successive records in Hadoop with Hive Queries

后悔当初 2021-01-12 04:26

I have a Hive table that holds customer call data. For simplicity, assume it has two columns: the first holds the customer ID and the second holds the call timestamp.

3 Answers
  • 2021-01-12 05:10

    You can use explicit MapReduce with another programming language such as Java or Python: emit {customer_id, call_time} from the mapper, so the reducer receives {customer_id, list<call_time>}; in the reducer you can sort these timestamps and process the data.

  • 2021-01-12 05:15

    It's an old question, but for future reference, here is another approach:

    Hive windowing functions let you reference previous/next row values in a query.

    The query may look like this:

    SELECT customer_id, call_time - LAG(call_time, 1, 0) OVER (PARTITION BY customer_id ORDER BY call_time) FROM mytable;
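    The semantics of `LAG` here (per-partition, ordered, defaulting to 0 on the partition's first row) can be sanity-checked off-cluster with a small Python sketch; the sample data is made up:

```python
from collections import defaultdict

def lag_deltas(rows):
    # rows: (customer_id, call_time) tuples in arbitrary order.
    # Emulates: call_time - LAG(call_time, 1, 0)
    #           OVER (PARTITION BY customer_id ORDER BY call_time)
    partitions = defaultdict(list)
    for customer_id, call_time in rows:
        partitions[customer_id].append(call_time)
    result = []
    for customer_id, times in partitions.items():
        times.sort()
        prev = 0  # LAG's default value for the first row of each partition
        for t in times:
            result.append((customer_id, t - prev))
            prev = t
    return result

rows = [("c1", 100), ("c2", 50), ("c1", 160), ("c2", 90)]
print(lag_deltas(rows))
```

    Note that with the default of 0, the first call of each customer yields its raw timestamp as the "difference"; filter those rows out if that is not wanted.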
    
  • 2021-01-12 05:23

    In case someone else has a similar requirement, here is the solution I found:

    1) Create a custom function:

    package com.example;
    
    // imports depend on the Hive version; these match the classic Hive UDTF API
    import java.util.ArrayList;
    import java.util.List;
    
    import org.apache.hadoop.hive.ql.exec.Description;
    import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
    import org.apache.hadoop.hive.ql.metadata.HiveException;
    import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
    import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.primitive.LongObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
    import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;
    import org.apache.hadoop.io.Text;
    
    @Description(name = "delta", value = "_FUNC_(customer id column, call time column) "
        + "- computes the time passed between two successive records from the same customer. "
        + "It generates 3 columns: the first contains the customer id, the second contains the call time "
        + "and the third contains the time passed since the previous call. This function returns only "
        + "the records that have a previous call from the same customer (the requirements are not applicable "
        + "to the first call)", extended = "Example:\n> SELECT _FUNC_(customer_id, call_time) AS "
        + "(customer_id, call_time, time_passed) FROM (SELECT customer_id, call_time FROM mytable "
        + "DISTRIBUTE BY customer_id SORT BY customer_id, call_time) t;")
    public class DeltaComputerUDTF extends GenericUDTF {
        private static final int NUM_COLS = 3;
    
        private Text[] retCols; // array of returned column values
        private ObjectInspector[] inputOIs; // input ObjectInspectors
        private String prevCustomerId;
        private Long prevCallTime;
    
        @Override
        public StructObjectInspector initialize(ObjectInspector[] ois) throws UDFArgumentException {
            if (ois.length != 2) {
                throw new UDFArgumentException(
                        "There must be 2 arguments: customer id column name and call time column name");
            }
    
            inputOIs = ois;
    
            // construct the output column data holders
            retCols = new Text[NUM_COLS];
            for (int i = 0; i < NUM_COLS; ++i) {
                retCols[i] = new Text();
            }
    
            // construct the output object inspector
            List<String> fieldNames = new ArrayList<String>(NUM_COLS);
            List<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>(NUM_COLS);
            for (int i = 0; i < NUM_COLS; ++i) {
                // the column names here are placeholders; the caller renames them
                // through the UDTF's AS clause
                fieldNames.add("c" + i);
                // all returned columns are of type Text
                fieldOIs.add(PrimitiveObjectInspectorFactory.writableStringObjectInspector);
            }
    
            return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
        }
    
        @Override
        public void process(Object[] args) throws HiveException {
            String customerId = ((StringObjectInspector) inputOIs[0]).getPrimitiveJavaObject(args[0]);
            long callTime = ((LongObjectInspector) inputOIs[1]).get(args[1]);
    
            // emit a row only if the previous row belongs to the same customer
            if (customerId.equals(prevCustomerId)) {
                retCols[0].set(customerId);
                retCols[1].set(Long.toString(callTime));
                retCols[2].set(Long.toString(callTime - prevCallTime));
                forward(retCols);
            }
    
            // store the current customer data for the next row
            prevCustomerId = customerId;
            prevCallTime = callTime;
        }
    
        @Override
        public void close() throws HiveException {
            // nothing to clean up
        }
    }
    
    

    2) Create a jar containing this function. Suppose the jar is named myjar.jar.

    3) Copy the jar to the machine where Hive runs. Suppose it is placed in /tmp.

    4) Define the custom function inside Hive:

    ADD JAR /tmp/myjar.jar;
    CREATE TEMPORARY FUNCTION delta AS 'com.example.DeltaComputerUDTF';
    

    5) Execute the query:

    SELECT delta(customer_id, call_time) AS (customer_id, call_time, time_difference) FROM 
      (SELECT customer_id, call_time FROM mytable DISTRIBUTE BY customer_id SORT BY customer_id, call_time) t;
    

    Remarks:

    a. I assumed that the call_time column stores data as bigint. If it is a string, retrieve it in the process function as a string (as we do with customerId) and then parse it to Long.

    b. I decided to use a UDTF instead of a UDF because this way it generates only the data it needs. Otherwise (with a UDF) the generated output has to be filtered to skip NULL values. So, with the UDF function (DeltaComputerUDF) described in the first edit of the original post, the query would be:

    SELECT customer_id, call_time, time_difference 
    FROM 
      (
        SELECT delta(customer_id, call_time) AS (customer_id, call_time, time_difference) 
        FROM 
          (
             SELECT customer_id, call_time FROM mytable
             DISTRIBUTE BY customer_id
             SORT BY customer_id, call_time
           ) t
       ) u 
    WHERE time_difference IS NOT NULL;
    

    c. Both functions (UDF and UDTF) work as desired regardless of the order of rows inside the table (so there is no requirement that the table data be sorted by customer id and call time before applying the delta functions).
