Compute differences between successive records in Hadoop with Hive Queries

后悔当初 2021-01-12 04:26

I have a Hive table that holds customer call data. For simplicity, assume it has two columns: the first holds the customer ID and the second holds the call timestamp.

3 Answers
  • 2021-01-12 05:10

    You can use explicit MapReduce with another programming language such as Java or Python: emit {customer_id, call_time} from the mapper, so the reducer receives {customer_id, list<call_time>}; in the reducer you can sort these timestamps and process the data.

  • 2021-01-12 05:15

    It's an old question, but for future reference, here is another approach:

    Hive windowing functions let you reference previous/next row values in a query.

    The query may look like this:

    SELECT customer_id, call_time - LAG(call_time, 1, 0) OVER (PARTITION BY customer_id ORDER BY call_time) FROM mytable;
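    The semantics of `LAG` here (per-partition, ordered, defaulting to 0 on the partition's first row) can be sanity-checked off-cluster with a small Python sketch; the sample data is made up:

```python
from collections import defaultdict

def lag_deltas(rows):
    # rows: (customer_id, call_time) tuples in arbitrary order.
    # Emulates: call_time - LAG(call_time, 1, 0)
    #           OVER (PARTITION BY customer_id ORDER BY call_time)
    partitions = defaultdict(list)
    for customer_id, call_time in rows:
        partitions[customer_id].append(call_time)
    result = []
    for customer_id, times in partitions.items():
        times.sort()
        prev = 0  # LAG's default value for the first row of each partition
        for t in times:
            result.append((customer_id, t - prev))
            prev = t
    return result

rows = [("c1", 100), ("c2", 50), ("c1", 160), ("c2", 90)]
print(lag_deltas(rows))
```

    Note that with the default of 0, the first call of each customer yields its raw timestamp as the "difference"; filter those rows out if that is not wanted.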
    
  • 2021-01-12 05:23

    In case someone else has a similar requirement, here is the solution I found:

    1) Create a custom function:

    package com.example;
    
    // imports depend on the Hive version; these match the classic Hive UDTF API
    import java.util.ArrayList;
    import java.util.List;
    
    import org.apache.hadoop.hive.ql.exec.Description;
    import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
    import org.apache.hadoop.hive.ql.metadata.HiveException;
    import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
    import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.primitive.LongObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
    import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;
    import org.apache.hadoop.io.Text;
    
    @Description(name = "delta", value = "_FUNC_(customer id column, call time column) "
        + "- computes the time passed between two successive records from the same customer. "
        + "It generates 3 columns: the first contains the customer id, the second contains the call time "
        + "and the third contains the time passed since the previous call. This function returns only "
        + "the records that have a previous call from the same customer (the requirements are not applicable "
        + "to the first call)", extended = "Example:\n> SELECT _FUNC_(customer_id, call_time) AS "
        + "(customer_id, call_time, time_passed) FROM (SELECT customer_id, call_time FROM mytable "
        + "DISTRIBUTE BY customer_id SORT BY customer_id, call_time) t;")
    public class DeltaComputerUDTF extends GenericUDTF {
        private static final int NUM_COLS = 3;
    
        private Text[] retCols; // array of returned column values
        private ObjectInspector[] inputOIs; // input ObjectInspectors
        private String prevCustomerId;
        private Long prevCallTime;
    
        @Override
        public StructObjectInspector initialize(ObjectInspector[] ois) throws UDFArgumentException {
            if (ois.length != 2) {
                throw new UDFArgumentException(
                        "There must be 2 arguments: customer id column name and call time column name");
            }
    
            inputOIs = ois;
    
            // construct the output column data holders
            retCols = new Text[NUM_COLS];
            for (int i = 0; i < NUM_COLS; ++i) {
                retCols[i] = new Text();
            }
    
            // construct the output object inspector
            List<String> fieldNames = new ArrayList<String>(NUM_COLS);
            List<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>(NUM_COLS);
            for (int i = 0; i < NUM_COLS; ++i) {
                // the column names here are placeholders; the caller renames them
                // through the UDTF's AS clause
                fieldNames.add("c" + i);
                // all returned columns are of type Text
                fieldOIs.add(PrimitiveObjectInspectorFactory.writableStringObjectInspector);
            }
    
            return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
        }
    
        @Override
        public void process(Object[] args) throws HiveException {
            String customerId = ((StringObjectInspector) inputOIs[0]).getPrimitiveJavaObject(args[0]);
            long callTime = ((LongObjectInspector) inputOIs[1]).get(args[1]);
    
            // emit a row only if the previous row belongs to the same customer
            if (customerId.equals(prevCustomerId)) {
                retCols[0].set(customerId);
                retCols[1].set(Long.toString(callTime));
                retCols[2].set(Long.toString(callTime - prevCallTime));
                forward(retCols);
            }
    
            // store the current customer data for the next row
            prevCustomerId = customerId;
            prevCallTime = callTime;
        }
    
        @Override
        public void close() throws HiveException {
            // nothing to clean up
        }
    }
    
    

    2) Create a jar containing this function. Suppose the jar is named myjar.jar.

    3) Copy the jar to the machine where Hive runs. Suppose it is placed in /tmp.

    4) Define the custom function inside Hive:

    ADD JAR /tmp/myjar.jar;
    CREATE TEMPORARY FUNCTION delta AS 'com.example.DeltaComputerUDTF';
    

    5) Execute the query:

    SELECT delta(customer_id, call_time) AS (customer_id, call_time, time_difference) FROM 
      (SELECT customer_id, call_time FROM mytable DISTRIBUTE BY customer_id SORT BY customer_id, call_time) t;
    

    Remarks:

    a. I assumed that the call_time column stores data as bigint. If it is a string, retrieve it in the process function as a string (as we do with customerId) and then parse it to Long.

    b. I decided to use a UDTF instead of a UDF because this way it generates only the data it needs. Otherwise (with a UDF) the generated output has to be filtered to skip NULL values. So, with the UDF function (DeltaComputerUDF) described in the first edit of the original post, the query would be:

    SELECT customer_id, call_time, time_difference 
    FROM 
      (
        SELECT delta(customer_id, call_time) AS (customer_id, call_time, time_difference) 
        FROM 
          (
             SELECT customer_id, call_time FROM mytable
             DISTRIBUTE BY customer_id
             SORT BY customer_id, call_time
           ) t
       ) u 
    WHERE time_difference IS NOT NULL;
    

    c. Both functions (UDF and UDTF) work as desired regardless of the order of rows inside the table (so there is no requirement that the table data be sorted by customer id and call time before applying the delta functions).
