Combining low-latency streams with multiple meta-data streams in Flink (enrichment)

问题

I am evaluating Flink for a streaming analytics scenario and haven't found sufficient information on how to fulfil a kind of ETL setup we are doing in a legacy system today.

A very common scenario is that we have keyed, slow throughput, meta-data streams that we want to use for enrichment on high throughput data streams, something in the line of:

This raises two questions concerning Flink: How does one enrich a fast moving stream with slowly updating streams where the time windows overlap, but are not equal (Meta-data can live for days while data lives for minutes)? And how does one efficiently join multiple (up to 10) streams with Flink, say one data stream and nine different enrichment streams?

I am aware that I can fulfil my ETL scenario with non-windowed external ETL caches, for example with Redis (which is what we use today), but I wanted to see what possibilities Flink offers.

回答1:

Flink has several mechanisms that can be used for enrichment.

I'm going to assume that all of the streams share a common key that can be used to join the corresponding items.

The simplest approach is probably to use a RichFlatmap and load static enrichment data in its open() method (docs about rich functions). This is only suitable if the enrichment data is static, or if you are willing to restart the enrichment job whenever you want to update the enrichment data.

For the other approaches described below, you should store the enrichment data as managed, keyed state (see the docs about working with state in Flink). This will enable Flink to restore and resume your enrichment job in the case of failures.

Assuming you want to actually stream in the enrichment data, then a RichCoFlatmap is more appropriate. This is a stateful operator that can be used to merge or join two connected streams. However, with a RichCoFlatmap you have no ability to take the timing of the stream elements into account. If are concerned about one stream getting ahead of, or behind the other, for example, and want the enrichment to be performed in a repeatable, deterministic fashion, then using a CoProcessFunction is the right approach.

You will find a detailed example, plus code, in the Apache Flink training materials.

If you have many streams (e.g., 10) to join, you can cascade a series of these two-input CoProcessFunction operators, but that does become, admittedly, rather awkward at some point. An alternative would be to use a union operator to combine all of the meta-data streams together (note that this requires that all the streams have the same type), followed by a RichCoFlatmap or CoProcessFunction that joins this unified enrichment stream with the primary stream.

Update:

Flink's Table and SQL APIs can also be used for stream enrichment, and Flink 1.4 expands this support by adding streaming time-windowed inner joins. See Table API joins and SQL joins. For example:

SELECT *
FROM Orders o, Shipments s
WHERE o.id = s.orderId AND
  o.ordertime BETWEEN s.shiptime - INTERVAL '4' HOUR AND s.shiptime

This example joins orders with their corresponding shipments if the shipment occurred within 4 orders of the order being placed.

来源：https://stackoverflow.com/questions/47408322/combining-low-latency-streams-with-multiple-meta-data-streams-in-flink-enrichme

标签

apache-flink

flink-streaming