Creating binned histograms in Spark


夕颜 2021-01-07 08:37

Suppose I have a dataframe (df) (Pandas) or RDD (Spark) with the following two columns:

timestamp, data
12345.0    10 
12346.0    12

In Pa

2 answers

隐瞒了意图╮ (OP)

2021-01-07 09:05

In this particular case all you need is Unix timestamps and basic arithmetic:

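from pyspark.sql.functions import floor, unix_timestamp

def resample_to_minute(c, interval=1):
    # Round the Unix timestamp (in seconds) down to a multiple of the bin width.
    t = 60 * interval
    return (floor(c / t) * t).cast("timestamp")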

def resample_to_hour(c, interval=1):
    return resample_to_minute(c, 60 * interval)

# sc is assumed to be an existing SparkContext (e.g. from the pyspark shell).
df = sc.parallelize([
    ("2000-01-01 00:00:00", 0), ("2000-01-01 00:01:00", 1),
    ("2000-01-01 00:02:00", 2), ("2000-01-01 00:03:00", 3),
    ("2000-01-01 00:04:00", 4), ("2000-01-01 00:05:00", 5),
    ("2000-01-01 00:06:00", 6), ("2000-01-01 00:07:00", 7),
    ("2000-01-01 00:08:00", 8)
]).toDF(["timestamp", "data"])

(df.groupBy(resample_to_minute(unix_timestamp("timestamp"), 3).alias("ts"))
    .sum().orderBy("ts").show(3, False))

## +---------------------+---------+
## |ts                   |sum(data)|
## +---------------------+---------+
## |2000-01-01 00:00:00.0|3        |
## |2000-01-01 00:03:00.0|12       |
## |2000-01-01 00:06:00.0|21       |
## +---------------------+---------+

(df.groupBy(resample_to_hour(unix_timestamp("timestamp")).alias("ts"))
    .sum().orderBy("ts").show(3, False))
## +---------------------+---------+
## |ts                   |sum(data)|
## +---------------------+---------+
## |2000-01-01 00:00:00.0|36       |
## +---------------------+---------+
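The binning arithmetic itself can be checked without a Spark session. The following is a minimal plain-Python sketch, not part of the original answer: `bin_start` and `resample_sum` are hypothetical helper names, and the timestamps are expressed as seconds from an arbitrary origin of 0 rather than real dates.

```python
import math
from collections import defaultdict

def bin_start(epoch_seconds, interval_minutes=1):
    """Return the start (in seconds) of the bin that contains epoch_seconds."""
    width = 60 * interval_minutes
    # Same idea as floor(c / t) * t in resample_to_minute.
    return math.floor(epoch_seconds / width) * width

def resample_sum(rows, interval_minutes=1):
    """Sum the values of (timestamp, value) rows per bin, ordered by bin start."""
    totals = defaultdict(int)
    for ts, value in rows:
        totals[bin_start(ts, interval_minutes)] += value
    return dict(sorted(totals.items()))

# Nine rows one minute apart, mirroring the example data above
# (illustrative timestamps: 0, 60, ..., 480 seconds).
rows = [(60 * i, i) for i in range(9)]

three_minute = resample_sum(rows, 3)
hourly = resample_sum(rows, 60)
print(three_minute)  # {0: 3, 180: 12, 360: 21}
print(hourly)        # {0: 36}
```

The per-bin sums match the `sum(data)` columns in the Spark output, which suggests that the grouping key is nothing more than the timestamp rounded down to a multiple of the bin width.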

Example data from pandas.DataFrame.resample documentation.

For the general case, see "Making histogram with Spark DataFrame column".
