Creating binned histograms in Spark

夕颜 2021-01-07 08:37

Suppose I have a DataFrame (df, in Pandas) or an RDD (in Spark) with the following two columns:

timestamp, data
12345.0    10 
12346.0    12

In Pa

2 answers
  • 2021-01-07 09:05

    In this particular case all you need is Unix timestamps and basic arithmetic:

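    from pyspark.sql.functions import floor, unix_timestamp
    
    def resample_to_minute(c, interval=1):
        # Floor the Unix timestamp (in seconds) to a multiple of the bin width
        t = 60 * interval
        return (floor(c / t) * t).cast("timestamp")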
    
    def resample_to_hour(c, interval=1):
        return resample_to_minute(c, 60 * interval)
    
    df = sc.parallelize([
        ("2000-01-01 00:00:00", 0), ("2000-01-01 00:01:00", 1),
        ("2000-01-01 00:02:00", 2), ("2000-01-01 00:03:00", 3),
        ("2000-01-01 00:04:00", 4), ("2000-01-01 00:05:00", 5),
        ("2000-01-01 00:06:00", 6), ("2000-01-01 00:07:00", 7),
        ("2000-01-01 00:08:00", 8)
    ]).toDF(["timestamp", "data"])
    
    (df.groupBy(resample_to_minute(unix_timestamp("timestamp"), 3).alias("ts"))
        .sum().orderBy("ts").show(3, False))
    
    ## +---------------------+---------+
    ## |ts                   |sum(data)|
    ## +---------------------+---------+
    ## |2000-01-01 00:00:00.0|3        |
    ## |2000-01-01 00:03:00.0|12       |
    ## |2000-01-01 00:06:00.0|21       |
    ## +---------------------+---------+
    
    (df.groupBy(resample_to_hour(unix_timestamp("timestamp")).alias("ts"))
        .sum().orderBy("ts").show(3, False))
    ## +---------------------+---------+
    ## |ts                   |sum(data)|
    ## +---------------------+---------+
    ## |2000-01-01 00:00:00.0|36       |
    ## +---------------------+---------+
    

    Example data from pandas.DataFrame.resample documentation.
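
    The bucketing above is plain integer arithmetic on epoch seconds, so it can be checked without a cluster. The following is a minimal plain-Python sketch and not part of the original answer: `floor_to_bin` and `binned_sums` are hypothetical helper names, and the sample timestamps are assumed to be in UTC.

```python
from collections import defaultdict
from datetime import datetime, timezone


def floor_to_bin(epoch_seconds, interval_minutes=1):
    """Return the start (in epoch seconds) of the bin containing epoch_seconds."""
    width = 60 * interval_minutes
    return (int(epoch_seconds) // width) * width


def binned_sums(rows, interval_minutes):
    """Sum the values of (epoch_seconds, value) rows per bin, ordered by bin start."""
    totals = defaultdict(int)
    for epoch_seconds, value in rows:
        totals[floor_to_bin(epoch_seconds, interval_minutes)] += value
    return dict(sorted(totals.items()))


# Nine one-minute readings starting at 2000-01-01 00:00:00 (assumed UTC),
# mirroring the sample data used with Spark above.
start = int(datetime(2000, 1, 1, tzinfo=timezone.utc).timestamp())
rows = [(start + 60 * k, k) for k in range(9)]

# Three-minute bins give sums of 3, 12 and 21, as in the Spark output.
sums = binned_sums(rows, 3)
for bin_start, total in sums.items():
    print(datetime.fromtimestamp(bin_start, tz=timezone.utc), total)
```

    Because the bin starts are multiples of the bin width counted from the Unix epoch, hourly bins produced this way line up with whole hours in UTC.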

    For the general case, see "Making histogram with Spark DataFrame column".

  • 2021-01-07 09:13

    Here is an answer using RDDs rather than DataFrames:

    # Generate some data to test with
    import random
    from datetime import datetime
    
    startTS = 12345.0
    array = [(startTS + 60 * k, random.randrange(10, 20)) for k in range(150)]
    
    # Initialize an RDD (sc is an existing SparkContext)
    rdd = sc.parallelize(array)
    
    # First map the timestamps to datetime objects so that datetime.replace
    # can be used to round the times. fromtimestamp() uses the local time zone.
    formattedRDD = (rdd
                    .map(lambda row: (datetime.fromtimestamp(int(row[0])), row[1]))
                    .cache())
    
    # Setting the minute and second fields of a datetime to zero truncates it
    # to the hour. reduceByKey then counts the records in each hourly bin.
    hourlyRDD = (formattedRDD
                 .map(lambda row: (row[0].replace(minute=0, second=0), 1))
                 .reduceByKey(lambda a, b: a + b))
    
    hourlyHisto = hourlyRDD.collect()
    print(hourlyHisto)
    ## Output (element order may vary; hours shown assume a UTC local time zone):
    ## [(datetime.datetime(1970, 1, 1, 4, 0), 60), (datetime.datetime(1970, 1, 1, 5, 0), 55), (datetime.datetime(1970, 1, 1, 3, 0), 35)]
    

    To compute daily aggregates, use time.date() instead of time.replace(...). To bin per hour starting from a date-time that does not fall on a round hour, shift the original time by the delta to the nearest round hour before truncating.
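
    Both variations can be tried in plain Python before moving them into the RDD map step. This is a rough sketch under a few assumptions: timestamps are interpreted as UTC, `hour_bin` is a hypothetical helper, and instead of shifting each time by a delta it floors the elapsed time since a chosen origin, which gives the same origin-aligned bin edges.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def hour_bin(t, origin):
    """Start of the one-hour bin containing t, with bin edges aligned to origin."""
    # timedelta // timedelta is floor division and yields an int
    whole_hours = (t - origin) // timedelta(hours=1)
    return origin + whole_hours * timedelta(hours=1)


# Same shape of data as above: 150 records, one per minute, from epoch second 12345
start_ts = 12345.0
times = [datetime.fromtimestamp(int(start_ts + 60 * k), tz=timezone.utc)
         for k in range(150)]

# Daily histogram: the key is the calendar date of each record
daily = Counter(t.date() for t in times)

# Hourly histogram whose bins start at the first record (03:25:45 UTC)
# rather than on the round hour
origin = times[0]
offset_hourly = Counter(hour_bin(t, origin) for t in times)
```

    In the RDD version, the same key functions would replace time.replace(minute=0, second=0) inside the map before reduceByKey.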
