PySpark: create dataframe from random uniform distribution

借酒劲吻你 2021-01-13 23:01

I am trying to create a dataframe using a random uniform distribution in Spark. I couldn't find anything on how to create such a dataframe, but when I read the documentation I found …

3 Answers
  •  南笙 (OP)
     2021-01-13 23:56

    To generate a random dataframe with nrows rows and ncols columns, you could use the following functions:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    def generate_random_uniform_df(nrows, ncols, seed=1):
        # one uniform random column per target column; offset the seed per
        # column, otherwise every column would contain identical values
        df = spark.range(nrows).select(F.col("id"))
        df = df.select('*', *(F.rand(seed + target).alias("_" + str(target)) for target in range(ncols)))
        return df.drop("id")
    

    and

    def generate_random_normal_df(nrows, ncols, seed=1):
        # same idea, but drawing from a standard normal distribution (randn)
        df = spark.range(nrows).select(F.col("id"))
        df = df.select('*', *(F.randn(seed + target).alias("_" + str(target)) for target in range(ncols)))
        return df.drop("id")
    

    for a standard normal distribution. However, the one suggested by eliasah

    from pyspark.mllib.random import RandomRDDs

    def generate_random_uniform_df(nrows, ncols):
        # build an RDD of uniform random vectors, then convert it to a DataFrame
        df = RandomRDDs.uniformVectorRDD(spark.sparkContext, nrows, ncols).map(lambda a: a.tolist()).toDF()
        return df
    

    seems to be much faster.
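
    For reference, a minimal usage sketch, assuming the SparkSession created above is available (the column names _0, _1, ... come from the aliases inside the functions):

    df = generate_random_uniform_df(nrows=5, ncols=3)
    df.show()
    # prints 5 rows with columns _0, _1, _2 holding uniform values in [0, 1)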
