PySpark: create dataframe from random uniform distribution


借酒劲吻你 2021-01-13 23:01

I am trying to create a dataframe from a random uniform distribution in Spark. I couldn't find anything on how to create a dataframe but when I read the documentation I foun

3 answers

南笙 (original poster)

2021-01-13 23:56

To generate a random dataframe with nrows rows and ncols columns, you could use the following functions:

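from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

def generate_random_uniform_df(nrows, ncols, seed=1):
    # One row per id; each column holds draws from the uniform distribution on [0, 1).
    # Offset the seed per column: reusing one seed makes every column identical.
    df = spark.range(nrows)
    df = df.select('*', *(F.rand(seed + target).alias("_" + str(target)) for target in range(ncols)))
    return df.drop("id")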

and

def generate_random_normal_df(nrows, ncols, seed=1):
    # Same construction, with F.randn drawing from the standard normal distribution.
    df = spark.range(nrows)
    df = df.select('*', *(F.randn(seed + target).alias("_" + str(target)) for target in range(ncols)))
    return df.drop("id")

for a standard normal distribution. However, the one suggested by eliasah

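from pyspark.mllib.random import RandomRDDs

def generate_random_uniform_df(nrows, ncols):
    # uniformVectorRDD yields one array of ncols draws from [0, 1) per row.
    rdd = RandomRDDs.uniformVectorRDD(spark.sparkContext, nrows, ncols)
    return rdd.map(lambda a: a.tolist()).toDF()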

seems to be much faster.
