I am trying to create a dataframe using a random uniform distribution in Spark. I couldn't find anything on how to create a dataframe but when I read the documentation I foun
To generate a random dataframe with nrows rows and ncols columns, you could use the following functions:
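    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    # Reuse the active session, or start one if none exists.
    spark = SparkSession.builder.getOrCreate()

    def generate_random_uniform_df(nrows, ncols, seed=1):
        # Each column gets its own seed; reusing one seed would make every column identical.
        df = spark.range(nrows)
        df = df.select('*', *(F.rand(seed + target).alias("_" + str(target)) for target in range(ncols)))
        return df.drop("id")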
and
    def generate_random_normal_df(nrows, ncols, seed=1):
        # Same approach with randn: standard normal samples, one seed per column.
        df = spark.range(nrows)
        df = df.select('*', *(F.randn(seed + target).alias("_" + str(target)) for target in range(ncols)))
        return df.drop("id")
for a standard normal distribution. However, the one suggested by eliasah
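    from pyspark.mllib.random import RandomRDDs

    def generate_random_uniform_df(nrows, ncols):
        # Build an RDD of i.i.d. U(0, 1) vectors, turn each into a list, then convert to a DataFrame.
        df = RandomRDDs.uniformVectorRDD(spark.sparkContext, nrows, ncols).map(lambda a: a.tolist()).toDF()
        return df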
seems to be much faster.
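To check the speed difference on your own cluster, a rough timing helper along these lines might be useful. It is only a sketch: `time_materialization` and its `repeats` parameter are hypothetical names, not part of Spark. It uses `foreach` rather than `count()` to force evaluation, on the assumption that a bare `count()` may let the optimizer skip the random columns altogether.

```python
import time


def time_materialization(make_df, repeats=3):
    """Return the best wall-clock time in seconds to build and fully evaluate a DataFrame.

    make_df is any zero-argument callable returning an object with a
    foreach method, such as a PySpark DataFrame.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        df = make_df()
        # foreach visits every row with all of its columns,
        # so the random values really have to be generated.
        df.foreach(lambda row: None)
        timings.append(time.perf_counter() - start)
    return min(timings)


# Example call (assumes a running SparkSession and one of the generator functions):
# print(time_materialization(lambda: generate_random_normal_df(100000, 10)))
```

Note that the two uniform generators share the same name, so to compare them, bind each one to a different name before timing. The numbers also include the Python overhead of `foreach`, which is the same for both variants.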