Distributed for loop over columns of a PySpark DataFrame
Context: My company is on Spark 2.2, so it's not possible to use `pandas_udf` for distributed column processing.

I have dataframes that contain thousands of columns (features) and millions of records:

```python
df = spark.createDataFrame(
    [(1, "AB", 100, 200, 1), (2, "AC", 150, 200, 2), (3, "AD", 80, 150, 0)],
    ["Id", "Region", "Salary", "HouseHoldIncome", "NumChild"],
)
```

I would like to compute certain summaries and statistics on each column in a parallel manner, and I wonder what the best way to achieve this is.
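To make the goal concrete, here is a sketch of the kind of per-column statistics I have in mind, written against Spark 2.2 APIs only (the column names match the toy dataframe above; the choice of statistics is illustrative). Building all aggregate expressions up front and submitting them in a single `agg()` call lets Spark evaluate the whole plan in parallel across the cluster, rather than looping on the driver, but I am not sure this is the best approach when there are thousands of columns:

```python
from pyspark.sql import functions as F

# Collect the numeric columns from the dataframe's schema.
numeric_cols = [c for c, t in df.dtypes
                if t in ("int", "bigint", "float", "double")]

# Build one set of aggregate expressions per column up front.
agg_exprs = []
for c in numeric_cols:
    agg_exprs += [
        F.mean(c).alias(c + "_mean"),
        F.stddev(c).alias(c + "_stddev"),
        F.min(c).alias(c + "_min"),
        F.max(c).alias(c + "_max"),
    ]

# A single agg() call: Spark computes all statistics in one
# distributed pass over the data instead of one job per column.
stats = df.agg(*agg_exprs)
stats.show()
```

With thousands of columns this produces a very wide single-row result, so part of my question is whether one giant `agg()` like this scales, or whether the work should be split up differently.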