Spark dataframe to numpy array via udf or without collecting to driver
Question: The real-life df is a massive dataframe that cannot be loaded into driver memory. Can this be done using a regular or a pandas UDF?

    # Code to generate a sample dataframe
    from pyspark.sql import functions as F
    from pyspark.sql.types import *
    import pandas as pd
    import numpy as np

    sample = [['123', [[0,1,0,0,0,1,1,1,1,1,1,0,1,0,0,0,1,1,1,1,1,1],
                       [0,1,0,0,0,1,1,1,1,1,1,0,1,0,0,0,1,1,1,1,1,1]]],
              ['345', [[1,0,0,0,0,1,1,1,0,1,1,0,1,0,0,0,1,1,1,1,1,1],
                       [0,1,0,0,0,1,1,1,1,1,1,0,1,0,0,0,1,1,1,1,1,1]]],
              ['425',