pyspark: sparse vectors to scipy sparse matrix
问题 I have a spark dataframe with a column of short sentences, and a column with a categorical variable. I'd like to perform tf-idf on the sentences, one-hot-encoding on the categorical variable and then output it to a sparse matrix on my driver once it's much smaller in size (for a scikit-learn model). What is the best way to get the data out of spark in sparse form? It seems like there is only a toArray() method on sparse vectors, which outputs numpy arrays. However, the docs do say that scipy