pyspark: sparse vectors to scipy sparse matrix

喜欢而已 提交于 2019-12-18 12:30:17

问题


I have a spark dataframe with a column of short sentences, and a column with a categorical variable. I'd like to perform tf-idf on the sentences, one-hot-encoding on the categorical variable and then output it to a sparse matrix on my driver once it's much smaller in size (for a scikit-learn model).

What is the best way to get the data out of spark in sparse form? It seems like there is only a toArray() method on sparse vectors, which outputs numpy arrays. However, the docs do say that scipy sparse arrays can be used in the place of spark sparse arrays.

Keep in mind also that the tf_idf values are in fact a column of sparse arrays. Ideally it would be nice to get all these features into one large sparse matrix.


回答1:


One possible solution can be expressed as follows:

  • convert features to RDD and extract vectors:

    from pyspark.ml.linalg import SparseVector
    from operator import attrgetter
    
    df = sc.parallelize([
        (SparseVector(3, [0, 2], [1.0, 3.0]), ),
        (SparseVector(3, [1], [4.0]), )
    ]).toDF(["features"])
    
    features = df.rdd.map(attrgetter("features"))
    
  • add row indices:

    indexed_features = features.zipWithIndex()
    
  • flatten to RDD of tuples (i, j, value):

    def explode(row):
        vec, i = row
        for j, v in zip(vec.indices, vec.values):
            yield i, j, v
    
    entries = indexed_features.flatMap(explode)
    
  • collect and reshape:

    row_indices, col_indices, data = zip(*entries.collect())
    
  • compute shape:

    shape = (
        df.count(),
        df.rdd.map(attrgetter("features")).first().size
    )
    
  • create sparse matrix:

    from scipy.sparse import csr_matrix
    
    mat = csr_matrix((data, (row_indices, col_indices)), shape=shape)
    
  • quick sanity check:

    mat.todense()
    

    With expected result:

    matrix([[ 1.,  0.,  3.],
            [ 0.,  4.,  0.]])
    

Another one:

  • convert each row of features to matrix:

    import numpy as np
    
    def as_matrix(vec):
        data, indices = vec.values, vec.indices
        shape = 1, vec.size
        return csr_matrix((data, indices, np.array([0, vec.values.size])), shape)
    
    mats = features.map(as_matrix)
    
  • and reduce with vstack:

    from scipy.sparse import vstack
    
    mat = mats.reduce(lambda x, y: vstack([x, y]))
    

    or collect and vstack

    mat = vstack(mats.collect())
    


来源:https://stackoverflow.com/questions/40557577/pyspark-sparse-vectors-to-scipy-sparse-matrix

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!