pyspark: sparse vectors to scipy sparse matrix


One possible solution can be expressed as the following steps (a consolidated helper is sketched after the list):

  • convert the features column to an RDD and extract the vectors:

    from pyspark.ml.linalg import SparseVector
    from operator import attrgetter
    
    df = sc.parallelize([
        (SparseVector(3, [0, 2], [1.0, 3.0]), ),
        (SparseVector(3, [1], [4.0]), )
    ]).toDF(["features"])
    
    features = df.rdd.map(attrgetter("features"))
    
  • add row indices:

    indexed_features = features.zipWithIndex()
    
  • flatten to RDD of tuples (i, j, value):

    def explode(row):
        vec, i = row
        for j, v in zip(vec.indices, vec.values):
            yield i, j, v
    
    entries = indexed_features.flatMap(explode)
    
  • collect and reshape:

    row_indices, col_indices, data = zip(*entries.collect())
    
  • compute shape:

    shape = (
        df.count(),             # number of rows
        features.first().size   # dimensionality of the vectors
    )
    
  • create sparse matrix:

    from scipy.sparse import csr_matrix
    
    mat = csr_matrix((data, (row_indices, col_indices)), shape=shape)
    
  • quick sanity check:

    mat.todense()

    with the expected result:

    matrix([[ 1.,  0.,  3.],
            [ 0.,  4.,  0.]])
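
The steps above can be folded into a single helper if you need this conversion in more than one place. A minimal sketch (the name to_scipy_csr is my own; it assumes the whole matrix fits in driver memory):

    from operator import attrgetter
    from scipy.sparse import csr_matrix

    def to_scipy_csr(df, column="features"):
        # collect a column of SparseVectors into one CSR matrix
        features = df.rdd.map(attrgetter(column))
        entries = features.zipWithIndex().flatMap(
            # zipWithIndex yields (vector, row_index) pairs
            lambda x: ((x[1], j, v) for j, v in zip(x[0].indices, x[0].values))
        )
        row_indices, col_indices, data = zip(*entries.collect())
        shape = (df.count(), features.first().size)
        return csr_matrix((data, (row_indices, col_indices)), shape=shape)

to_scipy_csr(df) on the example DataFrame returns the same 2x3 matrix as above.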
    

Another approach:

  • convert each row of features to a single-row matrix:

    import numpy as np
    from scipy.sparse import csr_matrix

    def as_matrix(vec):
        # one row holding all stored values, so indptr is [0, nnz]
        data, indices = vec.values, vec.indices
        shape = 1, vec.size
        return csr_matrix((data, indices, np.array([0, vec.values.size])), shape)
    
    mats = features.map(as_matrix)
    
  • and reduce with vstack (see the note after this list):

    from scipy.sparse import vstack
    
    mat = mats.reduce(lambda x, y: vstack([x, y]))
    

    or collect and vstack on the driver:

    mat = vstack(mats.collect())
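
A note on the reduce variant: each pairwise vstack allocates a new matrix, so a long chain of merges copies data repeatedly. pyspark's RDD.treeReduce performs the same merges as a balanced tree, which can help when there are many partitions; a minimal sketch under the same setup:

    # merge partial matrices in a balanced tree instead of one long chain
    mat = mats.treeReduce(lambda x, y: vstack([x, y]))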
    