问题
I have a DenseVector
RDD
like this
>>> frequencyDenseVectors.collect()
[DenseVector([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0]), DenseVector([1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]), DenseVector([1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]), DenseVector([0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0])]
I want to convert this into a Dataframe
. I tried like this
>>> spark.createDataFrame(frequencyDenseVectors, ['rawfeatures']).collect()
It gives an error like this
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/session.py", line 520, in createDataFrame
rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/session.py", line 360, in _createFromRDD
struct = self._inferSchema(rdd, samplingRatio)
File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/session.py", line 340, in _inferSchema
schema = _infer_schema(first)
File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/types.py", line 991, in _infer_schema
fields = [StructField(k, _infer_type(v), True) for k, v in items]
File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/types.py", line 968, in _infer_type
raise TypeError("not supported type: %s" % type(obj))
TypeError: not supported type: <type 'numpy.ndarray'>
old Solution
frequencyVectors.map(lambda vector: DenseVector(vector.toArray()))
Edit 1 - Code Reproducible
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, Row
from pyspark.sql.functions import split
from pyspark.ml.feature import CountVectorizer
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.mllib.linalg import SparseVector, DenseVector
sqlContext = SQLContext(sparkContext=spark.sparkContext, sparkSession=spark)
sc.setLogLevel('ERROR')
sentenceData = spark.createDataFrame([
(0, "Hi I heard about Spark"),
(0, "I wish Java could use case classes"),
(1, "Logistic regression models are neat")
], ["label", "sentence"])
sentenceData = sentenceData.withColumn("sentence", split("sentence", "\s+"))
sentenceData.show()
vectorizer = CountVectorizer(inputCol="sentence", outputCol="rawfeatures").fit(sentenceData)
countVectors = vectorizer.transform(sentenceData).select("label", "rawfeatures")
idf = IDF(inputCol="rawfeatures", outputCol="features")
idfModel = idf.fit(countVectors)
tfidf = idfModel.transform(countVectors).select("label", "features")
frequencyDenseVectors = tfidf.rdd.map(lambda vector: [vector[0],DenseVector(vector[1].toArray())])
frequencyDenseVectors.map(lambda x: (x, )).toDF(["rawfeatures"])
回答1:
You cannot convert RDD[Vector]
directly. It should be mapped to a RDD
of objects which can be interpreted as structs
, for example RDD[Tuple[Vector]]
:
frequencyDenseVectors.map(lambda x: (x, )).toDF(["rawfeatures"])
Otherwise Spark will try to convert object __dict__
and create use unsupported NumPy array as a field.
from pyspark.ml.linalg import DenseVector
from pyspark.sql.types import _infer_schema
v = DenseVector([1, 2, 3])
_infer_schema(v)
TypeError Traceback (most recent call last)
...
TypeError: not supported type: <class 'numpy.ndarray'>
vs.
_infer_schema((v, ))
StructType(List(StructField(_1,VectorUDT,true)))
Notes:
In Spark 2.0 you have to use correct local types:
pyspark.ml.linalg
when workingDataFrame
basedpyspark.ml
API.pyspark.mllib.linalg
when workingRDD
basedpyspark.mllib
API.
These two namespaces can no longer compatible and require explicit conversions (for example How to convert from org.apache.spark.mllib.linalg.VectorUDT to ml.linalg.VectorUDT).
Code provided in the edit is not equivalent to the one from the original question. You should be aware that
tuple
andlist
don't have the same semantics. If you map vector to pair usetuple
and convert directly toDataFrame
:tfidf.rdd.map( lambda row: (row[0], DenseVector(row[1].toArray())) ).toDF()
using
tuple
(product type) would work for nested structure as well but I doubt this is what you want:(tfidf.rdd .map(lambda row: (row[0], DenseVector(row[1].toArray()))) .map(lambda x: (x, )) .toDF())
list
at any other place than the top levelrow
is interpreted as anArrayType
.It is much cleaner to use an UDF for conversion (Spark Python: Standard scaler error "Do not support ... SparseVector").
回答2:
I believe the problem here is that createDataframe does not take denseVactor as argument Please try to convert denseVector into corresponding collection [i.e. Array or List]. In scala and java
toArray()
method is available you can convert the denseVector in array or list then try to create dataFrame.
来源:https://stackoverflow.com/questions/41328799/how-to-convert-rdd-of-dense-vector-into-dataframe-in-pyspark