Question
System: Spark 1.3.0 (Anaconda Python dist.) on Cloudera Quickstart VM 5.4
Here's a Spark DataFrame:
from pyspark.sql import SQLContext
from pyspark.sql.types import *
sqlContext = SQLContext(sc)
data = sc.parallelize([('Foo', 41, 'US', 3),
                       ('Foo', 39, 'UK', 1),
                       ('Bar', 57, 'CA', 2),
                       ('Bar', 72, 'CA', 3),
                       ('Baz', 22, 'US', 6),
                       (None, 75, None, 7)])
schema = StructType([StructField('Name', StringType(), True),
                     StructField('Age', IntegerType(), True),
                     StructField('Country', StringType(), True),
                     StructField('Score', IntegerType(), True)])
df = sqlContext.createDataFrame(data, schema)
df.show()
Name Age Country Score
Foo 41 US 3
Foo 39 UK 1
Bar 57 CA 2
Bar 72 CA 3
Baz 22 US 6
null 75 null 7
However, neither of these works:
df.dropna()
df.na.drop()
I get this message:
>>> df.show()
Name Age Country Score
Foo 41 US 3
Foo 39 UK 1
Bar 57 CA 2
Bar 72 CA 3
Baz 22 US 6
null 75 null 7
>>> df.dropna().show()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 580, in __getattr__
jc = self._jdf.apply(name)
File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o50.apply.
: org.apache.spark.sql.AnalysisException: Cannot resolve column name "dropna" among (Name, Age, Country, Score);
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:162)
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:162)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:161)
at org.apache.spark.sql.DataFrame.col(DataFrame.scala:436)
at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:426)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
Has anybody else experienced this problem? What's the workaround? PySpark seems to think that I am looking for a column called "na". Any help would be appreciated!
Answer 1:
tl;dr The methods na and dropna are only available since Spark 1.3.1.
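A quick way to confirm whether a given installation is affected is to compare its version string against 1.3.1. The sketch below is only an illustration: parse_version and supports_dropna are hypothetical helper names, not part of PySpark, and the code assumes the version string begins with dotted integers (in a PySpark shell the string would come from sc.version).

```python
import re

def parse_version(version):
    """Turn a version string such as '1.3.0' or '1.3.1-cdh5.4.0' into a tuple of ints."""
    # Keep only the leading dotted-integer part; any vendor suffix is ignored.
    match = re.match(r"(\d+(?:\.\d+)*)", version)
    if match is None:
        raise ValueError("unrecognized version string: %r" % version)
    return tuple(int(part) for part in match.group(1).split("."))

def supports_dropna(version):
    """Hypothetical helper: True if DataFrame.dropna / DataFrame.na should exist."""
    return parse_version(version) >= (1, 3, 1)

# Plain strings are used here so the snippet runs without a cluster;
# in a PySpark shell one would pass sc.version instead.
print(supports_dropna("1.3.0"))  # False
print(supports_dropna("1.3.1"))  # True
```

On the setup described in the question (Spark 1.3.0), this check returns False, which matches the "Cannot resolve column name" error.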
A few mistakes you made:
1. data = sc.parallelize([....('',75,'', 7 )]): you intended '' to represent None, but '' is just an empty string, not null.
2. na and dropna are both methods of the DataFrame class, so you should call them on your df.
Runnable Code:
# Requires Spark 1.3.1 or later; sc and sqlContext are defined as in the question.
from pyspark.sql.types import *

data = sc.parallelize([('Foo', 41, 'US', 3),
                       ('Foo', 39, 'UK', 1),
                       ('Bar', 57, 'CA', 2),
                       ('Bar', 72, 'CA', 3),
                       ('Baz', 22, 'US', 6),
                       (None, 75, None, 7)])
schema = StructType([StructField('Name', StringType(), True),
                     StructField('Age', IntegerType(), True),
                     StructField('Country', StringType(), True),
                     StructField('Score', IntegerType(), True)])
df = sqlContext.createDataFrame(data, schema)
df.dropna().show()
df.na.drop().show()
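If upgrading past Spark 1.3.0 is not an option, one possible workaround is to drop incomplete records before the DataFrame is built, using an ordinary Python predicate. This is a hedged sketch rather than the answerer's method: has_no_nulls is a hypothetical helper name, and a plain list stands in for the RDD so the snippet runs without Spark.

```python
def has_no_nulls(row):
    """Hypothetical predicate: True when no field in the tuple is None."""
    return all(field is not None for field in row)

rows = [('Foo', 41, 'US', 3),
        ('Foo', 39, 'UK', 1),
        ('Bar', 57, 'CA', 2),
        ('Bar', 72, 'CA', 3),
        ('Baz', 22, 'US', 6),
        (None, 75, None, 7)]

# Plain-Python stand-in for the Spark call; with a live SparkContext the
# equivalent would be: clean = sc.parallelize(rows).filter(has_no_nulls)
clean_rows = [row for row in rows if has_no_nulls(row)]

for row in clean_rows:
    print(row)
```

With Spark, the filtered RDD could then be passed to sqlContext.createDataFrame together with the same schema as before.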
Answer 2:
I realize the question was asked a year ago, but I am leaving a Scala solution below in case someone lands here looking for the same thing.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val data = sc.parallelize(List(
  ("Foo", 41, "US", 3), ("Foo", 39, "UK", 1),
  ("Bar", 57, "CA", 2), ("Bar", 72, "CA", 3),
  ("Baz", 22, "US", 6), (None, 75, None, 7)))
val schema = StructType(Array(
  StructField("Name", StringType, true),
  StructField("Age", IntegerType, true),
  StructField("Country", StringType, true),
  StructField("Score", IntegerType, true)))
val dat = data.map(d => Row(d._1, d._2, d._3, d._4))
val df = sqlContext.createDataFrame(dat, schema)
df.na.drop()
Note: The above solution still fails to give the right result in Scala; I am not sure what differs between the Scala and Python bindings. na.drop works if the missing data is represented as null. It fails for "" and None. One alternative is to use the withColumn function to handle missing values of different forms.
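The point about different representations of missing data applies on the Python side as well: an empty string is a real value, so only None ends up as a SQL null. A minimal sketch of normalizing the data up front follows. blank_to_none and MISSING_MARKERS are hypothetical names, and which strings count as "missing" is an assumption to adapt to the data set.

```python
# Assumed set of string markers to treat as missing; adjust per data set.
MISSING_MARKERS = {'', 'NA', 'null'}

def blank_to_none(row):
    """Replace missing-value markers in a tuple with None, leaving other fields alone."""
    return tuple(
        None if isinstance(field, str) and field.strip() in MISSING_MARKERS else field
        for field in row
    )

rows = [('Foo', 41, 'US', 3), ('', 75, '', 7)]
normalized = [blank_to_none(row) for row in rows]
print(normalized)
```

With Spark, the same function could be applied with data.map(blank_to_none) before calling createDataFrame, so that a later null-dropping step sees those fields as null.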
Source: https://stackoverflow.com/questions/30253550/why-does-dropna-not-work