I want to filter a DataFrame
using a condition related to the length of a column, this question might be very easy but I didn\'t find any related question in th
@AlbertoBonsanto : below code filters based on array size:
val input = Seq(("a1,a2,a3,a4,a5"), ("a1,a2,a3,a4"), ("a1,a2,a3"), ("a1,a2"), ("a1"))
val df = sc.parallelize(input).toDF("tokens")
val tokensArrayDf = df.withColumn("tokens", split($"tokens", ","))
tokensArrayDf.show
+--------------------+
| tokens|
+--------------------+
|[a1, a2, a3, a4, a5]|
| [a1, a2, a3, a4]|
| [a1, a2, a3]|
| [a1, a2]|
| [a1]|
+--------------------+
tokensArrayDf.filter(size($"tokens") > 3).show
+--------------------+
| tokens|
+--------------------+
|[a1, a2, a3, a4, a5]|
| [a1, a2, a3, a4]|
+--------------------+
Here is an example for String in scala:
val stringData = Seq(("Maheswara"), ("Mokshith"))
val df = sc.parallelize(stringData).toDF
df.where((length($"value")) <= 8).show
+--------+
| value|
+--------+
|Mokshith|
+--------+
df.withColumn("length", length($"value")).show
+---------+------+
| value|length|
+---------+------+
|Maheswara| 9|
| Mokshith| 8|
+---------+------+
In Spark >= 1.5 you can use size function:
from pyspark.sql.functions import col, size
df = sqlContext.createDataFrame([
(["L", "S", "Y", "S"], ),
(["L", "V", "I", "S"], ),
(["I", "A", "N", "A"], ),
(["I", "L", "S", "A"], ),
(["E", "N", "N", "Y"], ),
(["E", "I", "M", "A"], ),
(["O", "A", "N", "A"], ),
(["S", "U", "S"], )],
("tokens", ))
df.where(size(col("tokens")) <= 3).show()
## +---------+
## | tokens|
## +---------+
## |[S, U, S]|
## +---------+
In Spark < 1.5 an UDF should do the trick:
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf
size_ = udf(lambda xs: len(xs), IntegerType())
df.where(size_(col("tokens")) <= 3).show()
## +---------+
## | tokens|
## +---------+
## |[S, U, S]|
## +---------+
If you use HiveContext
then size
UDF with raw SQL should work with any version:
df.registerTempTable("df")
sqlContext.sql("SELECT * FROM df WHERE size(tokens) <= 3").show()
## +--------------------+
## | tokens|
## +--------------------+
## |ArrayBuffer(S, U, S)|
## +--------------------+
For string columns you can either use an udf
defined above or length
function:
from pyspark.sql.functions import length
df = sqlContext.createDataFrame([("fooo", ), ("bar", )], ("k", ))
df.where(length(col("k")) <= 3).show()
## +---+
## | k|
## +---+
## |bar|
## +---+