Question
I recently started experimenting with both Spark and Java. I initially went through the famous WordCount example using RDDs, and everything went as expected. Now I am trying to implement my own example, but using DataFrames instead of RDDs.
So I am reading a dataset from a file with
DataFrame df = sqlContext.read()
        .format("com.databricks.spark.csv")
        .option("inferSchema", "true")
        .option("delimiter", ";")
        .option("header", "true")
        .load(inputFilePath);
and then I try to select a specific column and apply a simple transformation to every row, like this:
df = df.select("start")
        .map(text -> text + "asd");
But compilation fails on the second line, which I don't fully understand (the start column is inferred as type string).
Multiple non-overriding abstract methods found in interface scala.Function1
Why is my lambda function treated as a Scala function and what does the error message actually mean?
Answer 1:
If you use the select function on a DataFrame you get a DataFrame back. Your function is then applied to the Row datatype, not to the value inside the row, so you need to extract the value first:
df.select("start").map(el->el.getString(0)+"asd")
But you will get an RDD back as the return value, not a DataFrame.
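For context, the error in the question typically appears because on a Spark 1.x DataFrame the map method is a Scala method taking scala.Function1, which a Java lambda cannot implement directly (Function1 exposes several abstract methods to the Java compiler). A minimal Java sketch of the same idea, assuming Spark 1.6 and the df from the question, would switch to the Java RDD API first:

// Select the column, move to the Java-friendly RDD API, then transform each value
JavaRDD<String> result = df.select("start")
        .javaRDD()                               // JavaRDD<Row>
        .map(row -> row.getString(0) + "asd");   // extract the string value, then append

(org.apache.spark.api.java.JavaRDD and org.apache.spark.sql.Row would need to be imported.)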
Answer 2:
I use concat to achieve this:
df.withColumn('start', concat(col('start'), lit('asd')))
Since you're mapping the same text twice, I'm not sure if you're also looking to replace the first part of the string, but if you are, I would do:
df.withColumn('start', concat(
    when(col('start') == 'text', lit('new'))
    .otherwise(col('start')),
    lit('asd')
))
This solution scales better on big data, as it concatenates two columns instead of iterating over values.
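Since the question uses the Java API, a rough Java equivalent of this column-based approach (a sketch, assuming a Spark version where the concat helper in org.apache.spark.sql.functions is available, i.e. 1.5+) would be:

import static org.apache.spark.sql.functions.*;

// Append "asd" to every value of the start column without leaving the DataFrame API
df = df.withColumn("start", concat(col("start"), lit("asd")));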
Source: https://stackoverflow.com/questions/42561084/trying-to-use-map-on-a-spark-dataframe