Question
I recently started experimenting with both Spark and Java. I initially went through the famous WordCount example using RDDs, and everything went as expected. Now I am trying to implement my own example, but using DataFrames instead of RDDs.
So I am reading a dataset from a file with
DataFrame df = sqlContext.read()
        .format("com.databricks.spark.csv")
        .option("inferSchema", "true")
        .option("delimiter", ";")
        .option("header", "true")
        .load(inputFilePath);
and then I try to select a specific column and apply a simple transformation to every row, like this:
df = df.select("start")
        .map(text -> text + "asd");
But compilation fails on the second line, which I don't fully understand (the start column is inferred as type string).
Multiple non-overriding abstract methods found in interface scala.Function1
Why is my lambda function treated as a Scala function and what does the error message actually mean?
Answer 1:
If you use the select function on a DataFrame you get a DataFrame back. Your function is then applied to the Row datatype, not to the value inside the row, so you need to extract the value first:
df.select("start").map(el->el.getString(0)+"asd")
But you will get an RDD back as the return value, not a DataFrame.
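For context, the error in the question typically appears because on a Spark 1.x DataFrame the map method is a Scala method taking scala.Function1, which a Java lambda cannot implement directly (Function1 exposes several abstract methods to the Java compiler). A minimal Java sketch of the same idea, assuming Spark 1.6 and the df from the question, would switch to the Java RDD API first:

// Select the column, move to the Java-friendly RDD API, then transform each value
JavaRDD<String> result = df.select("start")
        .javaRDD()                               // JavaRDD<Row>
        .map(row -> row.getString(0) + "asd");   // extract the string value, then append

(org.apache.spark.api.java.JavaRDD and org.apache.spark.sql.Row would need to be imported.)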
Answer 2:
I use concat to achieve this:
df.withColumn('start', concat(col('start'), lit('asd')))
Since you're mapping the same text twice, I'm not sure if you're also looking to replace the first part of the string, but if you are, I would do:
df.withColumn('start', concat(
    when(col('start') == 'text', lit('new'))
    .otherwise(col('start')),
    lit('asd')
))
This solution scales better on big data, as it concatenates two columns instead of iterating over values.
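Since the question uses the Java API, a rough Java equivalent of this column-based approach (a sketch, assuming a Spark version where the concat helper in org.apache.spark.sql.functions is available, i.e. 1.5+) would be:

import static org.apache.spark.sql.functions.*;

// Append "asd" to every value of the start column without leaving the DataFrame API
df = df.withColumn("start", concat(col("start"), lit("asd")));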
Source: https://stackoverflow.com/questions/42561084/trying-to-use-map-on-a-spark-dataframe