Finding the difference of two columns in Spark dataframes and appending to a new column

Submitted 2019-12-23 03:09:13

Question


Below is my code for loading CSV data into a dataframe, computing the difference of two columns, and appending it as a new column using withColumn. The two columns I am subtracting are both of type Double. Please help me figure out the following exception:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

/**
  * Created by Guest1 on 5/10/2017.
  */
object arith extends App {
  Logger.getLogger("org").setLevel(Level.ERROR)
  Logger.getLogger("akka").setLevel(Level.ERROR)

  val spark = SparkSession.builder().appName("Arithmetics").
                config("spark.master", "local").getOrCreate()
  val df = spark.read.option("header", "true")
                  .option("inferSchema", "true")
                  .csv("./Input/Arith.csv").persist()

//  df.printSchema()
val sim =df("Average Total Payments") -df("Average Medicare Payments").show(5)
}

I am getting the following exception:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot resolve column name "Average Total Payments" among (DRG Definition, Provider Id, Provider Name, Provider Street Address, Provider City, Provider State, Provider Zip Code, Hospital Referral Region Description,  Total Discharges ,  Average Covered Charges ,  Average Total Payments , Average Medicare Payments);
    at org.apache.spark.sql.Dataset$$anonfun$resolve$1.apply(Dataset.scala:219)
    at org.apache.spark.sql.Dataset$$anonfun$resolve$1.apply(Dataset.scala:219)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.Dataset.resolve(Dataset.scala:218)
    at org.apache.spark.sql.Dataset.col(Dataset.scala:1073)
    at org.apache.spark.sql.Dataset.apply(Dataset.scala:1059)
    at arith$.delayedEndpoint$arith$1(arith.scala:19)
    at arith$delayedInit$body.apply(arith.scala:7)
    at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
    at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
    at scala.App$class.main(App.scala:76)
    at arith$.main(arith.scala:7)
    at arith.main(arith.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)

Answer 1:


There are multiple issues here.

First, if you look at the exception, it tells you that there is no "Average Total Payments" column in the dataframe (it also helpfully lists the columns it does see). The column name read from the CSV header has extra spaces around it, so the exact-match lookup fails.
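The mismatch is just a string comparison. A minimal sketch (the padded header below mirrors the name shown in the exception message; this is plain Scala, no Spark needed):

```scala
object HeaderMismatch {
  // Header as it appears in the exception, with stray leading/trailing spaces
  val csvHeader = " Average Total Payments "
  // Name used in the code
  val requested = "Average Total Payments"

  def main(args: Array[String]): Unit = {
    // Spark's column resolution compares names exactly, so this fails:
    println(csvHeader == requested)      // false
    // Trimming the header makes the names line up:
    println(csvHeader.trim == requested) // true
  }
}
```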

Second, df("Average Total Payments") and df("Average Medicare Payments") are Column objects.

You are trying to call show on df("Average Medicare Payments"). show is not a member of Column (and on a dataframe it returns Unit, so you couldn't write df("Average Total Payments") - df("Average Medicare Payments").show(5) anyway, because that would be Column - Unit).

What you want to do is define a new column that is the difference between the two and add it to the dataframe with withColumn. Then select just that column and show it. For example:

val sim = df.withColumn("diff", df("Average Total Payments") - df("Average Medicare Payments"))
sim.select("diff").show(5)
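To avoid carrying the padded names around, the headers can be trimmed once, right after loading. In Spark this would be a fold over df.columns calling withColumnRenamed (that call is the standard Spark API; the snippet below simulates the dataframe with a plain Map so the rename logic can run without a SparkSession):

```scala
object TrimAllHeaders {
  // Stand-in for a dataframe: column name -> column values (hypothetical data)
  val df = Map(
    " Average Total Payments " -> Seq(7200.0, 5100.0),
    "Average Medicare Payments" -> Seq(6100.0, 4300.0)
  )

  // Fold over the raw names, renaming each column to its trimmed form.
  // Spark equivalent:
  //   df.columns.foldLeft(df)((d, c) => d.withColumnRenamed(c, c.trim))
  val cleaned: Map[String, Seq[Double]] =
    df.keys.foldLeft(df)((d, c) => (d - c) + (c.trim -> d(c)))

  def main(args: Array[String]): Unit =
    println(cleaned.keys.toList.sorted)
}
```

After the rename, df("Average Total Payments") resolves without the extra spaces.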


Source: https://stackoverflow.com/questions/43917836/finding-the-difference-of-two-columns-in-spark-dataframes-and-appending-to-a-new
