问题
I would like append a new column on dataframe "df" from function get_distance
:
def get_distance(x, y):
dfDistPerc = hiveContext.sql("select column3 as column3, \
from tab \
where column1 = '" + x + "' \
and column2 = " + y + " \
limit 1")
result = dfDistPerc.select("column3").take(1)
return result
df = df.withColumn(
"distance",
lit(get_distance(df["column1"], df["column2"]))
)
But, I get this:
TypeError: 'Column' object is not callable
I think it happens because x and y are Column
objects and I need to be converted to String
to use in my query. Am I right? If so, how can I do this?
回答1:
You cannot use Python function on a
Column
objects directly, unless it is intended to operate onColumn
objects / expressions. You needudf
for that:@udf def get_distance(x, y): ...
But you cannot use
SQLContext
in udf (or mapper in general).Just
join
:tab = hiveContext.table("tab").groupBy("column1", "column2").agg(first("column3")) df.join(tab, ["column1", "column2"])
回答2:
Spark should know the function that you are using is not ordinary function but the UDF.
So, there are 2 ways by which we can use the UDF on dataframes.
Method-1: With @udf annotation
@udf
def get_distance(x, y):
dfDistPerc = hiveContext.sql("select column3 as column3, \
from tab \
where column1 = '" + x + "' \
and column2 = " + y + " \
limit 1")
result = dfDistPerc.select("column3").take(1)
return result
df = df.withColumn(
"distance",
lit(get_distance(df["column1"], df["column2"]))
)
Method-2: Regestering udf with pyspark.sql.functions.udf
def get_distance(x, y):
dfDistPerc = hiveContext.sql("select column3 as column3, \
from tab \
where column1 = '" + x + "' \
and column2 = " + y + " \
limit 1")
result = dfDistPerc.select("column3").take(1)
return result
calculate_distance_udf = udf(get_distance, IntegerType())
df = df.withColumn(
"distance",
lit(calculate_distance_udf(df["column1"], df["column2"]))
)
来源:https://stackoverflow.com/questions/48305443/typeerror-column-object-is-not-callable-using-withcolumn