I would like to know exactly how this works,
df = sqlContext.read \
    .format("org.apache.phoenix.spark") \
    .option("table", "TABLE") \
    .option("zkUrl", "") \
    .load()
specifically whether it loads the whole table or defers loading until it knows whether a filter will be applied.
In the first case, how can I tell Phoenix to filter the table before it is loaded into the Spark DataFrame?
Data is not loaded until you execute an action that requires it. Any filter applied in between, for example:
df.where($"foo" === "bar").count
will be pushed down by Spark where possible. You can check the result of predicate pushdown by running explain() on the filtered DataFrame and inspecting the physical plan.
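The filter snippet in the answer is Scala; since the question uses PySpark, here is a rough Python sketch of the same idea. The helper name `load_filtered` and its parameters are hypothetical, the column `foo` and value `bar` are taken from the answer's example, and the exact wording of the plan output depends on the Spark and Phoenix connector versions.

```python
def load_filtered(sqlContext, table, zk_url):
    """Hypothetical helper: build a lazy, filtered DataFrame over a Phoenix table."""
    df = (
        sqlContext.read
        .format("org.apache.phoenix.spark")
        .option("table", table)
        .option("zkUrl", zk_url)
        .load()  # builds the DataFrame; table rows are not read yet
    )
    # Still lazy: where() only adds a filter to the query plan.
    filtered = df.where("foo = 'bar'")
    # Prints the physical plan. Predicates that were pushed down usually
    # appear on the scan node rather than as a separate filter step.
    filtered.explain()
    return filtered
```

Calling `load_filtered(sqlContext, "TABLE", zk_url)` only prints the plan. Rows are fetched once an action such as `.count()` or `.collect()` runs on the returned DataFrame.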