How can I retrieve the alias for a DataFrame in Spark?


I'm using Spark 2.0.2. I have a DataFrame that has an alias on it, and I'd like to be able to retrieve that. A simplified example of why I'd want that is below.

3 Answers

無奈伤痛

2021-01-15 17:45

Disclaimer: as stated above, this code relies on undocumented APIs subject to change. It works as of Spark 2.3.

After much digging into mostly undocumented Spark methods, here is the full code to pull the list of fields, along with the table alias for a dataframe in PySpark:

def schema_from_plan(df):
    plan = df._jdf.queryExecution().analyzed()
    all_fields = _schema_from_plan(plan)

    iterator = plan.output().iterator()
    output_fields = {}
    while iterator.hasNext():
        field = iterator.next()
        # fields that were not found under any SubqueryAlias get an empty alias
        tablealias = all_fields.get(field.exprId().id(), {}).get("tablealias", "")
        output_fields[field.exprId().id()] = {
            "tablealias": tablealias,
            "dataType": field.dataType().typeName(),
            "name": field.name()
        }
    return list(output_fields.values())

def _schema_from_plan(root, tablealias=None, fields=None):
    # a None default avoids sharing one dict across separate calls
    if fields is None:
        fields = {}
    children = root.children().iterator()
    while children.hasNext():
        node = children.next()
        nodeClass = node.getClass().getSimpleName()
        if nodeClass == "SubqueryAlias":
            # get the alias and process the subnodes with this alias
            _schema_from_plan(node, node.alias(), fields)
        else:
            if tablealias:
                # add all the fields, along with the unique IDs, and a new tablealias field
                # (separate iterator name, so the loop over children is not cut short)
                outputs = node.output().iterator()
                while outputs.hasNext():
                    field = outputs.next()
                    fields[field.exprId().id()] = {
                        "tablealias": tablealias,
                        "dataType": field.dataType().typeName(),
                        "name": field.name()
                    }
            _schema_from_plan(node, tablealias, fields)
    return fields

# example: fields = schema_from_plan(df)
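
To see how the recursion above tags fields without needing a Spark session, here is a minimal pure-Python sketch that swaps the Catalyst plan for a toy tree. The `Node` class, `collect_aliases`, and the sample plan are hypothetical stand-ins, not Spark APIs; only the traversal order is meant to mirror `_schema_from_plan`.

```python
# Toy stand-ins for Catalyst plan nodes. These classes are hypothetical and
# exist only to illustrate the traversal; they are not part of Spark.
class Node:
    def __init__(self, kind, children=(), output=(), alias=None):
        self.kind = kind                # e.g. "SubqueryAlias", "Join", "Relation"
        self.children = list(children)
        self.output = list(output)      # list of (expr_id, name) pairs
        self.alias = alias              # only set on "SubqueryAlias" nodes


def collect_aliases(root, alias=None, fields=None):
    """Tag each field emitted below `root` with the alias in effect there."""
    if fields is None:                  # fresh dict per top-level call
        fields = {}
    for node in root.children:
        if node.kind == "SubqueryAlias":
            # descend with the alias carried by this node
            collect_aliases(node, node.alias, fields)
        else:
            if alias:
                for expr_id, name in node.output:
                    fields[expr_id] = {"tablealias": alias, "name": name}
            collect_aliases(node, alias, fields)
    return fields


# A join of two aliased relations, roughly df1.alias("a").join(df2.alias("b"))
left = Node("SubqueryAlias", alias="a",
            children=[Node("Relation", output=[(1, "id"), (2, "name")])])
right = Node("SubqueryAlias", alias="b",
             children=[Node("Relation", output=[(3, "id")])])
plan = Node("Join", children=[left, right])

result = collect_aliases(plan)
print(result)
```

Keying on the expression ID rather than the column name is what lets the two `id` columns keep different aliases.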


青春惊慌失措

2021-01-15 17:59

For Java:

As @veinhorn mentioned, it is also possible to get the alias in Java. Here is a utility method example:

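// imports needed by the method below
import java.util.Optional;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan;
import org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias;

public static <T> Optional<String> getAlias(Dataset<T> dataset){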
    final LogicalPlan analyzed = dataset.queryExecution().analyzed();
    if(analyzed instanceof SubqueryAlias) {
        SubqueryAlias subqueryAlias = (SubqueryAlias) analyzed;
        return Optional.of(subqueryAlias.alias());
    }
    return Optional.empty();
}


囚心锁ツ

2021-01-15 18:06

You can try something like this, but I wouldn't go so far as to claim it is supported:

Spark < 2.1:

import org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias
import org.apache.spark.sql.Dataset

def getAlias(ds: Dataset[_]) = ds.queryExecution.analyzed match {
  case SubqueryAlias(alias, _) => Some(alias)
  case _ => None
}

Spark 2.1+:

def getAlias(ds: Dataset[_]) = ds.queryExecution.analyzed match {
  case SubqueryAlias(alias, _, _) => Some(alias)
  case _ => None
}

Example usage:

val plain = Seq((1, "foo")).toDF
getAlias(plain)

Option[String] = None

val aliased = plain.alias("a dataset")
getAlias(aliased)

Option[String] = Some(a dataset)
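
For PySpark, the same top-level check can probably be sketched through the `_jdf` handle. This is only a sketch: `get_alias` is a name made up here, and it assumes the analyzed plan exposes `alias()` on a `SubqueryAlias` node, as the PySpark answer in this thread relies on. Like the Scala versions, it uses undocumented internals and may break between Spark releases.

```python
def get_alias(df):
    """Return the alias of a PySpark DataFrame, or None if it has none.

    Assumption: `df._jdf` is the py4j handle to the underlying Dataset and
    the root of the analyzed plan is a SubqueryAlias when an alias was set.
    """
    plan = df._jdf.queryExecution().analyzed()
    # compare by class name, since the JVM class is not importable in Python
    if plan.getClass().getSimpleName() == "SubqueryAlias":
        return plan.alias()
    return None


# Hypothetical usage, given an existing DataFrame `df`:
#   get_alias(df.alias("a dataset"))
```

Only the root of the plan is inspected, so an alias buried under a later transformation such as a filter or join is not reported.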
