How can I retrieve the alias for a DataFrame in Spark?

北海茫月 2021-01-15 17:13

I'm using Spark 2.0.2. I have a DataFrame that has an alias on it, and I'd like to be able to retrieve that. A simplified example of why I'd want that is below.

3 answers
  • 2021-01-15 17:45

    Disclaimer: as stated above, this code relies on undocumented APIs subject to change. It works as of Spark 2.3.

    After much digging into mostly undocumented Spark methods, here is the full code to pull a DataFrame's list of fields, each with its table alias, in PySpark:

    def schema_from_plan(df):
        # Analyzed logical plan of the underlying JVM Dataset (private API, reached via py4j).
        plan = df._jdf.queryExecution().analyzed()
        # Fields found beneath SubqueryAlias nodes, keyed by expression ID.
        all_fields = _schema_from_plan(plan)

        iterator = plan.output().iterator()
        output_fields = {}
        while iterator.hasNext():
            field = iterator.next()
            queryfield = all_fields.get(field.exprId().id(), {})
            # Output columns that were not found under any alias get an empty alias.
            tablealias = queryfield.get("tablealias", "")
            output_fields[field.exprId().id()] = {
                "tablealias": tablealias,
                "dataType": field.dataType().typeName(),
                "name": field.name()
            }
        return list(output_fields.values())

    def _schema_from_plan(root, tablealias=None, fields=None):
        # Create the accumulator per call: a mutable default ({}) would be
        # shared across calls and leak fields from one DataFrame into the next.
        if fields is None:
            fields = {}
        # Only the children of the top-level node are visited, so an alias
        # on the top-level plan node itself is not recorded.
        children = root.children().iterator()
        while children.hasNext():
            node = children.next()
            node_class = node.getClass().getSimpleName()
            if node_class == "SubqueryAlias":
                # get the alias and process the subnodes with this alias
                _schema_from_plan(node, node.alias(), fields)
            else:
                if tablealias:
                    # add all the fields, along with the unique IDs, and a new tablealias field;
                    # a separate iterator name keeps the loop over children intact
                    outputs = node.output().iterator()
                    while outputs.hasNext():
                        field = outputs.next()
                        fields[field.exprId().id()] = {
                            "tablealias": tablealias,
                            "dataType": field.dataType().typeName(),
                            "name": field.name()
                        }
                _schema_from_plan(node, tablealias, fields)
        return fields

    # example: fields = schema_from_plan(df)
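
As a follow-up, here is a minimal sketch of how the result of `schema_from_plan` could be post-processed to see which columns belong to which alias. The helper name `columns_by_alias` and the sample rows are hypothetical, written by hand to mimic the list-of-dicts shape that `schema_from_plan` returns; they are not real Spark output.

```python
from collections import defaultdict


def columns_by_alias(fields):
    """Group column names by table alias.

    `fields` is a list of dicts with "tablealias" and "name" keys, i.e. the
    shape returned by schema_from_plan. Columns without an alias end up
    under the empty-string key.
    """
    grouped = defaultdict(list)
    for field in fields:
        grouped[field["tablealias"]].append(field["name"])
    return dict(grouped)


# Hypothetical sample input, hand-written to mimic schema_from_plan output.
sample = [
    {"tablealias": "a", "dataType": "integer", "name": "id"},
    {"tablealias": "a", "dataType": "string", "name": "label"},
    {"tablealias": "b", "dataType": "integer", "name": "id"},
    {"tablealias": "", "dataType": "long", "name": "total"},
]

print(columns_by_alias(sample))
# {'a': ['id', 'label'], 'b': ['id'], '': ['total']}
```

With a real DataFrame the call would be `columns_by_alias(schema_from_plan(df))`.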
    
  • 2021-01-15 17:59

    For Java:

    As @veinhorn mentioned, it is also possible to get the alias in Java. Here is a utility method example:

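    import java.util.Optional;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan;
    import org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias;

    // Returns the alias if the root of the analyzed plan is a SubqueryAlias, otherwise empty.
    public static <T> Optional<String> getAlias(Dataset<T> dataset) {
        final LogicalPlan analyzed = dataset.queryExecution().analyzed();
        if (analyzed instanceof SubqueryAlias) {
            SubqueryAlias subqueryAlias = (SubqueryAlias) analyzed;
            return Optional.of(subqueryAlias.alias());
        }
        return Optional.empty();
    }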
    
  • 2021-01-15 18:06

    You can try something like this, but I wouldn't go so far as to claim it is supported:

    • Spark < 2.1:

      import org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias
      import org.apache.spark.sql.Dataset
      
      def getAlias(ds: Dataset[_]) = ds.queryExecution.analyzed match {
        case SubqueryAlias(alias, _) => Some(alias)
        case _ => None
      }
      
    • Spark 2.1+:

      def getAlias(ds: Dataset[_]) = ds.queryExecution.analyzed match {
        case SubqueryAlias(alias, _, _) => Some(alias)
        case _ => None
      }
      

    Example usage:

    val plain = Seq((1, "foo")).toDF
    getAlias(plain)
    
    Option[String] = None
    
    val aliased = plain.alias("a dataset")
    getAlias(aliased)
    
    Option[String] = Some(a dataset)
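
The same root-node check can be sketched for PySpark. This is a rough equivalent under assumptions, not a supported API: the function name `get_alias` is hypothetical, and it goes through the private `_jdf` handle and the Catalyst `SubqueryAlias` node in the same way as the other answers here, so it may break between Spark versions.

```python
def get_alias(df):
    """Return the alias of a PySpark DataFrame, or None if it has none.

    Assumption: `df` exposes the private `_jdf` handle to the JVM Dataset.
    """
    # Analyzed logical plan of the underlying JVM Dataset (reached via py4j).
    plan = df._jdf.queryExecution().analyzed()
    # An aliased DataFrame has a SubqueryAlias node at the root of its plan.
    if plan.getClass().getSimpleName() == "SubqueryAlias":
        return plan.alias()
    return None
```

Usage would look like `get_alias(df.alias("a dataset"))` for an aliased DataFrame and `get_alias(df)` for a plain one, mirroring the Scala example.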
    