get specific row from spark dataframe

闹比i 2020-12-17 07:58

Is there any alternative for df[100, c("column")] in Scala Spark data frames? I want to select a specific row from a column of a Spark data frame, for example the 100th row of a given column.

9 Answers
  • 2020-12-17 08:13

    The getrows() function below should get the specific rows you want.

    For completeness, I have written down the full code in order to reproduce the output.

    # Create SparkSession
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.master('local').appName('scratch').getOrCreate()
    
    # Create the dataframe
    df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["letter", "name"])
    
    # Function to get rows at `rownums`
    def getrows(df, rownums=None):
        return df.rdd.zipWithIndex().filter(lambda x: x[1] in rownums).map(lambda x: x[0])
    
    # Get rows at positions 0 and 2.
    getrows(df, rownums=[0, 2]).collect()
    
    # Output:
    #> [Row(letter='a', name=1), Row(letter='c', name=3)]
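
    Since the question asks about Scala, a rough equivalent of the same zipWithIndex approach is sketched below (the variable names and the wanted positions are illustrative):

    // Sketch: pair each Row with its position, keep the wanted positions, drop the index again
    val wanted = Set(0L, 2L)
    val rows = df.rdd
      .zipWithIndex()                                   // RDD[(Row, Long)]
      .filter { case (_, idx) => wanted.contains(idx) }
      .map { case (row, _) => row }
      .collect()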
    
  • 2020-12-17 08:14

    This is how I achieved the same thing in Scala. I am not sure whether it is more efficient than the accepted answer, but it requires less code:

    val parquetFileDF = sqlContext.read.parquet("myParquetFile.parquet")
    
    val myRow7th = parquetFileDF.rdd.take(7).last
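
    take(7) pulls the first seven rows to the driver and .last keeps the seventh one (index 6). Once you have the Row, a field can be read by name; a small sketch, where the column name and type are illustrative:

    val value = myRow7th.getAs[Double]("column")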
    
  • 2020-12-17 08:19

    In PySpark, if your dataset is small enough to fit into the driver's memory, you can do

    df.collect()[n]
    

    where df is the DataFrame object and n is the index of the row of interest (0-based). After getting that Row, you can do row.myColumn or row["myColumn"] to get the contents, as spelled out in the API docs.
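
    The same pattern in Scala, which the question asks about, would roughly be the following; it is again only safe when the whole DataFrame fits in driver memory, and the column name is illustrative:

    // collect() returns Array[Row]; index it, then read a field by name
    val row = df.collect()(n)
    val value = row.getAs[String]("myColumn")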

  • 2020-12-17 08:25

    When you want to fetch the max value of a date column from a dataframe as a plain value, rather than as a Row object, you can use the code below.

    from pyspark.sql.functions import max  # Spark's max, not the Python builtin

    table = "mytable"

    max_date = df.select(max('date_col')).first()[0]

    This returns 2020-06-26 instead of Row(max(date_col)=datetime.date(2020, 6, 26)).
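
    A rough Scala equivalent, assuming date_col has the DATE type so the value comes back as java.sql.Date:

    import org.apache.spark.sql.functions.max

    // first() returns a single Row; position 0 holds the aggregated value
    val maxDate = df.agg(max("date_col")).first().getAs[java.sql.Date](0)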

  • 2020-12-17 08:28

    There is a Scala way (if you have enough memory on the working machine):

    val arr = df.select("column").rdd.collect
    println(arr(100))
    

    If the dataframe schema is unknown and you know the actual type of the "column" field (for example, double), then you can get arr as follows:

    val arr = df.select($"column".cast("Double")).as[Double].rdd.collect
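
    The .rdd hop is optional here; a typed Dataset can be collected directly. A sketch, assuming the usual implicits import for $ and the Double encoder:

    import spark.implicits._

    val arr: Array[Double] = df.select($"column".cast("double")).as[Double].collect()
    println(arr(100))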
    
  • 2020-12-17 08:28

    Following is a Java-Spark way to do it: 1) add a sequentially increasing id column, 2) select the row by its id, 3) drop the column.

    import static org.apache.spark.sql.functions.*;
    ..
    
    ds = ds.withColumn("rownum", monotonically_increasing_id());
    ds = ds.filter(col("rownum").equalTo(99));
    ds = ds.drop("rownum");
    

    N.B. monotonically_increasing_id starts from 0, but the generated ids are only guaranteed to be increasing and unique, not consecutive, so filtering on id 99 only selects the 100th row when the data sits in a single partition.
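
    If consecutive positions are needed, which they usually are for "give me the N-th row", row_number() over an explicit ordering is a safer variant. A Scala sketch, where "some_col" is an illustrative ordering column:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, row_number}

    // row_number() is 1-based and consecutive, but it needs an ordering and,
    // without a partitionBy, Spark moves all rows into a single partition
    val w = Window.orderBy("some_col")
    val row100 = df.withColumn("rownum", row_number().over(w))
      .filter(col("rownum") === 100)
      .drop("rownum")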
