spark-dataframe | 易学教程

Pyspark Dataframe get unique elements from column with string as list of elements

Question: I have a dataframe (created by loading multiple blobs from Azure) with a column that holds a list of IDs as a string. I want a list of the unique IDs across the entire column. Here is an example:

df:

| col1 | col2 | col3    |
| "a"  | "b"  | "[q,r]" |
| "c"  | "f"  | "[s,r]" |

Expected response:

resp = [q, r, s]

Any idea how to get there? My current approach is to convert the strings in col3 to Python lists and then flatten them out somehow, but so far I have not been able to do so. I
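The parse-then-flatten idea described in the question can be sketched in plain Python. This is only an illustration of the logic on the two example values, not the Spark solution; the helper names `parse_id_list` and `unique_ids` are hypothetical, and it assumes the IDs themselves contain no commas or brackets.

```python
def parse_id_list(text):
    """Turn a string such as "[q,r]" into the list ["q", "r"]."""
    inner = text.strip().strip("[]")  # drop surrounding brackets
    return [item.strip() for item in inner.split(",") if item.strip()]


def unique_ids(column_values):
    """Flatten every parsed list and return the distinct IDs, sorted."""
    seen = set()
    for text in column_values:
        seen.update(parse_id_list(text))
    return sorted(seen)


col3 = ["[q,r]", "[s,r]"]  # values of col3 from the example in the question
resp = unique_ids(col3)
print(resp)
```

On a real DataFrame the same steps are usually expressed with `regexp_replace`, `split`, and `explode` from `pyspark.sql.functions`, followed by `distinct()`, so that the work stays distributed instead of being collected to the driver first.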

Spark Get only columns that have one or more null values

Question: From a dataframe, I want to get the names of the columns that contain at least one null value. Consider the dataframe below:

val dataset = sparkSession.createDataFrame(Seq(
  (7, null, 18, 1.0),
  (8, "CA", null, 0.0),
  (9, "NZ", 15, 0.0)
)).toDF("id", "country", "hour", "clicked")

I want to get the column names 'country' and 'hour'.

id  country  hour  clicked
7   null     18    1
8   "CA"     null  0
9   "NZ"     15    0

Answer 1: This is one solution, but it's a bit awkward; I hope there is an easier way:

val cols = dataset
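The answer text is cut off here, so the following is not a reconstruction of it. It is a plain-Python sketch of the underlying check (keep a column name if any row holds a null in that position), using the example rows from the question with `None` standing in for `null`. The function name `columns_with_nulls` is hypothetical.

```python
def columns_with_nulls(column_names, rows):
    """Return the names of columns in which at least one row value is None."""
    return [
        name
        for index, name in enumerate(column_names)
        if any(row[index] is None for row in rows)
    ]


names = ["id", "country", "hour", "clicked"]
rows = [
    (7, None, 18, 1.0),
    (8, "CA", None, 0.0),
    (9, "NZ", 15, 0.0),
]
null_columns = columns_with_nulls(names, rows)
print(null_columns)
```

In Spark itself, a common way to get the same answer without collecting the rows is to count nulls per column in a single aggregation and keep the columns whose count is greater than zero.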

How to divide or multiply every non-string column of a PySpark dataframe by a float constant?

Question: My input dataframe looks like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Basics").getOrCreate()
df = spark.createDataFrame(
    data=[('Alice', 4.300, None), ('Bob', float('nan'), 897)],
    schema=['name', 'High', 'Low'])

+-----+----+----+
| name|High| Low|
+-----+----+----+
|Alice| 4.3|null|
|  Bob| NaN| 897|
+-----+----+----+

Expected output if divided by 10.0:

+-----+----+----+
| name|High| Low|
+-----+----+----+
|Alice|0.43|null|
|  Bob| NaN|89.7|
+-----+----+----+
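The expected output implies three rules: string columns pass through unchanged, nulls stay null, and NaN stays NaN. A minimal plain-Python sketch of those rules on the example rows follows; it models each row as a dict and is not the PySpark implementation. The helper name `scale_row` is hypothetical.

```python
import math
from numbers import Number


def scale_row(row, divisor):
    """Divide every numeric value by divisor; leave strings and None untouched."""
    scaled = {}
    for name, value in row.items():
        if isinstance(value, Number) and not isinstance(value, bool):
            scaled[name] = value / divisor  # NaN / divisor is still NaN
        else:
            scaled[name] = value  # strings and None pass through
    return scaled


rows = [
    {"name": "Alice", "High": 4.3, "Low": None},
    {"name": "Bob", "High": float("nan"), "Low": 897},
]
result = [scale_row(row, 10.0) for row in rows]
for row in result:
    print(row)
```

On a DataFrame, the equivalent is usually to read the column types from `df.dtypes`, then build a `select` that divides each non-string column by the constant and passes the string columns through as they are. Spark arithmetic already propagates null and NaN, so no special handling is needed there.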

Spark - Scope, Data Frame, and memory management

Question: I am curious about how scope works with DataFrames in Spark. In the example below, I have a list of files; each is loaded independently into a DataFrame, some operation is performed, and then dfOutput is written to disk.

val files = getListOfFiles("outputs/emailsSplit")
for (file <- files) {
  val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("delimiter", "\t") // Delimiter is tab
    .option("parserLib", "UNIVOCITY") // Parser, which deals better with the email formatting
    .schema

Not able to set number of shuffle partition in pyspark

Question: I know that by default the number of partitions for tasks is set to 200 in Spark, and I can't seem to change this. I'm running Jupyter with Spark 1.6. I'm loading a fairly small table with about 37K rows from Hive using the following in my notebook:

from pyspark.sql.functions import *

sqlContext.sql("set spark.sql.shuffle.partitions=10")
test = sqlContext.table('some_table')
print(test.rdd.getNumPartitions())
print(test.count())

The output confirms 200 tasks. From the activity log, it's spinning up

Can Dataframe joins in Spark preserve order?

Question: I'm currently trying to join two DataFrames while retaining the row order of one of them. From "Which operations preserve RDD order?", it seems that (correct me if this is inaccurate, because I'm new to Spark) joins do not preserve order, because rows are joined and "arrive" at the final DataFrame in no specified order due to the data being in different partitions. How could one perform a join of two DataFrames while preserving the order of one table? E.g.,

+------------+---------

add columns in dataframes dynamically with column names as elements in List

Question: I have a List[N] like the one below:

val check = List("a", "b", "c", "d")

where N can be any number of elements. I have a dataframe with only one column, called "value". Based on the contents of value, I need to create N columns, with the elements of the list as column names and substring(x, y) as column contents. I have tried all possible ways, like withColumn and selectExpr, but nothing works. Please consider substring(X, Y), where X and Y are numbers based on some metadata. Below are the different versions of the code that I tried,
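The attempted code is cut off in this excerpt, so what follows is not the asker's code. It is a plain-Python sketch of the requirement as stated: loop over the list of names and derive one new field per name from a substring of `value`. The `spans` metadata, the sample string, and the helper name `add_substring_columns` are all hypothetical; positions are treated as 1-based with a length, which is how Spark's `substring(str, pos, len)` behaves.

```python
def add_substring_columns(value, names, spans):
    """Build a row with one extra field per name, cut out of value.

    spans maps each name to (pos, length), where pos is 1-based.
    """
    row = {"value": value}
    for name in names:
        pos, length = spans[name]
        row[name] = value[pos - 1 : pos - 1 + length]
    return row


check = ["a", "b", "c", "d"]
# Hypothetical metadata: (start position, length) for each new column.
spans = {"a": (1, 2), "b": (3, 2), "c": (5, 3), "d": (8, 1)}
row = add_substring_columns("ABCDEFGH", check, spans)
print(row)
```

The same shape carries over to a DataFrame: iterate over the list and apply `withColumn` once per name, carrying the growing DataFrame forward from one iteration to the next (in Scala this is commonly written as a `foldLeft` over the list).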