pyspark-sql

Spark: Parallelizing creation of multiple DataFrames

匆匆过客 submitted on 2020-01-24 20:45:05
Question: I'm currently generating DataFrames based on a list of IDs - each query based on one ID gives back a manageable subset of a very large PostgreSQL table. I then partition that output based on the file structure I need to write out. The problem is that I'm hitting a speed limit and majorly under-utilizing my executor resources. I'm not sure if this is a matter of rethinking my architecture or if there is some simple way to get around this, but basically I want to get more parallelization of
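One way to push past that ceiling (not taken from the truncated post above, just a sketch) is to issue the per-ID JDBC reads from a Python thread pool, since Spark jobs submitted from separate driver threads can run concurrently on the same SparkContext. The table name, ID column, and connection details below are placeholders.

from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def read_subset(record_id):
    # Each call runs its own pushed-down query against PostgreSQL.
    return (spark.read.format("jdbc")
            .option("url", "jdbc:postgresql://host:5432/db")             # placeholder
            .option("query", f"SELECT * FROM big_table WHERE id = {record_id}")
            .option("user", "user")
            .option("password", "password")                              # placeholders
            .load())

ids = [101, 102, 103]
with ThreadPoolExecutor(max_workers=4) as pool:
    # Reads (and any writes chained onto them) are submitted concurrently,
    # so more executor capacity stays busy at once.
    subsets = list(pool.map(read_subset, ids))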

Group By and standardize in spark

試著忘記壹切 submitted on 2020-01-24 12:57:05
Question: I have the following data frame: import pandas as pd import numpy as np df = pd.DataFrame([[1,2,3],[1,2,1],[1,2,2],[2,2,2],[2,3,2],[2,4,2]],columns=["a","b","c"]) df = df.set_index("a") df.groupby("a").mean() df.groupby("a").std() I want to standardize the dataframe for each key and NOT standardize the whole column vector. So for the following example the output would be: a = 1: Column: b (2 - 2) / 0.0 (2 - 2) / 0.0 (2 - 2) / 0.0 Column: c (3 - 2) / 1.0 (1 - 2) / 1.0 (2 - 2) / 1.0 And then I
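The example above is pandas, but the same per-key standardization can be expressed in PySpark with a window partitioned by the key; a minimal sketch of that idea, using the column names from the sample data:

from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(
    [(1, 2, 3), (1, 2, 1), (1, 2, 2), (2, 2, 2), (2, 3, 2), (2, 4, 2)],
    ["a", "b", "c"])

w = Window.partitionBy("a")
for c in ["b", "c"]:
    # Subtract the group mean and divide by the group (sample) standard deviation.
    # Note: a zero stddev (constant column within a group) yields null in Spark, not inf.
    sdf = sdf.withColumn(c, (F.col(c) - F.avg(c).over(w)) / F.stddev(c).over(w))

sdf.show()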

NameError: name 'dbutils' is not defined in pyspark

时光毁灭记忆、已成空白 submitted on 2020-01-24 10:48:47
Question: I am running a pyspark job in Databricks cloud. I need to write some of the csv files to the Databricks filesystem (dbfs) as part of this job, and I also need to use some of the dbutils native commands like, #mount azure blob to dbfs location dbutils.fs.mount (source="...",mount_point="/mnt/...",extra_configs="{key:value}") I am also trying to unmount once the files have been written to the mount directory. But when I am using dbutils directly in the pyspark job it is failing with NameError: name
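A workaround that is often suggested for this (a sketch, assuming the job runs on a Databricks cluster or through Databricks Connect) is to construct dbutils explicitly instead of relying on the notebook-injected global:

def get_dbutils(spark):
    try:
        # Available when running on a Databricks cluster / Databricks Connect.
        from pyspark.dbutils import DBUtils
        return DBUtils(spark)
    except ImportError:
        # Inside a notebook the object already exists in the user namespace.
        import IPython
        return IPython.get_ipython().user_ns["dbutils"]

dbutils = get_dbutils(spark)
dbutils.fs.mount(source="...",                        # storage URL elided as in the post
                 mount_point="/mnt/...",
                 extra_configs={"key": "value"})      # real config keys go here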

Replace value in deep nested schema Spark Dataframe

自闭症网瘾萝莉.ら submitted on 2020-01-23 15:13:07
Question: I am new to pyspark. I am trying to understand how to access a parquet file with multiple levels of nested structs and arrays. I need to replace some values in a data-frame (with a nested schema) with null. I have seen this solution; it works fine with structs, but I am not sure how it works with arrays. My schema is something like this |-- unitOfMeasure: struct | |-- raw: struct | | |-- id: string | | |-- codingSystemId: string | | |-- display: string | |-- standard: struct | | |-- id: string | | |-
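The full schema is cut off above, so as an illustration only: on Spark 3.1+ one option is to combine transform over an array with withField on each struct element to null out a nested value. The column and field layout below is hypothetical, not the poster's schema.

from pyspark.sql import functions as F

# Hypothetical layout: "measurements" is an array of structs, each holding a
# "raw" struct whose "id" should become null when it equals "NA".
df = df.withColumn(
    "measurements",
    F.transform(
        "measurements",
        lambda m: m.withField(
            "raw",
            m["raw"].withField(
                "id",
                F.when(m["raw"]["id"] == "NA", F.lit(None)).otherwise(m["raw"]["id"])))))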

spark dataframe filter operation

社会主义新天地 submitted on 2020-01-21 14:34:22
Question: I have a spark dataframe and a filter string to apply; the filter only selects some rows, but I would like to know the reason the other rows were not selected. Example: DataFrame columns: customer_id|col_a|col_b|col_c|col_d Filter string: col_a > 0 & col_b > 4 & col_c < 0 & col_d=0 etc... reason_for_exclusion can be any string or letter, as long as it says why a particular row was excluded. I could split the filter string and apply each filter, but I have a huge filter string and it would be inefficient, so
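The post is truncated above; one possible approach (a sketch, assuming the filter string is, or can be rewritten as, a SQL-style conjunction that splits on its top-level ANDs) is to evaluate each predicate separately with expr and record the failed ones in the reason column:

from pyspark.sql import functions as F

filter_str = "col_a > 0 AND col_b > 4 AND col_c < 0 AND col_d = 0"
predicates = [p.strip() for p in filter_str.split("AND")]

# Emit the predicate text for every condition a row fails; concat_ws skips
# nulls, so rows that pass everything end up with an empty reason.
reasons = [F.when(~F.expr(p), F.lit(p)) for p in predicates]
df = df.withColumn("reason_for_exclusion", F.concat_ws("; ", *reasons))

included = df.filter(F.col("reason_for_exclusion") == "")
excluded = df.filter(F.col("reason_for_exclusion") != "")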

How do I run pyspark with jupyter notebook?

做~自己de王妃 submitted on 2020-01-21 05:47:06
Question: I am trying to fire up the jupyter notebook when I run the command pyspark in the console. When I type it now, it only starts an interactive shell in the console. However, it is not convenient for typing long lines of code. Is there a way to connect the jupyter notebook to the pyspark shell? Thanks. Answer 1: Assuming you have Spark installed wherever you are going to run Jupyter, I'd recommend you use findspark. Once you pip install findspark , you can just import findspark findspark.init() import
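The answer is truncated above; the findspark pattern it is describing looks roughly like this, run from a Jupyter cell (it assumes Spark is installed locally and discoverable via SPARK_HOME or a standard install path):

# pip install findspark
import findspark
findspark.init()          # locates Spark and puts pyspark on sys.path

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("notebook").getOrCreate()

Another commonly used route is to export PYSPARK_DRIVER_PYTHON=jupyter and PYSPARK_DRIVER_PYTHON_OPTS=notebook before running pyspark, so the pyspark command itself launches the notebook.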

PySpark Numeric Window Group By

♀尐吖头ヾ submitted on 2020-01-20 08:20:06
Question: I'd like to be able to have Spark group by a step size, as opposed to just single values. Is there anything in spark similar to PySpark 2.x's window function for numeric (non-date) values? Something along the lines of: sqlContext = SQLContext(sc) df = sqlContext.createDataFrame([10, 11, 12, 13], "integer").toDF("foo") res = df.groupBy(window("foo", step=2, start=10)).count() Answer 1: You can reuse the timestamp one and express the parameters in seconds. Tumbling: from pyspark.sql.functions import col,
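The answer is cut off above; the trick it refers to is casting the numeric column to a timestamp so the time-based window function can bucket it. A rough reconstruction, with the window struct's bounds cast back to plain numbers at the end:

from pyspark.sql.functions import col, window

res = (df
       .groupBy(window(col("foo").cast("timestamp"), "2 seconds"))   # step of 2
       .count()
       .withColumn("start", col("window.start").cast("long"))
       .withColumn("end", col("window.end").cast("long")))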

How to maintain sort order in PySpark collect_list and collect multiple lists

余生颓废 submitted on 2020-01-17 00:28:44
Question: I want to maintain the date sort order, using collect_list for multiple columns, all with the same date order. I'll need them in the same dataframe so I can use them to create a time series model input. Below is a sample of the "train_data": I'm using a Window with PartitionBy to ensure sort order by tuning_evnt_start_dt for each Syscode_Stn. I can create one column with this code: from pyspark.sql import functions as F from pyspark.sql import Window w = Window.partitionBy('Syscode_Stn')
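The code is truncated above; a sketch of how it is usually completed: add an ordering and an unbounded frame to the window, collect each column over it, and keep one row per Syscode_Stn. The value column names col1 and col2 are placeholders, since the real ones are not shown.

from pyspark.sql import functions as F
from pyspark.sql import Window

w = (Window.partitionBy('Syscode_Stn')
     .orderBy('tuning_evnt_start_dt')
     .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

result = (train_data
          # every row in a partition now carries the same full, date-ordered lists
          .withColumn('col1_list', F.collect_list('col1').over(w))
          .withColumn('col2_list', F.collect_list('col2').over(w))
          .select('Syscode_Stn', 'col1_list', 'col2_list')
          .dropDuplicates(['Syscode_Stn']))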