pyspark-sql

Spark: Parallelizing creation of multiple DataFrames

匆匆过客 submitted on 2020-01-24 20:45:05
Question: I'm currently generating DataFrames based on a list of IDs - each query based on one ID gives back a manageable subset of a very large PostgreSQL table. I then partition that output based on the file structure I need to write out. The problem is that I'm hitting a speed limit and majorly under-utilizing my executor resources. I'm not sure if this is a matter of rethinking my architecture or if there is some simple way to get around this, but basically I want to get more parallelization of
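One way to push past that ceiling (not taken from the truncated post above, just a sketch) is to issue the per-ID JDBC reads from a Python thread pool, since Spark jobs submitted from separate driver threads can run concurrently on the same SparkContext. The table name, ID column, and connection details below are placeholders.

from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def read_subset(record_id):
    # Each call runs its own pushed-down query against PostgreSQL.
    return (spark.read.format("jdbc")
            .option("url", "jdbc:postgresql://host:5432/db")             # placeholder
            .option("query", f"SELECT * FROM big_table WHERE id = {record_id}")
            .option("user", "user")
            .option("password", "password")                              # placeholders
            .load())

ids = [101, 102, 103]
with ThreadPoolExecutor(max_workers=4) as pool:
    # Reads (and any writes chained onto them) are submitted concurrently,
    # so more executor capacity stays busy at once.
    subsets = list(pool.map(read_subset, ids))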

Group By and standardize in spark

試著忘記壹切 submitted on 2020-01-24 12:57:05
Question: I have the following data frame: import pandas as pd import numpy as np df = pd.DataFrame([[1,2,3],[1,2,1],[1,2,2],[2,2,2],[2,3,2],[2,4,2]],columns=["a","b","c"]) df = df.set_index("a") df.groupby("a").mean() df.groupby("a").std() I want to standardize the dataframe for each key and NOT standardize the whole column vector. So for the following example the output would be: a = 1: Column: b (2 - 2) / 0.0 (2 - 2) / 0.0 (2 - 2) / 0.0 Column: c (3 - 2) / 1.0 (1 - 2) / 1.0 (2 - 2) / 1.0 And then I
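The example above is pandas, but the same per-key standardization can be expressed in PySpark with a window partitioned by the key; a minimal sketch of that idea, using the column names from the sample data:

from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(
    [(1, 2, 3), (1, 2, 1), (1, 2, 2), (2, 2, 2), (2, 3, 2), (2, 4, 2)],
    ["a", "b", "c"])

w = Window.partitionBy("a")
for c in ["b", "c"]:
    # Subtract the group mean and divide by the group (sample) standard deviation.
    # Note: a zero stddev (constant column within a group) yields null in Spark, not inf.
    sdf = sdf.withColumn(c, (F.col(c) - F.avg(c).over(w)) / F.stddev(c).over(w))

sdf.show()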

NameError: name 'dbutils' is not defined in pyspark

时光毁灭记忆、已成空白 submitted on 2020-01-24 10:48:47
Question: I am running a pyspark job in Databricks cloud. I need to write some of the csv files to the Databricks filesystem (dbfs) as part of this job, and I also need to use some of the dbutils native commands like, #mount azure blob to dbfs location dbutils.fs.mount (source="...",mount_point="/mnt/...",extra_configs="{key:value}") I am also trying to unmount once the files have been written to the mount directory. But when I am using dbutils directly in the pyspark job it is failing with NameError: name
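A workaround that is often suggested for this (a sketch, assuming the job runs on a Databricks cluster or through Databricks Connect) is to construct dbutils explicitly instead of relying on the notebook-injected global:

def get_dbutils(spark):
    try:
        # Available when running on a Databricks cluster / Databricks Connect.
        from pyspark.dbutils import DBUtils
        return DBUtils(spark)
    except ImportError:
        # Inside a notebook the object already exists in the user namespace.
        import IPython
        return IPython.get_ipython().user_ns["dbutils"]

dbutils = get_dbutils(spark)
dbutils.fs.mount(source="...",                        # storage URL elided as in the post
                 mount_point="/mnt/...",
                 extra_configs={"key": "value"})      # real config keys go here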

Replace value in deep nested schema Spark Dataframe

自闭症网瘾萝莉.ら submitted on 2020-01-23 15:13:07
Question: I am new to pyspark. I am trying to understand how to access a parquet file with multiple levels of nested structs and arrays. I need to replace some values in a data-frame (with a nested schema) with null. I have seen this solution; it works fine with structs, but I am not sure how it works with arrays. My schema is something like this |-- unitOfMeasure: struct | |-- raw: struct | | |-- id: string | | |-- codingSystemId: string | | |-- display: string | |-- standard: struct | | |-- id: string | | |-
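The full schema is cut off above, so as an illustration only: on Spark 3.1+ one option is to combine transform over an array with withField on each struct element to null out a nested value. The column and field layout below is hypothetical, not the poster's schema.

from pyspark.sql import functions as F

# Hypothetical layout: "measurements" is an array of structs, each holding a
# "raw" struct whose "id" should become null when it equals "NA".
df = df.withColumn(
    "measurements",
    F.transform(
        "measurements",
        lambda m: m.withField(
            "raw",
            m["raw"].withField(
                "id",
                F.when(m["raw"]["id"] == "NA", F.lit(None)).otherwise(m["raw"]["id"])))))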

spark dataframe filter operation

社会主义新天地 submitted on 2020-01-21 14:34:22
Question: I have a spark dataframe and a filter string to apply; the filter only selects some rows, but I would like to know the reason the other rows were not selected. Example: DataFrame columns: customer_id|col_a|col_b|col_c|col_d Filter string: col_a > 0 & col_b > 4 & col_c < 0 & col_d=0 etc... reason_for_exclusion can be any string or letter, as long as it says why a particular row was excluded. I could split the filter string and apply each filter, but I have a huge filter string and it would be inefficient, so
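The post is truncated above; one possible approach (a sketch, assuming the filter string is, or can be rewritten as, a SQL-style conjunction that splits on its top-level ANDs) is to evaluate each predicate separately with expr and record the failed ones in the reason column:

from pyspark.sql import functions as F

filter_str = "col_a > 0 AND col_b > 4 AND col_c < 0 AND col_d = 0"
predicates = [p.strip() for p in filter_str.split("AND")]

# Emit the predicate text for every condition a row fails; concat_ws skips
# nulls, so rows that pass everything end up with an empty reason.
reasons = [F.when(~F.expr(p), F.lit(p)) for p in predicates]
df = df.withColumn("reason_for_exclusion", F.concat_ws("; ", *reasons))

included = df.filter(F.col("reason_for_exclusion") == "")
excluded = df.filter(F.col("reason_for_exclusion") != "")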

How do I run pyspark with jupyter notebook?

做~自己de王妃 submitted on 2020-01-21 05:47:06
Question: I am trying to fire up the jupyter notebook when I run the command pyspark in the console. When I type it now, it only starts an interactive shell in the console. However, it is not convenient for typing long lines of code. Is there a way to connect the jupyter notebook to the pyspark shell? Thanks. Answer 1: Assuming you have Spark installed wherever you are going to run Jupyter, I'd recommend you use findspark. Once you pip install findspark , you can just import findspark findspark.init() import
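The answer is truncated above; the findspark pattern it is describing looks roughly like this, run from a Jupyter cell (it assumes Spark is installed locally and discoverable via SPARK_HOME or a standard install path):

# pip install findspark
import findspark
findspark.init()          # locates Spark and puts pyspark on sys.path

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("notebook").getOrCreate()

Another commonly used route is to export PYSPARK_DRIVER_PYTHON=jupyter and PYSPARK_DRIVER_PYTHON_OPTS=notebook before running pyspark, so the pyspark command itself launches the notebook.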

PySpark Numeric Window Group By

♀尐吖头ヾ submitted on 2020-01-20 08:20:06
Question: I'd like to be able to have Spark group by a step size, as opposed to just single values. Is there anything in spark similar to PySpark 2.x's window function for numeric (non-date) values? Something along the lines of: sqlContext = SQLContext(sc) df = sqlContext.createDataFrame([10, 11, 12, 13], "integer").toDF("foo") res = df.groupBy(window("foo", step=2, start=10)).count() Answer 1: You can reuse the timestamp one and express the parameters in seconds. Tumbling: from pyspark.sql.functions import col,
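The answer is cut off above; the trick it refers to is casting the numeric column to a timestamp so the time-based window function can bucket it. A rough reconstruction, with the window struct's bounds cast back to plain numbers at the end:

from pyspark.sql.functions import col, window

res = (df
       .groupBy(window(col("foo").cast("timestamp"), "2 seconds"))   # step of 2
       .count()
       .withColumn("start", col("window.start").cast("long"))
       .withColumn("end", col("window.end").cast("long")))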

How to maintain sort order in PySpark collect_list and collect multiple lists

余生颓废 submitted on 2020-01-17 00:28:44
Question: I want to maintain the date sort order, using collect_list for multiple columns, all with the same date order. I'll need them in the same dataframe so I can use them to create a time series model input. Below is a sample of the "train_data": I'm using a Window with PartitionBy to ensure sort order by tuning_evnt_start_dt for each Syscode_Stn. I can create one column with this code: from pyspark.sql import functions as F from pyspark.sql import Window w = Window.partitionBy('Syscode_Stn')
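The code is truncated above; a sketch of how it is usually completed: add an ordering and an unbounded frame to the window, collect each column over it, and keep one row per Syscode_Stn. The value column names col1 and col2 are placeholders, since the real ones are not shown.

from pyspark.sql import functions as F
from pyspark.sql import Window

w = (Window.partitionBy('Syscode_Stn')
     .orderBy('tuning_evnt_start_dt')
     .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

result = (train_data
          # every row in a partition now carries the same full, date-ordered lists
          .withColumn('col1_list', F.collect_list('col1').over(w))
          .withColumn('col2_list', F.collect_list('col2').over(w))
          .select('Syscode_Stn', 'col1_list', 'col2_list')
          .dropDuplicates(['Syscode_Stn']))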