pyspark-sql

Rolling average without timestamp in pyspark

Submitted by 我怕爱的太早我们不能终老 on 2021-01-28 11:42:08
Question: We can find the rolling/moving average of time-series data using a window function in PySpark. The data I am dealing with doesn't have a timestamp column, but it does have a strictly increasing column frame_number . The data looks like this: d = [ {'session_id': 1, 'frame_number': 1, 'rtd': 11.0, 'rtd2': 11.0}, {'session_id': 1, 'frame_number': 2, 'rtd': 12.0, 'rtd2': 6.0}, {'session_id': 1, 'frame_number': 3, 'rtd': 4.0, 'rtd2': 233.0}, {'session_id': 1, 'frame_number': 4, 'rtd': 110.0, 'rtd2' …
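The excerpt is cut off, but the described setup supports a minimal sketch: because frame_number is strictly increasing, a row-based window frame can stand in for a time-based one. The frame of the current row plus the two preceding frames, and the final rtd2 value in the sample data, are assumptions for illustration only.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

d = [
    {'session_id': 1, 'frame_number': 1, 'rtd': 11.0,  'rtd2': 11.0},
    {'session_id': 1, 'frame_number': 2, 'rtd': 12.0,  'rtd2': 6.0},
    {'session_id': 1, 'frame_number': 3, 'rtd': 4.0,   'rtd2': 233.0},
    {'session_id': 1, 'frame_number': 4, 'rtd': 110.0, 'rtd2': 1.0},  # rtd2 here is made up; the excerpt is truncated
]
df = spark.createDataFrame(d)

# Row-based frame: the current row plus the two preceding frames, per session.
w = Window.partitionBy('session_id').orderBy('frame_number').rowsBetween(-2, 0)

df.withColumn('rtd_rolling_avg', F.avg('rtd').over(w)) \
  .withColumn('rtd2_rolling_avg', F.avg('rtd2').over(w)) \
  .show()
```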

Issue with df.show() in pyspark

Submitted by ﹥>﹥吖頭↗ on 2021-01-28 09:19:37
Question: I have the following code: import pyspark import pandas as pd from pyspark.sql import SQLContext from pyspark.sql.functions import udf from pyspark.sql.types import IntegerType, StringType sc = pyspark.SparkContext() sqlCtx = SQLContext(sc) df_pd = pd.DataFrame( data={'integers': [1, 2, 3], 'floats': [-1.0, 0.5, 2.7], 'integer_arrays': [[1, 2], [3, 4, 5], [6, 7, 8, 9]]} ) df = sqlCtx.createDataFrame(df_pd) df.printSchema() This runs fine up to here, but when I run df.show() it gives this error: …
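A point worth noting for this kind of failure: printSchema() only inspects metadata, while show() actually runs a job, so executor-side problems (a driver/worker Python environment mismatch is a common culprit) tend to surface only at show(). Below is a minimal sketch of the same construction using a SparkSession; the udf-related imports from the excerpt are omitted because they are not used up to this point.

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df_pd = pd.DataFrame(
    data={'integers': [1, 2, 3],
          'floats': [-1.0, 0.5, 2.7],
          'integer_arrays': [[1, 2], [3, 4, 5], [6, 7, 8, 9]]}
)

df = spark.createDataFrame(df_pd)
df.printSchema()  # schema only -- no Spark job is triggered
df.show()         # triggers execution; worker-side errors appear here
```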

Apache Spark 2.0 (PySpark) - DataFrame Error Multiple sources found for csv

Submitted by ↘锁芯ラ on 2021-01-28 08:01:10
Question: I am trying to create a dataframe using the following code in Spark 2.0. While executing the code in Jupyter/console, I get the error below. Can someone help me get rid of this error? Error: Py4JJavaError: An error occurred while calling o34.csv. : java.lang.RuntimeException: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.csv.DefaultSource15), please specify the fully qualified class name. at scala.sys.package$ …
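The error message itself points at the workaround: with both the built-in Spark 2.0 CSV reader and the external com.databricks:spark-csv package on the classpath, the short name "csv" is ambiguous, so either drop the external package or name the source fully. A hedged sketch (the path and options are placeholders):

```python
# Option 1: name the built-in source by its fully qualified class, as the error suggests.
df = (spark.read
      .format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")
      .option("header", "true")
      .load("/path/to/file.csv"))   # placeholder path

# Option 2 (usually cleaner): remove com.databricks:spark-csv from --packages /
# the classpath, since CSV support is built into Spark 2.0, and keep .format("csv").
```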

Filter pyspark dataframe based on time difference between two columns

Submitted by ぃ、小莉子 on 2021-01-28 06:42:41
Question: I have a dataframe with multiple columns, two of which are of type pyspark.sql.TimestampType . I would like to filter this dataframe to rows where the time difference between these two columns is less than one hour. I'm currently trying to do this like so: examples = data.filter((data.tstamp - data.date) < datetime.timedelta(hours=1)) But this fails with the following error message: org.apache.spark.sql.AnalysisException: cannot resolve '(`tstamp` - `date`)' due to data type mismatch: '( …
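Timestamp columns cannot be subtracted directly in this Spark version, which is what the AnalysisException is complaining about. One common workaround is to cast both columns to seconds since the epoch and compare the difference against 3600. A minimal sketch, reusing the data DataFrame and column names from the question:

```python
from pyspark.sql import functions as F

# Casting a timestamp to long yields seconds since the epoch, so the
# difference is the gap in seconds; one hour = 3600 seconds.
examples = data.filter(
    (F.col('tstamp').cast('long') - F.col('date').cast('long')) < 3600
)
```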

PySpark first and last function over a partition in one go

Submitted by 北战南征 on 2021-01-27 19:54:45
Question: I have PySpark code like this: spark_df = spark_df.orderBy('id', 'a1', 'c1') out_df = spark_df.groupBy('id', 'a1', 'a2').agg( F.first('c1').alias('c1'), F.last('c2').alias('c2'), F.first('c3').alias('c3')) I need to keep the data ordered by id, a1 and c1, and then select the columns shown above over the group defined by the keys id, a1 and c1. Because first and last are non-deterministic, I changed the code to the ugly-looking version below, which works, but I'm not sure it is efficient. w_first = …
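The excerpt is cut off at w_first = , but one way to make first and last deterministic is to compute them over an explicitly ordered window spanning the whole partition, then deduplicate. A sketch under the column names shown in the excerpt (the grouping keys id, a1, a2 and the ordering by c1 are taken from the code, not verified against the intended logic):

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# An ordered, full-partition frame makes first()/last() deterministic.
w = (Window.partitionBy('id', 'a1', 'a2')
           .orderBy('c1')
           .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

out_df = (spark_df
          .withColumn('c1', F.first('c1').over(w))
          .withColumn('c2', F.last('c2').over(w))
          .withColumn('c3', F.first('c3').over(w))
          .select('id', 'a1', 'a2', 'c1', 'c2', 'c3')
          .dropDuplicates())
```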

How can I see the SQL statements that Spark sends to my database?

Submitted by 末鹿安然 on 2021-01-27 06:16:46
Question: I have a Spark cluster and a Vertica database. I use spark.read.jdbc( # etc to load Spark dataframes into the cluster. When I do a certain groupby operation df2 = df.groupby('factor').agg(F.stddev('sum(PnL)')) df2.show() I then get a Vertica syntax exception: Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler …
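This does not answer the syntax error itself, but for seeing what Spark actually sends to the database: explain() prints the physical plan, including the JDBC relation and whatever filters or column pruning were pushed down (the aggregation itself runs in Spark, not in Vertica); checking the database's own query log is another option. A sketch assuming the df loaded via spark.read.jdbc from the question:

```python
from pyspark.sql import functions as F

df2 = df.groupby('factor').agg(F.stddev('sum(PnL)'))
df2.explain(True)   # extended plan: shows the JDBC scan and any pushed-down filters/columns
```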

how to fetch multiple tables using spark sql

Submitted by 女生的网名这么多〃 on 2021-01-05 11:01:17
Question: I am fetching data from MySQL using PySpark, currently for only one table. I want to fetch all tables from the MySQL database without calling the JDBC connection again and again. See the code below; is it possible to simplify it? Thank you in advance. url = "jdbc:mysql://localhost:3306/dbname" table_df=sqlContext.read.format("jdbc").option("url",url).option("dbtable","table_name").option("user","root").option("password", "root").load() sqlContext.registerDataFrameAsTable(table_df, "table1") table_df_1=sqlContext …
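A possible simplification, sketched below: keep the shared connection options in one dict and loop over the table names, registering each result as a temp view. Each table still results in its own JDBC read, but the boilerplate is written once. The table names and the driver class are placeholders, not taken from the question.

```python
url = "jdbc:mysql://localhost:3306/dbname"
common = {
    "url": url,
    "user": "root",
    "password": "root",
    "driver": "com.mysql.jdbc.Driver",   # assumed driver class
}

tables = ["table1", "table2", "table3"]  # placeholder table names

dfs = {}
for name in tables:
    dfs[name] = (sqlContext.read.format("jdbc")
                 .options(dbtable=name, **common)
                 .load())
    # Same effect as sqlContext.registerDataFrameAsTable(df, name)
    dfs[name].createOrReplaceTempView(name)
```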

How to calculate rolling sum with varying window sizes in PySpark

Submitted by 空扰寡人 on 2020-12-29 04:45:00
Question: I have a Spark dataframe that contains sales prediction data for some products in some stores over a time period. How do I calculate the rolling sum of Prediction for a window of the next N values? Input data:

+-----------+---------+------------+------------+---+
| ProductId | StoreId | Date       | Prediction | N |
+-----------+---------+------------+------------+---+
| 1         | 100     | 2019-07-01 | 0.92       | 2 |
| 1         | 100     | 2019-07-02 | 0.62       | 2 |
| 1         | 100     | 2019-07-03 | 0.89       | 2 |
| 1         | 100     | 2019-07-04 | …
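The excerpt ends mid-table, but the problem statement supports a sketch. A standard window frame cannot take its size from a column, so one workaround (Spark 2.4+ for the slice/aggregate SQL functions) is to collect the current and following Prediction values into an array and sum only the first N of them. It is assumed here that the input DataFrame is named df and that "next N values" includes the current row; adjust the slice offset if it should not.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Current row plus everything after it, per product/store, ordered by date.
w = (Window.partitionBy('ProductId', 'StoreId')
           .orderBy('Date')
           .rowsBetween(Window.currentRow, Window.unboundedFollowing))

result = (df
          .withColumn('future_preds', F.collect_list('Prediction').over(w))
          # Sum the first N collected values (slice is 1-indexed; 0D is a double literal).
          .withColumn('RollingSum',
                      F.expr('aggregate(slice(future_preds, 1, N), 0D, (acc, x) -> acc + x)'))
          .drop('future_preds'))
```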