pyspark-sql

Rolling average without timestamp in pyspark

Submitted by 我怕爱的太早我们不能终老 on 2021-01-28 11:42:08
Question: We can find the rolling/moving average of time-series data using a window function in PySpark. The data I am dealing with doesn't have a timestamp column, but it does have a strictly increasing column frame_number . The data looks like this: d = [ {'session_id': 1, 'frame_number': 1, 'rtd': 11.0, 'rtd2': 11.0}, {'session_id': 1, 'frame_number': 2, 'rtd': 12.0, 'rtd2': 6.0}, {'session_id': 1, 'frame_number': 3, 'rtd': 4.0, 'rtd2': 233.0}, {'session_id': 1, 'frame_number': 4, 'rtd': 110.0, 'rtd2' …
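The excerpt is cut off, but the described setup supports a minimal sketch: because frame_number is strictly increasing, a row-based window frame can stand in for a time-based one. The frame of the current row plus the two preceding frames, and the final rtd2 value in the sample data, are assumptions for illustration only.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

d = [
    {'session_id': 1, 'frame_number': 1, 'rtd': 11.0,  'rtd2': 11.0},
    {'session_id': 1, 'frame_number': 2, 'rtd': 12.0,  'rtd2': 6.0},
    {'session_id': 1, 'frame_number': 3, 'rtd': 4.0,   'rtd2': 233.0},
    {'session_id': 1, 'frame_number': 4, 'rtd': 110.0, 'rtd2': 1.0},  # rtd2 here is made up; the excerpt is truncated
]
df = spark.createDataFrame(d)

# Row-based frame: the current row plus the two preceding frames, per session.
w = Window.partitionBy('session_id').orderBy('frame_number').rowsBetween(-2, 0)

df.withColumn('rtd_rolling_avg', F.avg('rtd').over(w)) \
  .withColumn('rtd2_rolling_avg', F.avg('rtd2').over(w)) \
  .show()
```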

Issue with df.show() in pyspark

Submitted by ﹥>﹥吖頭↗ on 2021-01-28 09:19:37
Question: I have the following code: import pyspark import pandas as pd from pyspark.sql import SQLContext from pyspark.sql.functions import udf from pyspark.sql.types import IntegerType, StringType sc = pyspark.SparkContext() sqlCtx = SQLContext(sc) df_pd = pd.DataFrame( data={'integers': [1, 2, 3], 'floats': [-1.0, 0.5, 2.7], 'integer_arrays': [[1, 2], [3, 4, 5], [6, 7, 8, 9]]} ) df = sqlCtx.createDataFrame(df_pd) df.printSchema() This runs fine up to here, but when I run df.show() it gives this error: …
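A point worth noting for this kind of failure: printSchema() only inspects metadata, while show() actually runs a job, so executor-side problems (a driver/worker Python environment mismatch is a common culprit) tend to surface only at show(). Below is a minimal sketch of the same construction using a SparkSession; the udf-related imports from the excerpt are omitted because they are not used up to this point.

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df_pd = pd.DataFrame(
    data={'integers': [1, 2, 3],
          'floats': [-1.0, 0.5, 2.7],
          'integer_arrays': [[1, 2], [3, 4, 5], [6, 7, 8, 9]]}
)

df = spark.createDataFrame(df_pd)
df.printSchema()  # schema only -- no Spark job is triggered
df.show()         # triggers execution; worker-side errors appear here
```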

Apache Spark 2.0 (PySpark) - DataFrame Error Multiple sources found for csv

Submitted by ↘锁芯ラ on 2021-01-28 08:01:10
Question: I am trying to create a dataframe using the following code in Spark 2.0. While executing the code in Jupyter/console, I get the error below. Can someone help me get rid of this error? Error: Py4JJavaError: An error occurred while calling o34.csv. : java.lang.RuntimeException: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.csv.DefaultSource15), please specify the fully qualified class name. at scala.sys.package$ …
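The error message itself points at the workaround: with both the built-in Spark 2.0 CSV reader and the external com.databricks:spark-csv package on the classpath, the short name "csv" is ambiguous, so either drop the external package or name the source fully. A hedged sketch (the path and options are placeholders):

```python
# Option 1: name the built-in source by its fully qualified class, as the error suggests.
df = (spark.read
      .format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")
      .option("header", "true")
      .load("/path/to/file.csv"))   # placeholder path

# Option 2 (usually cleaner): remove com.databricks:spark-csv from --packages /
# the classpath, since CSV support is built into Spark 2.0, and keep .format("csv").
```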

Filter pyspark dataframe based on time difference between two columns

Submitted by ぃ、小莉子 on 2021-01-28 06:42:41
Question: I have a dataframe with multiple columns, two of which are of type pyspark.sql.TimestampType . I would like to filter this dataframe to rows where the time difference between these two columns is less than one hour. I'm currently trying to do this like so: examples = data.filter((data.tstamp - data.date) < datetime.timedelta(hours=1)) But this fails with the following error message: org.apache.spark.sql.AnalysisException: cannot resolve '(`tstamp` - `date`)' due to data type mismatch: '( …
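Timestamp columns cannot be subtracted directly in this Spark version, which is what the AnalysisException is complaining about. One common workaround is to cast both columns to seconds since the epoch and compare the difference against 3600. A minimal sketch, reusing the data DataFrame and column names from the question:

```python
from pyspark.sql import functions as F

# Casting a timestamp to long yields seconds since the epoch, so the
# difference is the gap in seconds; one hour = 3600 seconds.
examples = data.filter(
    (F.col('tstamp').cast('long') - F.col('date').cast('long')) < 3600
)
```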

PySpark first and last function over a partition in one go

Submitted by 北战南征 on 2021-01-27 19:54:45
Question: I have PySpark code like this: spark_df = spark_df.orderBy('id', 'a1', 'c1') out_df = spark_df.groupBy('id', 'a1', 'a2').agg( F.first('c1').alias('c1'), F.last('c2').alias('c2'), F.first('c3').alias('c3')) I need to keep the data ordered by id, a1 and c1, and then select the columns shown above over the group defined by the keys id, a1 and c1. Because first and last are non-deterministic, I changed the code to the ugly-looking version below, which works, but I'm not sure it is efficient. w_first = …
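The excerpt is cut off at w_first = , but one way to make first and last deterministic is to compute them over an explicitly ordered window spanning the whole partition, then deduplicate. A sketch under the column names shown in the excerpt (the grouping keys id, a1, a2 and the ordering by c1 are taken from the code, not verified against the intended logic):

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# An ordered, full-partition frame makes first()/last() deterministic.
w = (Window.partitionBy('id', 'a1', 'a2')
           .orderBy('c1')
           .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

out_df = (spark_df
          .withColumn('c1', F.first('c1').over(w))
          .withColumn('c2', F.last('c2').over(w))
          .withColumn('c3', F.first('c3').over(w))
          .select('id', 'a1', 'a2', 'c1', 'c2', 'c3')
          .dropDuplicates())
```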

How can I see the SQL statements that Spark sends to my database?

Submitted by 末鹿安然 on 2021-01-27 06:16:46
Question: I have a Spark cluster and a Vertica database. I use spark.read.jdbc( # etc to load Spark dataframes into the cluster. When I do a certain groupby operation df2 = df.groupby('factor').agg(F.stddev('sum(PnL)')) df2.show() I then get a Vertica syntax exception: Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler …
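This does not answer the syntax error itself, but for seeing what Spark actually sends to the database: explain() prints the physical plan, including the JDBC relation and whatever filters or column pruning were pushed down (the aggregation itself runs in Spark, not in Vertica); checking the database's own query log is another option. A sketch assuming the df loaded via spark.read.jdbc from the question:

```python
from pyspark.sql import functions as F

df2 = df.groupby('factor').agg(F.stddev('sum(PnL)'))
df2.explain(True)   # extended plan: shows the JDBC scan and any pushed-down filters/columns
```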

how to fetch multiple tables using spark sql

Submitted by 女生的网名这么多〃 on 2021-01-05 11:01:17
Question: I am fetching data from MySQL using PySpark, currently for only one table. I want to fetch all tables from the MySQL database without calling the JDBC connection again and again. See the code below; is it possible to simplify it? Thank you in advance. url = "jdbc:mysql://localhost:3306/dbname" table_df=sqlContext.read.format("jdbc").option("url",url).option("dbtable","table_name").option("user","root").option("password", "root").load() sqlContext.registerDataFrameAsTable(table_df, "table1") table_df_1=sqlContext …
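A possible simplification, sketched below: keep the shared connection options in one dict and loop over the table names, registering each result as a temp view. Each table still results in its own JDBC read, but the boilerplate is written once. The table names and the driver class are placeholders, not taken from the question.

```python
url = "jdbc:mysql://localhost:3306/dbname"
common = {
    "url": url,
    "user": "root",
    "password": "root",
    "driver": "com.mysql.jdbc.Driver",   # assumed driver class
}

tables = ["table1", "table2", "table3"]  # placeholder table names

dfs = {}
for name in tables:
    dfs[name] = (sqlContext.read.format("jdbc")
                 .options(dbtable=name, **common)
                 .load())
    # Same effect as sqlContext.registerDataFrameAsTable(df, name)
    dfs[name].createOrReplaceTempView(name)
```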

How to calculate rolling sum with varying window sizes in PySpark

Submitted by 空扰寡人 on 2020-12-29 04:45:00
Question: I have a Spark dataframe that contains sales prediction data for some products in some stores over a time period. How do I calculate the rolling sum of Prediction for a window of the next N values? Input data:

+-----------+---------+------------+------------+---+
| ProductId | StoreId | Date       | Prediction | N |
+-----------+---------+------------+------------+---+
| 1         | 100     | 2019-07-01 | 0.92       | 2 |
| 1         | 100     | 2019-07-02 | 0.62       | 2 |
| 1         | 100     | 2019-07-03 | 0.89       | 2 |
| 1         | 100     | 2019-07-04 | …
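The excerpt ends mid-table, but the problem statement supports a sketch. A standard window frame cannot take its size from a column, so one workaround (Spark 2.4+ for the slice/aggregate SQL functions) is to collect the current and following Prediction values into an array and sum only the first N of them. It is assumed here that the input DataFrame is named df and that "next N values" includes the current row; adjust the slice offset if it should not.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Current row plus everything after it, per product/store, ordered by date.
w = (Window.partitionBy('ProductId', 'StoreId')
           .orderBy('Date')
           .rowsBetween(Window.currentRow, Window.unboundedFollowing))

result = (df
          .withColumn('future_preds', F.collect_list('Prediction').over(w))
          # Sum the first N collected values (slice is 1-indexed; 0D is a double literal).
          .withColumn('RollingSum',
                      F.expr('aggregate(slice(future_preds, 1, N), 0D, (acc, x) -> acc + x)'))
          .drop('future_preds'))
```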