pyspark-sql

SQL or Pyspark - Get the last time a column had a different value for each ID

Submitted by 爱⌒轻易说出口 on 2021-02-11 12:14:05

Question: I am using pyspark, so I have tried both pyspark code and SQL. For each row, I am trying to get the last time the ADDRESS column held a different value, grouped by USER_ID. The rows are ordered by TIME. Take the table below:

+---+-------+-------+----+
| ID|USER_ID|ADDRESS|TIME|
+---+-------+-------+----+
|  1|      1|      A|  10|
|  2|      1|      B|  15|
|  3|      1|      A|  20|
|  4|      1|      A|  40|
|  5|      1|      A|  45|
+---+-------+-------+----+

The correct new column I would like is as below:

+---+-------+-------+----+---------+
| ID|USER_ID
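One possible approach (a minimal sketch, not from the original thread; the expected output above is cut off, so the exact semantics are inferred from the title) uses window functions: detect where ADDRESS changes with lag, record the previous TIME at each change point, and carry that value forward per USER_ID.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 1, "A", 10), (2, 1, "B", 15), (3, 1, "A", 20), (4, 1, "A", 40), (5, 1, "A", 45)],
    ["ID", "USER_ID", "ADDRESS", "TIME"],
)

w = Window.partitionBy("USER_ID").orderBy("TIME")

# Previous row's address and time; mark the previous TIME only where the address changed.
df = (df.withColumn("prev_address", F.lag("ADDRESS").over(w))
        .withColumn("prev_time", F.lag("TIME").over(w))
        .withColumn("change_time",
                    F.when(F.col("ADDRESS") != F.col("prev_address"), F.col("prev_time"))))

# Carry the last change point forward: the most recent TIME at which the
# address differed from the current one.
running = w.rowsBetween(Window.unboundedPreceding, Window.currentRow)
df = (df.withColumn("LAST_DIFF_TIME", F.last("change_time", ignorenulls=True).over(running))
        .drop("prev_address", "prev_time", "change_time"))
df.show()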

Create dataframe with schema provided as JSON file

Submitted by 戏子无情 on 2021-02-11 01:56:22

Question: How can I create a pyspark data frame from 2 JSON files? file1: this file has the complete data. file2: this file has only the schema of the file1 data.

file1:
{"RESIDENCY":"AUS","EFFDT":"01-01-1900","EFF_STATUS":"A","DESCR":"Australian Resident","DESCRSHORT":"Australian"}

file2:
[{"fields":[{"metadata":{},"name":"RESIDENCY","nullable":true,"type":"string"},{"metadata":{},"name":"EFFDT","nullable":true,"type":"string"},{"metadata":{},"name":"EFF_STATUS","nullable":true,"type":"string"},{"metadata":{},
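A minimal sketch of one way to do this, assuming file2 is a JSON list whose first element carries the "fields" array shown above (file names and paths are placeholders): load the schema file with the json module, rebuild a StructType with StructType.fromJson, and pass it to the reader for file1.

import json
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = SparkSession.builder.getOrCreate()

# Load the schema description (file2) and rebuild a Spark schema from it.
with open("file2.json") as f:
    schema_fields = json.load(f)[0]["fields"]
schema = StructType.fromJson({"type": "struct", "fields": schema_fields})

# Read the data file (file1) with the reconstructed schema.
df = spark.read.schema(schema).json("file1.json")
df.show(truncate=False)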

PySpark: An error occurred while calling o51.showString. No module named XXX

Submitted by 你。 on 2021-02-10 11:49:34

Question: My pyspark version is 2.2.0. I ran into a strange problem, which I will try to simplify as follows. The file structure:

|root
|-- cast_to_float.py
|-- tests
    |-- test.py

In cast_to_float.py, my code:

from pyspark.sql.types import FloatType
from pyspark.sql.functions import udf

def cast_to_float(y, column_name):
    return y.withColumn(column_name, y[column_name].cast(FloatType()))

def cast_to_float_1(y, column_name):
    to_float = udf(cast2float1, FloatType())
    return y.withColumn(column_name, to_float
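The excerpt is cut off, so the root cause is not certain, but a frequent cause of "No module named ..." raised from showString when a UDF is involved is that the executors cannot import the local module when the pickled UDF is deserialized. A minimal sketch of that fix, with paths assumed from the layout above: make the project root importable on the driver and ship the module to the workers with addPyFile.

import os
import sys
from pyspark.sql import SparkSession

# Project root, assuming this runs from tests/test.py.
root = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
sys.path.insert(0, root)  # driver-side import path

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.addPyFile(os.path.join(root, "cast_to_float.py"))  # executor side

import cast_to_float  # importable on both driver and executors now

df = spark.createDataFrame([("1.5",), ("2.0",)], ["x"])
cast_to_float.cast_to_float(df, "x").show()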

Unsupported Array error when reading JDBC source in (Py)Spark?

Submitted by 烈酒焚心 on 2021-02-10 06:27:50

Question: I am trying to convert a PostgreSQL DB to a DataFrame. Following is my code:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Connect to DB") \
    .getOrCreate()

jdbcUrl = "jdbc:postgresql://XXXXXX"
connectionProperties = {
    "user": " ",
    "password": " ",
    "driver": "org.postgresql.Driver"
}

query = "(SELECT table_name FROM information_schema.tables) XXX"
df = spark.read.jdbc(url=jdbcUrl, table=query, properties=connectionProperties)
table_name_list = df.select("table_name")
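The "Unsupported Array" error typically means one of the selected columns has a PostgreSQL array type that the JDBC reader cannot map. A minimal sketch of a common workaround (table, column, and connection details below are placeholders): cast the array column to text, or unnest it, inside the pushed-down subquery so Spark only sees scalar columns.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Connect to DB").getOrCreate()

jdbcUrl = "jdbc:postgresql://host:5432/dbname"
connectionProperties = {
    "user": "user",
    "password": "password",
    "driver": "org.postgresql.Driver"
}

# Cast the array column to text so the JDBC reader maps it to a plain string.
query = "(SELECT id, tags::text AS tags FROM some_table) AS src"
df = spark.read.jdbc(url=jdbcUrl, table=query, properties=connectionProperties)
df.printSchema()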

How to extract column name and column type from SQL in pyspark

Submitted by 孤街醉人 on 2021-02-08 10:01:53

Question: The Spark SQL syntax for a CREATE query is as follows:

CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name
  [(col_name1 col_type1 [COMMENT col_comment1], ...)]
  USING datasource
  [OPTIONS (key1=val1, key2=val2, ...)]
  [PARTITIONED BY (col_name1, col_name2, ...)]
  [CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS]
  [LOCATION path]
  [COMMENT table_comment]
  [TBLPROPERTIES (key1=val1, key2=val2, ...)]
  [AS select_statement]

where [x] means x is optional. I want the output as a tuple of
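The excerpt is cut off, but judging from the title, the goal is a list of (column_name, column_type) tuples. A minimal sketch using a regular expression over the column-definition list (it covers the simple cases above, not decimal(p, s) types or nested parentheses):

import re

def extract_columns(create_sql):
    # Grab the parenthesised column list that follows the table name.
    match = re.search(
        r"TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?[\w.]+\s*\((.*?)\)\s*USING",
        create_sql, re.IGNORECASE | re.DOTALL,
    )
    if not match:
        return []
    columns = []
    for col_def in match.group(1).split(","):
        parts = col_def.strip().split()
        if len(parts) >= 2:
            columns.append((parts[0], parts[1]))
    return columns

sql = """CREATE TABLE IF NOT EXISTS db1.events
         (id INT COMMENT 'primary key', name STRING, amount DOUBLE)
         USING parquet"""
print(extract_columns(sql))  # [('id', 'INT'), ('name', 'STRING'), ('amount', 'DOUBLE')]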

Calculate UDF once

Submitted by 匆匆过客 on 2021-02-08 10:00:12

Question: I want to have a UUID column in a pyspark dataframe that is calculated only once, so that I can select the column in a different dataframe and have the UUIDs be the same. However, the UDF for the UUID column is recalculated when I select the column. Here's what I'm trying to do:

>>> uuid_udf = udf(lambda: str(uuid.uuid4()), StringType())
>>> a = spark.createDataFrame([[1, 2]], ['col1', 'col2'])
>>> a = a.withColumn('id', uuid_udf())
>>> a.collect()
[Row(col1=1, col2=2, id='5ac8f818-e2d8-4c50
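A minimal sketch of two common remedies, assuming Spark 2.3+: mark the UDF as non-deterministic so the optimizer does not freely re-evaluate it, and persist the DataFrame so later selections reuse the materialized ids instead of re-running the UDF.

import uuid
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

uuid_udf = udf(lambda: str(uuid.uuid4()), StringType()).asNondeterministic()

a = spark.createDataFrame([[1, 2]], ["col1", "col2"])
a = a.withColumn("id", uuid_udf()).cache()  # materialize the ids once
a.count()                                   # force evaluation

b = a.select("id", "col1")                  # reuses the cached ids

For a stronger guarantee than cache (which can be evicted), write the DataFrame out and read it back, or use checkpoint().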

How to read many tables from the same database and save them to their own CSV file?

Submitted by 人盡茶涼 on 2021-02-08 08:01:32

Question: Below is working code that connects to a SQL Server and saves one table to a CSV file.

conf = new SparkConf().setAppName("test").setMaster("local").set("spark.driver.allowMultipleContexts", "true");
sc = new SparkContext(conf)
sqlContext = new SQLContext(sc)
df = sqlContext.read.format("jdbc")
    .option("url", "jdbc:sqlserver://DBServer:PORT")
    .option("databaseName", "xxx")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("dbtable", "xxx")
    .option("user", "xxx")
    .option(
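The snippet above is Scala-flavored; a minimal pyspark sketch of the looping idea, assuming a known list of table names and placeholder connection details, reads each table over JDBC and writes it to its own CSV directory:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("export-tables").getOrCreate()

jdbc_url = "jdbc:sqlserver://DBServer:PORT;databaseName=xxx"
props = {
    "user": "xxx",
    "password": "xxx",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}

tables = ["dbo.table1", "dbo.table2", "dbo.table3"]  # assumed list of tables

for table in tables:
    df = spark.read.jdbc(url=jdbc_url, table=table, properties=props)
    out_path = "/output/" + table.replace(".", "_")
    df.write.mode("overwrite").option("header", True).csv(out_path)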

Memory leaks when using pandas_udf and Parquet serialization?

Submitted by 一曲冷凌霜 on 2021-02-06 10:15:47

Question: I am currently developing my first whole system using PySpark and I am running into some strange, memory-related issues. In one of the stages, I would like to follow a Split-Apply-Combine strategy in order to modify a DataFrame. That is, I would like to apply a function to each of the groups defined by a given column and finally combine them all. The problem is, the function I want to apply is a prediction method for a fitted model that "speaks" the Pandas idiom, i.e., it is vectorized and
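A minimal sketch of the Split-Apply-Combine pattern with a grouped pandas function (Spark 3.x applyInPandas; on 2.x the equivalent is a GROUPED_MAP pandas_udf). The group/feature columns and the doubled "prediction" below stand in for the fitted model's predict call, which is not shown in the excerpt.

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("a", 1.0), ("a", 2.0), ("b", 3.0)], ["group", "feature"]
)

def predict_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # Placeholder for model.predict(pdf[["feature"]]).
    pdf = pdf.copy()
    pdf["prediction"] = pdf["feature"] * 2.0
    return pdf

result = df.groupBy("group").applyInPandas(
    predict_group, schema="group string, feature double, prediction double"
)
result.show()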