pyspark-sql

Working with jdbc jar in pyspark

Submitted by 偶尔善良 on 2019-12-20 02:36:05
Question: I need to read from a PostgreSQL database in PySpark. I know this has been asked before, such as here, here, and in many other places; however, the solutions there either use a jar in the local running directory or copy it to all workers manually. I downloaded the postgresql-9.4.1208 jar and placed it in /tmp/jars. I then called pyspark with the --jars and --driver-class-path switches:

    pyspark --master yarn --jars /tmp/jars/postgresql-9.4.1208.jar --driver-class-path /tmp/jars
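
A minimal sketch of the actual read, assuming a SparkSession named spark (Spark 2.x) and assuming pyspark was started with the jar on both the driver and executor classpaths (note that --driver-class-path typically needs to point at the jar file itself, not just a directory of jars); host, database, table, and credentials below are placeholders:

    # Read a PostgreSQL table over JDBC in PySpark.
    # Assumes pyspark was launched roughly as:
    #   pyspark --master yarn \
    #       --jars /tmp/jars/postgresql-9.4.1208.jar \
    #       --driver-class-path /tmp/jars/postgresql-9.4.1208.jar
    # Connection details below are placeholders.
    df = (spark.read
          .format("jdbc")
          .option("url", "jdbc:postgresql://dbhost:5432/mydb")
          .option("dbtable", "public.my_table")
          .option("user", "myuser")
          .option("password", "mypassword")
          .option("driver", "org.postgresql.Driver")
          .load())
    df.printSchema()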

How to access a Hive ACID table in Spark SQL?

Submitted by 别等时光非礼了梦想. on 2019-12-19 11:03:44
Question: How can you access a Hive ACID table in Spark SQL?

Answer 1: We have worked on and open-sourced a datasource that enables users to work on their Hive ACID transactional tables using Spark. Github: https://github.com/qubole/spark-acid It is available as a Spark package, and instructions for using it are on the Github page. Currently the datasource supports only reading from Hive ACID tables; we are working on adding the ability to write into these tables via Spark as well. Feedback and
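
A minimal sketch of reading through this datasource, following the usage documented on the Github page; the format name, option key, package coordinates, and table name below are assumptions to be checked against the project's README:

    # Read a Hive ACID transactional table via the qubole/spark-acid datasource.
    # Launch with the package, e.g. (version is illustrative):
    #   pyspark --packages qubole:spark-acid:0.4.0-s_2.11
    acid_df = (spark.read
               .format("HiveAcid")
               .option("table", "default.my_acid_table")  # placeholder table name
               .load())
    acid_df.show(5)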

Remove an element from a Python list of lists in PySpark DataFrame

Submitted by 梦想与她 on 2019-12-19 08:54:32
Question: I am trying to remove an element from a Python list of lists:

    +---------------+
    |        sources|
    +---------------+
    |           [62]|
    |        [7, 32]|
    |           [62]|
    |   [18, 36, 62]|
    |[7, 31, 36, 62]|
    |    [7, 32, 62]|

I want to be able to remove an element, rm, from each of the lists above. I wrote a function that can do that for a list of lists:

    def asdf(df, rm):
        temp = df
        for n in range(len(df)):
            temp[n] = [x for x in df[n] if x != rm]
        return temp

which does remove rm = 1:

    a = [[1,2,3],[1,2,3,4],[1,2,3,4,5]]
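
For the DataFrame column itself (rather than a plain Python list), a minimal sketch of two common approaches, assuming the DataFrame with the sources column shown above is named df and the value to drop is rm:

    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, IntegerType

    rm = 62  # placeholder value to remove

    # Spark 2.4+: the built-in array_remove drops every occurrence of rm.
    cleaned = df.withColumn("sources", F.array_remove("sources", rm))

    # Older Spark: a UDF wrapping the same list comprehension as the function above.
    remove_udf = F.udf(lambda xs: [x for x in xs if x != rm], ArrayType(IntegerType()))
    cleaned = df.withColumn("sources", remove_udf("sources"))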

Apache Spark OutOfMemoryError (HeapSpace)

Submitted by 走远了吗. on 2019-12-19 08:06:09
Question: I have a dataset of ~5M rows x 20 columns, containing a groupID and a rowID. My goal is to check whether (some) columns contain more than a fixed fraction (say, 50%) of missing (null) values within a group. If so, the entire column is set to missing (null) for that group.

    df = spark.read.parquet('path/to/parquet/')
    check_columns = {'col1': ..., 'col2': ..., ...}  # currently len(check_columns) = 8
    for col, _ in check_columns.items():
        total = (df
                 .groupBy('groupID').count()
                 .toDF(
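
The snippet is cut off above, but as a hedged sketch of one way to reduce the number of separate groupBy jobs (which can help with memory pressure), all null fractions can be computed in a single aggregation; the column names and the 50% threshold below are placeholders based on the question's description, not the asker's code:

    from pyspark.sql import functions as F

    check_columns = ['col1', 'col2']  # placeholder list of columns to check

    # Fraction of nulls per group and column, computed in one pass.
    fracs = df.groupBy('groupID').agg(
        *[F.avg(F.col(c).isNull().cast('double')).alias(c + '_null_frac')
          for c in check_columns])

    # Null out a column for groups where the fraction exceeds the threshold.
    joined = df.join(fracs, on='groupID')
    for c in check_columns:
        joined = joined.withColumn(
            c, F.when(F.col(c + '_null_frac') > 0.5, F.lit(None)).otherwise(F.col(c)))
    result = joined.drop(*[c + '_null_frac' for c in check_columns])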

Pyspark - Load file: Path does not exist

Submitted by 99封情书 on 2019-12-19 03:39:14
Question: I am a newbie to Spark. I'm trying to read a local CSV file within an EMR cluster. The file is located in /home/hadoop/. The script I'm using is this one:

    spark = SparkSession \
        .builder \
        .appName("Protob Conversion to Parquet") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()

    df = spark.read.csv('/home/hadoop/observations_temp.csv', header=True)

When I run the script, it raises the following error message:

    pyspark.sql.utils.AnalysisException: u'Path does not exist:
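
The error message is truncated above, but the usual cause is that on a cluster the default filesystem is HDFS, so a bare /home/hadoop/... path is resolved against HDFS rather than the local disk. A minimal sketch of two common fixes (paths are illustrative):

    # Option 1: point Spark explicitly at the local filesystem. This works when the
    # file is reachable on every node that reads it (e.g. local mode or driver-only reads).
    df = spark.read.csv('file:///home/hadoop/observations_temp.csv', header=True)

    # Option 2: copy the file into HDFS first, then read it from there:
    #   hdfs dfs -put /home/hadoop/observations_temp.csv /user/hadoop/
    df = spark.read.csv('/user/hadoop/observations_temp.csv', header=True)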

How to use matplotlib to plot pyspark sql results

Submitted by 我的未来我决定 on 2019-12-18 16:45:12
Question: I am new to PySpark. I want to plot the result using matplotlib, but I am not sure which function to use. I searched for a way to convert the SQL result to pandas and then use plot.

Answer 1: Hi team, I have found the solution for this. I converted the SQL DataFrame to a pandas DataFrame and then I was able to plot the graphs. Below is the sample code:

    from pyspark.sql import Row
    from pyspark.sql import HiveContext
    import pyspark
    from IPython.display import display
    import matplotlib
    import matplotlib.pyplot as plt
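
The sample code above is cut off; a minimal self-contained sketch of the same idea, assuming a SparkSession named spark (with the HiveContext from the answer, sqlContext.sql(...) works the same way) and placeholder table and column names:

    import matplotlib.pyplot as plt

    # Run the SQL, pull the (small) aggregated result to the driver as pandas, then plot it.
    sdf = spark.sql("SELECT category, COUNT(*) AS cnt FROM my_table GROUP BY category")
    pdf = sdf.toPandas()

    pdf.plot(kind='bar', x='category', y='cnt')
    plt.tight_layout()
    plt.show()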

Difference between createOrReplaceTempView and registerTempTable

Submitted by ⅰ亾dé卋堺 on 2019-12-18 12:24:17
Question: I am new to Spark and was trying out a few commands in Spark SQL using Python when I came across these two commands: createOrReplaceTempView() and registerTempTable(). What is the difference between the two commands? They seem to have the same set of functionalities.

Answer 1: registerTempTable is part of the 1.x API and has been deprecated in Spark 2.0. createOrReplaceTempView and createTempView were introduced in Spark 2.0 as replacements for registerTempTable. Other than that
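
A minimal sketch showing that both calls expose a DataFrame to spark.sql() under a name, with createOrReplaceTempView being the Spark 2.x spelling; the view and column names are placeholders:

    df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'val'])

    df.createOrReplaceTempView('my_view')   # Spark 2.x+
    # df.registerTempTable('my_view')       # Spark 1.x name, deprecated in 2.x

    spark.sql('SELECT id, val FROM my_view WHERE id = 1').show()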

Spark 2.0: Relative path in absolute URI (spark-warehouse)

Submitted by 岁酱吖の on 2019-12-18 11:49:24
Question: I'm trying to migrate from Spark 1.6.1 to Spark 2.0.0, and I am getting a weird error when trying to read a CSV file into Spark SQL. Previously, when I read a file from local disk in pyspark, I would do:

Spark 1.6

    df = sqlContext.read \
        .format('com.databricks.spark.csv') \
        .option('header', 'true') \
        .load('file:///C:/path/to/my/file.csv', schema=mySchema)

In the latest release I think it should look like this:

Spark 2.0

    spark = SparkSession.builder \
        .master('local[*]') \
        .appName('My
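
The Spark 2.0 snippet is cut off above. A commonly suggested workaround for the "relative path in absolute URI" error on Windows is to set spark.sql.warehouse.dir to an explicit file: URI when building the session; this is a sketch only, with placeholder paths and mySchema defined as in the question:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master('local[*]')
             .appName('My App')
             .config('spark.sql.warehouse.dir', 'file:///C:/tmp/spark-warehouse')
             .getOrCreate())

    df = spark.read.csv('file:///C:/path/to/my/file.csv', header=True, schema=mySchema)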

How to pivot on multiple columns in Spark SQL?

Submitted by 浪子不回头ぞ on 2019-12-18 11:27:46
Question: I need to pivot more than one column in a PySpark DataFrame. Sample DataFrame:

    >>> d = [(100,1,23,10),(100,2,45,11),(100,3,67,12),(100,4,78,13),(101,1,23,10),(101,2,45,13),(101,3,67,14),(101,4,78,15),(102,1,23,10),(102,2,45,11),(102,3,67,16),(102,4,78,18)]
    >>> mydf = spark.createDataFrame(d,['id','day','price','units'])
    >>> mydf.show()
    +---+---+-----+-----+
    | id|day|price|units|
    +---+---+-----+-----+
    |100|  1|   23|   10|
    |100|  2|   45|   11|
    |100|  3|   67|   12|
    |100|  4|   78|   13|
    |101|  1|   23|   10|
    |101|  2|
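
The sample output is cut off above. One common approach is a single pivot on day with one aggregate per value column; this sketch produces columns named like 1_price, 1_units, 2_price, ... for each day:

    from pyspark.sql import functions as F

    pivoted = (mydf
               .groupBy('id')
               .pivot('day')
               .agg(F.first('price').alias('price'), F.first('units').alias('units')))
    pivoted.show()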