pyspark-sql

Working with jdbc jar in pyspark

Submitted by 偶尔善良 on 2019-12-20 02:36:05
Question: I need to read from a PostgreSQL database in PySpark. I know this has been asked before, such as here, here, and in many other places; however, the solutions there either use a jar in the local running directory or copy it to all workers manually. I downloaded the postgresql-9.4.1208 jar and placed it in /tmp/jars. I then called pyspark with the --jars and --driver-class-path switches:

    pyspark --master yarn --jars /tmp/jars/postgresql-9.4.1208.jar --driver-class-path /tmp/jars
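
A minimal sketch of the actual read, assuming a SparkSession named spark (Spark 2.x) and assuming pyspark was started with the jar on both the driver and executor classpaths (note that --driver-class-path typically needs to point at the jar file itself, not just a directory of jars); host, database, table, and credentials below are placeholders:

    # Read a PostgreSQL table over JDBC in PySpark.
    # Assumes pyspark was launched roughly as:
    #   pyspark --master yarn \
    #       --jars /tmp/jars/postgresql-9.4.1208.jar \
    #       --driver-class-path /tmp/jars/postgresql-9.4.1208.jar
    # Connection details below are placeholders.
    df = (spark.read
          .format("jdbc")
          .option("url", "jdbc:postgresql://dbhost:5432/mydb")
          .option("dbtable", "public.my_table")
          .option("user", "myuser")
          .option("password", "mypassword")
          .option("driver", "org.postgresql.Driver")
          .load())
    df.printSchema()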

How to access a Hive ACID table in Spark SQL?

Submitted by 别等时光非礼了梦想. on 2019-12-19 11:03:44
Question: How can you access a Hive ACID table in Spark SQL?

Answer 1: We have worked on and open-sourced a datasource that enables users to work on their Hive ACID transactional tables using Spark. Github: https://github.com/qubole/spark-acid It is available as a Spark package, and instructions for using it are on the Github page. Currently the datasource supports only reading from Hive ACID tables; we are working on adding the ability to write into these tables via Spark as well. Feedback and
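
A minimal sketch of reading through this datasource, following the usage documented on the Github page; the format name, option key, package coordinates, and table name below are assumptions to be checked against the project's README:

    # Read a Hive ACID transactional table via the qubole/spark-acid datasource.
    # Launch with the package, e.g. (version is illustrative):
    #   pyspark --packages qubole:spark-acid:0.4.0-s_2.11
    acid_df = (spark.read
               .format("HiveAcid")
               .option("table", "default.my_acid_table")  # placeholder table name
               .load())
    acid_df.show(5)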

Remove an element from a Python list of lists in PySpark DataFrame

Submitted by 梦想与她 on 2019-12-19 08:54:32
Question: I am trying to remove an element from a Python list of lists:

    +---------------+
    |        sources|
    +---------------+
    |           [62]|
    |        [7, 32]|
    |           [62]|
    |   [18, 36, 62]|
    |[7, 31, 36, 62]|
    |    [7, 32, 62]|

I want to be able to remove an element, rm, from each of the lists above. I wrote a function that can do that for a list of lists:

    def asdf(df, rm):
        temp = df
        for n in range(len(df)):
            temp[n] = [x for x in df[n] if x != rm]
        return temp

which does remove rm = 1:

    a = [[1,2,3],[1,2,3,4],[1,2,3,4,5]]
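
For the DataFrame column itself (rather than a plain Python list), a minimal sketch of two common approaches, assuming the DataFrame with the sources column shown above is named df and the value to drop is rm:

    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, IntegerType

    rm = 62  # placeholder value to remove

    # Spark 2.4+: the built-in array_remove drops every occurrence of rm.
    cleaned = df.withColumn("sources", F.array_remove("sources", rm))

    # Older Spark: a UDF wrapping the same list comprehension as the function above.
    remove_udf = F.udf(lambda xs: [x for x in xs if x != rm], ArrayType(IntegerType()))
    cleaned = df.withColumn("sources", remove_udf("sources"))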

Apache Spark OutOfMemoryError (HeapSpace)

Submitted by 走远了吗. on 2019-12-19 08:06:09
Question: I have a dataset of ~5M rows x 20 columns, containing a groupID and a rowID. My goal is to check whether (some) columns contain more than a fixed fraction (say, 50%) of missing (null) values within a group. If so, the entire column is set to missing (null) for that group.

    df = spark.read.parquet('path/to/parquet/')
    check_columns = {'col1': ..., 'col2': ..., ...}  # currently len(check_columns) = 8
    for col, _ in check_columns.items():
        total = (df
                 .groupBy('groupID').count()
                 .toDF(
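
The snippet is cut off above, but as a hedged sketch of one way to reduce the number of separate groupBy jobs (which can help with memory pressure), all null fractions can be computed in a single aggregation; the column names and the 50% threshold below are placeholders based on the question's description, not the asker's code:

    from pyspark.sql import functions as F

    check_columns = ['col1', 'col2']  # placeholder list of columns to check

    # Fraction of nulls per group and column, computed in one pass.
    fracs = df.groupBy('groupID').agg(
        *[F.avg(F.col(c).isNull().cast('double')).alias(c + '_null_frac')
          for c in check_columns])

    # Null out a column for groups where the fraction exceeds the threshold.
    joined = df.join(fracs, on='groupID')
    for c in check_columns:
        joined = joined.withColumn(
            c, F.when(F.col(c + '_null_frac') > 0.5, F.lit(None)).otherwise(F.col(c)))
    result = joined.drop(*[c + '_null_frac' for c in check_columns])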

Pyspark - Load file: Path does not exist

Submitted by 99封情书 on 2019-12-19 03:39:14
Question: I am a newbie to Spark. I'm trying to read a local CSV file within an EMR cluster. The file is located in /home/hadoop/. The script I'm using is this one:

    spark = SparkSession \
        .builder \
        .appName("Protob Conversion to Parquet") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()

    df = spark.read.csv('/home/hadoop/observations_temp.csv', header=True)

When I run the script, it raises the following error message:

    pyspark.sql.utils.AnalysisException: u'Path does not exist:
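
The error message is truncated above, but the usual cause is that on a cluster the default filesystem is HDFS, so a bare /home/hadoop/... path is resolved against HDFS rather than the local disk. A minimal sketch of two common fixes (paths are illustrative):

    # Option 1: point Spark explicitly at the local filesystem. This works when the
    # file is reachable on every node that reads it (e.g. local mode or driver-only reads).
    df = spark.read.csv('file:///home/hadoop/observations_temp.csv', header=True)

    # Option 2: copy the file into HDFS first, then read it from there:
    #   hdfs dfs -put /home/hadoop/observations_temp.csv /user/hadoop/
    df = spark.read.csv('/user/hadoop/observations_temp.csv', header=True)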

How to use matplotlib to plot pyspark sql results

Submitted by 我的未来我决定 on 2019-12-18 16:45:12
Question: I am new to PySpark. I want to plot the result using matplotlib, but I am not sure which function to use. I searched for a way to convert the SQL result to pandas and then use plot.

Answer 1: Hi team, I have found the solution for this. I converted the SQL DataFrame to a pandas DataFrame and then I was able to plot the graphs. Below is the sample code:

    from pyspark.sql import Row
    from pyspark.sql import HiveContext
    import pyspark
    from IPython.display import display
    import matplotlib
    import matplotlib.pyplot as plt
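
The sample code above is cut off; a minimal self-contained sketch of the same idea, assuming a SparkSession named spark (with the HiveContext from the answer, sqlContext.sql(...) works the same way) and placeholder table and column names:

    import matplotlib.pyplot as plt

    # Run the SQL, pull the (small) aggregated result to the driver as pandas, then plot it.
    sdf = spark.sql("SELECT category, COUNT(*) AS cnt FROM my_table GROUP BY category")
    pdf = sdf.toPandas()

    pdf.plot(kind='bar', x='category', y='cnt')
    plt.tight_layout()
    plt.show()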

Difference between createOrReplaceTempView and registerTempTable

Submitted by ⅰ亾dé卋堺 on 2019-12-18 12:24:17
Question: I am new to Spark and was trying out a few commands in Spark SQL using Python when I came across these two commands: createOrReplaceTempView() and registerTempTable(). What is the difference between the two commands? They seem to have the same set of functionalities.

Answer 1: registerTempTable is part of the 1.x API and has been deprecated in Spark 2.0. createOrReplaceTempView and createTempView were introduced in Spark 2.0 as replacements for registerTempTable. Other than that
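
A minimal sketch showing that both calls expose a DataFrame to spark.sql() under a name, with createOrReplaceTempView being the Spark 2.x spelling; the view and column names are placeholders:

    df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'val'])

    df.createOrReplaceTempView('my_view')   # Spark 2.x+
    # df.registerTempTable('my_view')       # Spark 1.x name, deprecated in 2.x

    spark.sql('SELECT id, val FROM my_view WHERE id = 1').show()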

Spark 2.0: Relative path in absolute URI (spark-warehouse)

Submitted by 岁酱吖の on 2019-12-18 11:49:24
Question: I'm trying to migrate from Spark 1.6.1 to Spark 2.0.0, and I am getting a weird error when trying to read a CSV file into Spark SQL. Previously, when I read a file from local disk in pyspark, I would do:

Spark 1.6

    df = sqlContext.read \
        .format('com.databricks.spark.csv') \
        .option('header', 'true') \
        .load('file:///C:/path/to/my/file.csv', schema=mySchema)

In the latest release I think it should look like this:

Spark 2.0

    spark = SparkSession.builder \
        .master('local[*]') \
        .appName('My
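
The Spark 2.0 snippet is cut off above. A commonly suggested workaround for the "relative path in absolute URI" error on Windows is to set spark.sql.warehouse.dir to an explicit file: URI when building the session; this is a sketch only, with placeholder paths and mySchema defined as in the question:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master('local[*]')
             .appName('My App')
             .config('spark.sql.warehouse.dir', 'file:///C:/tmp/spark-warehouse')
             .getOrCreate())

    df = spark.read.csv('file:///C:/path/to/my/file.csv', header=True, schema=mySchema)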

How to pivot on multiple columns in Spark SQL?

Submitted by 浪子不回头ぞ on 2019-12-18 11:27:46
Question: I need to pivot more than one column in a PySpark DataFrame. Sample DataFrame:

    >>> d = [(100,1,23,10),(100,2,45,11),(100,3,67,12),(100,4,78,13),(101,1,23,10),(101,2,45,13),(101,3,67,14),(101,4,78,15),(102,1,23,10),(102,2,45,11),(102,3,67,16),(102,4,78,18)]
    >>> mydf = spark.createDataFrame(d,['id','day','price','units'])
    >>> mydf.show()
    +---+---+-----+-----+
    | id|day|price|units|
    +---+---+-----+-----+
    |100|  1|   23|   10|
    |100|  2|   45|   11|
    |100|  3|   67|   12|
    |100|  4|   78|   13|
    |101|  1|   23|   10|
    |101|  2|
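
The sample output is cut off above. One common approach is a single pivot on day with one aggregate per value column; this sketch produces columns named like 1_price, 1_units, 2_price, ... for each day:

    from pyspark.sql import functions as F

    pivoted = (mydf
               .groupBy('id')
               .pivot('day')
               .agg(F.first('price').alias('price'), F.first('units').alias('units')))
    pivoted.show()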