pyspark-sql

Read in CSV in Pyspark with correct Datatypes

安稳与你 submitted on 2020-01-13 10:59:08
Question: When I try to import a local CSV with Spark, every column is read in as a string by default. However, my columns only include integers and a timestamp type. To be more specific, the CSV looks like this:

    "Customer","TransDate","Quantity","PurchAmount","Cost","TransID","TransKey"
    149332,"15.11.2005",1,199.95,107,127998739,100000

I have found code that should work in this question, but when I execute it, all the entries are returned as NULL. I use the following to create a custom schema:
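A minimal sketch of one way to get typed columns: pass an explicit schema plus a dateFormat option so "15.11.2005" is parsed instead of coming back NULL. Only the header row comes from the question; the column types and the file path are assumptions.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import (StructType, StructField,
                                   IntegerType, DoubleType, DateType)

    spark = SparkSession.builder.getOrCreate()

    # Explicit schema matching the header shown in the question.
    schema = StructType([
        StructField("Customer",    IntegerType(), True),
        StructField("TransDate",   DateType(),    True),
        StructField("Quantity",    IntegerType(), True),
        StructField("PurchAmount", DoubleType(),  True),
        StructField("Cost",        DoubleType(),  True),
        StructField("TransID",     IntegerType(), True),
        StructField("TransKey",    IntegerType(), True),
    ])

    # dateFormat tells the CSV reader how to parse "15.11.2005"; without it,
    # DateType columns come back NULL, which matches the symptom above.
    df = (spark.read
          .option("header", "true")
          .option("dateFormat", "dd.MM.yyyy")
          .schema(schema)
          .csv("/path/to/local.csv"))  # hypothetical path

    df.printSchema()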

PySpark - How to use a row value from one column to access another column which has the same name as the row value

荒凉一梦 submitted on 2020-01-13 06:18:11
Question: I have a PySpark df:

    +---+---+---+---+---+---+---+---+
    | id| a1| b1| c1| d1| e1| f1|ref|
    +---+---+---+---+---+---+---+---+
    |  0|  1| 23|  4|  8|  9|  5| b1|
    |  1|  2| 43|  8| 10| 20| 43| e1|
    |  2|  3| 15|  0|  1| 23|  7| b1|
    |  3|  4|  2|  6| 11|  5|  8| d1|
    |  4|  5|  6|  7|  2|  8|  1| f1|
    +---+---+---+---+---+---+---+---+

I eventually want to create another column "out" whose values are based on the "ref" column. For example, in the first row the "ref" column has b1 as its value. In the "out" column I would like to see column "b1"
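A minimal sketch of one way to do the lookup, assuming the dataframe is named df and "ref" only ever holds one of the listed column names: chain when() expressions through coalesce() so each row picks the value of the column its "ref" names.

    from pyspark.sql import functions as F

    value_cols = ["a1", "b1", "c1", "d1", "e1", "f1"]

    # For each candidate column, emit its value only when ref names it;
    # coalesce() then keeps the first (and only) non-null match per row.
    out_expr = F.coalesce(*[F.when(F.col("ref") == c, F.col(c)) for c in value_cols])

    result = df.withColumn("out", out_expr)
    result.show()

If "ref" holds a name outside value_cols, "out" is NULL for that row.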

Pyspark: filter dataframe by regex with string formatting?

孤街醉人 submitted on 2020-01-12 01:47:09
Question: I've read several posts on using the "like" operator to filter a Spark dataframe by the condition of containing a string/expression, but I was wondering whether the following is a "best practice" for using %s in the desired condition:

    input_path = <s3_location_str>
    my_expr = "Arizona.*hot"  # a regex expression
    dx = sqlContext.read.parquet(input_path)  # "keyword" is a field in dx

    # is the following correct?
    substr = "'%%%s%%'" % my_keyword  # escape % via %% to get "%"
    dk = dx.filter("keyword
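A minimal sketch of an alternative that avoids the %-escaping entirely, assuming "keyword" is a string column: rlike() takes the regular expression as-is, so no SQL LIKE pattern needs to be built with string formatting.

    from pyspark.sql import functions as F

    my_expr = "Arizona.*hot"  # regex from the question

    dx = sqlContext.read.parquet(input_path)
    # rlike() applies the regex directly to the column value.
    dk = dx.filter(F.col("keyword").rlike(my_expr))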

PySpark: How to create a nested JSON from spark data frame?

怎甘沉沦 submitted on 2020-01-10 02:21:08
Question: I am trying to create a nested JSON from my Spark dataframe, which has data in the following structure. The code below creates a simple JSON with key and value. Could you please help?

    df.coalesce(1).write.format('json').save(data_output_file+"createjson.json", overwrite=True)

Update 1: As per @MaxU's answer, I converted the Spark data frame to pandas and used group by. It is putting the last two fields in a nested array. How could I first put the category and count in a nested array and then inside
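One way to build the nesting in Spark itself, rather than via pandas, is to group and collect structs before writing. This is only a sketch: the grouping key "name" and the nested fields "category" and "count" are assumptions, since the full schema is not shown above.

    from pyspark.sql import functions as F

    # Collect (category, count) pairs into an array per group, producing
    # a nested "categories" field in the written JSON.
    nested = (df.groupBy("name")
                .agg(F.collect_list(F.struct("category", "count")).alias("categories")))

    nested.coalesce(1).write.mode("overwrite").json(data_output_file + "createjson.json")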

Can I read multiple files into a Spark Dataframe from S3, passing over nonexistent ones?

廉价感情. submitted on 2020-01-09 09:18:54
Question: I would like to read multiple parquet files into a dataframe from S3. Currently, I'm using the following method to do this:

    files = ['s3a://dev/2017/01/03/data.parquet',
             's3a://dev/2017/01/02/data.parquet']
    df = session.read.parquet(*files)

This works if all of the files exist on S3, but I would like to ask for a list of files to be loaded into a dataframe without breaking when some of the files in the list don't exist. In other words, I would like Spark SQL to load as many of the files as
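A sketch of one workaround, assuming a SparkSession named session: probe each path through the Hadoop FileSystem API (reached via PySpark's internal JVM gateway) and only pass the paths that exist to the reader.

    sc = session.sparkContext
    hadoop_conf = sc._jsc.hadoopConfiguration()
    Path = sc._jvm.org.apache.hadoop.fs.Path

    def existing(paths):
        # Keep only the paths whose objects actually exist on S3.
        keep = []
        for p in paths:
            fs = Path(p).getFileSystem(hadoop_conf)
            if fs.exists(Path(p)):
                keep.append(p)
        return keep

    df = session.read.parquet(*existing(files))

Note that _jsc and _jvm are internal handles, so this relies on implementation details of PySpark.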

How to convert json to pyspark dataframe (faster implementation) [duplicate]

99封情书 submitted on 2020-01-07 03:47:06
Question: This question already has answers here: reading json file in pyspark (3 answers). Closed 2 years ago. I have JSON data in the form of {'abc':1, 'def':2, 'ghi':3}. How do I convert it into a PySpark dataframe in Python?

Answer 1:

    import json

    j = {'abc':1, 'def':2, 'ghi':3}
    a = [json.dumps(j)]
    jsonRDD = sc.parallelize(a)
    df = spark.read.json(jsonRDD)

    >>> df.show()
    +---+---+---+
    |abc|def|ghi|
    +---+---+---+
    |  1|  2|  3|
    +---+---+---+

Source: https://stackoverflow.com/questions/44456076/how-to-convert-json-to-pyspark
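Since the question asks for a faster implementation, one alternative worth sketching is to skip the json.dumps / RDD / read.json round-trip and build the dataframe directly from a Row. Whether this is actually faster for a given workload is an assumption, not something measured here.

    from pyspark.sql import Row

    j = {'abc': 1, 'def': 2, 'ghi': 3}

    # Build the dataframe straight from the dict, avoiding JSON parsing
    # and schema inference over an RDD of strings.
    df = spark.createDataFrame([Row(**j)])
    df.show()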

java.io.IOException: Cannot run program “python”: CreateProcess error=2, The system cannot find the file specified

不打扰是莪最后的温柔 submitted on 2020-01-06 08:23:56
Question: I configured Eclipse with PySpark. I am using the latest versions of Spark and Python. When I try to write some code and run it, I get the error below:

    java.io.IOException: Cannot run program "python": CreateProcess error=2, The system cannot find the file specified

The code I have written is below:

    '''
    Created on 23-Dec-2017

    @author: lenovo
    '''
    from pyspark import SparkContext,SparkConf
    from builtins import int
    #from org.spark.com.PySparkDemo import data
    from pyspark.sql import Row
    from pyspark.sql.context
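The error means the JVM cannot find a "python" executable on the PATH it sees when launching the Python worker. A minimal sketch of one common fix is to point PYSPARK_PYTHON (and PYSPARK_DRIVER_PYTHON) at the interpreter before creating the SparkContext; whether this matches the asker's Eclipse setup is an assumption.

    import os
    import sys

    # Tell Spark exactly which Python interpreter to launch for workers
    # and for the driver, instead of relying on "python" being on PATH.
    os.environ["PYSPARK_PYTHON"] = sys.executable
    os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("PySparkDemo").setMaster("local[*]")
    sc = SparkContext(conf=conf)

Adding the Python installation directory to the Windows PATH environment variable is another way to resolve the same error.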

Drop function not working after left outer join in pyspark

泪湿孤枕 submitted on 2020-01-06 03:26:33
Question: My PySpark version is 2.1.1. I am trying to join two dataframes (left outer) having two columns, id and priority. I am creating my dataframes like this:

    a = "select 123 as id, 1 as priority"
    a_df = spark.sql(a)

    b = "select 123 as id, 1 as priority union select 112 as uid, 1 as priority"
    b_df = spark.sql(b)

    c_df = a_df.join(b_df, (a_df.id==b_df.id), 'left').drop(b_df.priority)

The c_df schema is coming out as DataFrame[uid: int, priority: int, uid: int, priority: int]. The drop function is not removing
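A sketch of a common workaround for the duplicate-column problem, assuming a_df and b_df as defined above: rename the right-hand priority column before the join and join on the column name, so only one id column survives and the extra priority can be dropped by name.

    # Rename before joining so the two priority columns are distinguishable,
    # then join on the column name to avoid a duplicated id column.
    b_renamed = b_df.withColumnRenamed("priority", "b_priority")

    c_df = (a_df.join(b_renamed, "id", "left")
                .drop("b_priority"))

    c_df.printSchema()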