pyspark-sql

Read in CSV in Pyspark with correct Datatypes

安稳与你 submitted on 2020-01-13 10:59:08
Question: When I try to import a local CSV with Spark, every column is read in as a string by default. However, my columns only include integers and a timestamp type. To be more specific, the CSV looks like this:

    "Customer","TransDate","Quantity","PurchAmount","Cost","TransID","TransKey"
    149332,"15.11.2005",1,199.95,107,127998739,100000

I have found code that should work in this question, but when I execute it, all the entries are returned as NULL. I use the following to create a custom schema:
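A minimal sketch of one way to get typed columns: pass an explicit schema plus a dateFormat option so "15.11.2005" is parsed instead of coming back NULL. Only the header row comes from the question; the column types and the file path are assumptions.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import (StructType, StructField,
                                   IntegerType, DoubleType, DateType)

    spark = SparkSession.builder.getOrCreate()

    # Explicit schema matching the header shown in the question.
    schema = StructType([
        StructField("Customer",    IntegerType(), True),
        StructField("TransDate",   DateType(),    True),
        StructField("Quantity",    IntegerType(), True),
        StructField("PurchAmount", DoubleType(),  True),
        StructField("Cost",        DoubleType(),  True),
        StructField("TransID",     IntegerType(), True),
        StructField("TransKey",    IntegerType(), True),
    ])

    # dateFormat tells the CSV reader how to parse "15.11.2005"; without it,
    # DateType columns come back NULL, which matches the symptom above.
    df = (spark.read
          .option("header", "true")
          .option("dateFormat", "dd.MM.yyyy")
          .schema(schema)
          .csv("/path/to/local.csv"))  # hypothetical path

    df.printSchema()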

PySpark - How to use a row value from one column to access another column which has the same name as the row value

荒凉一梦 submitted on 2020-01-13 06:18:11
Question: I have a PySpark df:

    +---+---+---+---+---+---+---+---+
    | id| a1| b1| c1| d1| e1| f1|ref|
    +---+---+---+---+---+---+---+---+
    |  0|  1| 23|  4|  8|  9|  5| b1|
    |  1|  2| 43|  8| 10| 20| 43| e1|
    |  2|  3| 15|  0|  1| 23|  7| b1|
    |  3|  4|  2|  6| 11|  5|  8| d1|
    |  4|  5|  6|  7|  2|  8|  1| f1|
    +---+---+---+---+---+---+---+---+

I eventually want to create another column "out" whose values are based on the "ref" column. For example, in the first row the "ref" column has b1 as its value. In the "out" column I would like to see column "b1"
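A minimal sketch of one way to do the lookup, assuming the dataframe is named df and "ref" only ever holds one of the listed column names: chain when() expressions through coalesce() so each row picks the value of the column its "ref" names.

    from pyspark.sql import functions as F

    value_cols = ["a1", "b1", "c1", "d1", "e1", "f1"]

    # For each candidate column, emit its value only when ref names it;
    # coalesce() then keeps the first (and only) non-null match per row.
    out_expr = F.coalesce(*[F.when(F.col("ref") == c, F.col(c)) for c in value_cols])

    result = df.withColumn("out", out_expr)
    result.show()

If "ref" holds a name outside value_cols, "out" is NULL for that row.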

Pyspark: filter dataframe by regex with string formatting?

孤街醉人 submitted on 2020-01-12 01:47:09
Question: I've read several posts on using the "like" operator to filter a Spark dataframe by the condition of containing a string/expression, but I was wondering whether the following is a "best practice" for using %s in the desired condition:

    input_path = <s3_location_str>
    my_expr = "Arizona.*hot"  # a regex expression
    dx = sqlContext.read.parquet(input_path)  # "keyword" is a field in dx

    # is the following correct?
    substr = "'%%%s%%'" % my_keyword  # escape % via %% to get "%"
    dk = dx.filter("keyword
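A minimal sketch of an alternative that avoids the %-escaping entirely, assuming "keyword" is a string column: rlike() takes the regular expression as-is, so no SQL LIKE pattern needs to be built with string formatting.

    from pyspark.sql import functions as F

    my_expr = "Arizona.*hot"  # regex from the question

    dx = sqlContext.read.parquet(input_path)
    # rlike() applies the regex directly to the column value.
    dk = dx.filter(F.col("keyword").rlike(my_expr))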

PySpark: How to create a nested JSON from spark data frame?

怎甘沉沦 submitted on 2020-01-10 02:21:08
Question: I am trying to create a nested JSON from my Spark dataframe, which has data in the following structure. The code below creates a simple JSON with key and value. Could you please help?

    df.coalesce(1).write.format('json').save(data_output_file+"createjson.json", overwrite=True)

Update 1: As per @MaxU's answer, I converted the Spark data frame to pandas and used group by. It is putting the last two fields in a nested array. How could I first put the category and count in a nested array and then inside
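One way to build the nesting in Spark itself, rather than via pandas, is to group and collect structs before writing. This is only a sketch: the grouping key "name" and the nested fields "category" and "count" are assumptions, since the full schema is not shown above.

    from pyspark.sql import functions as F

    # Collect (category, count) pairs into an array per group, producing
    # a nested "categories" field in the written JSON.
    nested = (df.groupBy("name")
                .agg(F.collect_list(F.struct("category", "count")).alias("categories")))

    nested.coalesce(1).write.mode("overwrite").json(data_output_file + "createjson.json")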

Can I read multiple files into a Spark Dataframe from S3, passing over nonexistent ones?

廉价感情. submitted on 2020-01-09 09:18:54
Question: I would like to read multiple parquet files into a dataframe from S3. Currently, I'm using the following method to do this:

    files = ['s3a://dev/2017/01/03/data.parquet',
             's3a://dev/2017/01/02/data.parquet']
    df = session.read.parquet(*files)

This works if all of the files exist on S3, but I would like to ask for a list of files to be loaded into a dataframe without breaking when some of the files in the list don't exist. In other words, I would like Spark SQL to load as many of the files as
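A sketch of one workaround, assuming a SparkSession named session: probe each path through the Hadoop FileSystem API (reached via PySpark's internal JVM gateway) and only pass the paths that exist to the reader.

    sc = session.sparkContext
    hadoop_conf = sc._jsc.hadoopConfiguration()
    Path = sc._jvm.org.apache.hadoop.fs.Path

    def existing(paths):
        # Keep only the paths whose objects actually exist on S3.
        keep = []
        for p in paths:
            fs = Path(p).getFileSystem(hadoop_conf)
            if fs.exists(Path(p)):
                keep.append(p)
        return keep

    df = session.read.parquet(*existing(files))

Note that _jsc and _jvm are internal handles, so this relies on implementation details of PySpark.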

How to convert json to pyspark dataframe (faster implementation) [duplicate]

99封情书 submitted on 2020-01-07 03:47:06
Question: This question already has answers here: reading json file in pyspark (3 answers). Closed 2 years ago. I have JSON data in the form of {'abc':1, 'def':2, 'ghi':3}. How do I convert it into a PySpark dataframe in Python?

Answer 1:

    import json

    j = {'abc':1, 'def':2, 'ghi':3}
    a = [json.dumps(j)]
    jsonRDD = sc.parallelize(a)
    df = spark.read.json(jsonRDD)

    >>> df.show()
    +---+---+---+
    |abc|def|ghi|
    +---+---+---+
    |  1|  2|  3|
    +---+---+---+

Source: https://stackoverflow.com/questions/44456076/how-to-convert-json-to-pyspark
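Since the question asks for a faster implementation, one alternative worth sketching is to skip the json.dumps / RDD / read.json round-trip and build the dataframe directly from a Row. Whether this is actually faster for a given workload is an assumption, not something measured here.

    from pyspark.sql import Row

    j = {'abc': 1, 'def': 2, 'ghi': 3}

    # Build the dataframe straight from the dict, avoiding JSON parsing
    # and schema inference over an RDD of strings.
    df = spark.createDataFrame([Row(**j)])
    df.show()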

java.io.IOException: Cannot run program “python”: CreateProcess error=2, The system cannot find the file specified

不打扰是莪最后的温柔 submitted on 2020-01-06 08:23:56
Question: I configured Eclipse with PySpark. I am using the latest versions of Spark and Python. When I try to write some code and run it, I get the error below:

    java.io.IOException: Cannot run program "python": CreateProcess error=2, The system cannot find the file specified

The code I have written is below:

    '''
    Created on 23-Dec-2017

    @author: lenovo
    '''
    from pyspark import SparkContext,SparkConf
    from builtins import int
    #from org.spark.com.PySparkDemo import data
    from pyspark.sql import Row
    from pyspark.sql.context
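The error means the JVM cannot find a "python" executable on the PATH it sees when launching the Python worker. A minimal sketch of one common fix is to point PYSPARK_PYTHON (and PYSPARK_DRIVER_PYTHON) at the interpreter before creating the SparkContext; whether this matches the asker's Eclipse setup is an assumption.

    import os
    import sys

    # Tell Spark exactly which Python interpreter to launch for workers
    # and for the driver, instead of relying on "python" being on PATH.
    os.environ["PYSPARK_PYTHON"] = sys.executable
    os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("PySparkDemo").setMaster("local[*]")
    sc = SparkContext(conf=conf)

Adding the Python installation directory to the Windows PATH environment variable is another way to resolve the same error.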

Drop function not working after left outer join in pyspark

泪湿孤枕 submitted on 2020-01-06 03:26:33
Question: My PySpark version is 2.1.1. I am trying to join two dataframes (left outer) having two columns, id and priority. I am creating my dataframes like this:

    a = "select 123 as id, 1 as priority"
    a_df = spark.sql(a)

    b = "select 123 as id, 1 as priority union select 112 as uid, 1 as priority"
    b_df = spark.sql(b)

    c_df = a_df.join(b_df, (a_df.id==b_df.id), 'left').drop(b_df.priority)

The c_df schema is coming out as DataFrame[uid: int, priority: int, uid: int, priority: int]. The drop function is not removing
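A sketch of a common workaround for the duplicate-column problem, assuming a_df and b_df as defined above: rename the right-hand priority column before the join and join on the column name, so only one id column survives and the extra priority can be dropped by name.

    # Rename before joining so the two priority columns are distinguishable,
    # then join on the column name to avoid a duplicated id column.
    b_renamed = b_df.withColumnRenamed("priority", "b_priority")

    c_df = (a_df.join(b_renamed, "id", "left")
                .drop("b_priority"))

    c_df.printSchema()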