pyspark-sql

Spark Multiple Conditions Join

Submitted by 。_饼干妹妹 on 2020-04-30 07:39:10
Question: I am using Spark SQL to join three tables, but I get an error when the join uses multiple column conditions.

    test_table = (T1.join(T2, T1.dtm == T2.kids_dtm, "inner")
                  .join(T3, T3.kids_dtm == T1.dtm and T2.room_id == T3.room_id and T2.book_id == T3.book_id, "inner"))

ERROR:

    Traceback (most recent call last):
      File "<stdin>", line 4, in <module>
      File "/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/spark/python/pyspark/sql/column.py", line 447, in __nonzero__
        raise ValueError("Cannot convert column into
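A minimal sketch of the usual fix, reusing the table and column names from the question: Python's `and` tries to convert each Column to a boolean (the `__nonzero__` call in the traceback), so the conditions should instead be combined with the Column operator `&`, with each comparison wrapped in parentheses.

```python
# Sketch: combine join conditions with "&" on Column objects rather than
# Python's "and", which triggers the boolean-conversion error above.
test_table = (
    T1.join(T2, T1.dtm == T2.kids_dtm, "inner")
      .join(
          T3,
          (T3.kids_dtm == T1.dtm)
          & (T2.room_id == T3.room_id)
          & (T2.book_id == T3.book_id),
          "inner",
      )
)
```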

Spark sql Optimization Techniques loading csv to orc format of hive

Submitted by 独自空忆成欢 on 2020-04-30 07:15:04
Question: Hi, I have 90 GB of data in a CSV file. I load this data into a temp table and then from the temp table into an ORC table using an insert-select, but converting and loading the data into ORC format takes 4 hours in Spark SQL. Is there any optimization technique I can use to reduce this time? As of now I am not using any optimization technique; I am just using Spark SQL to load the data from the CSV file into a table (text format) and then from this temp table into the ORC table (using select
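One commonly suggested direction, sketched below with an assumed input path, table name, and partition count (none of which appear in the question), is to skip the intermediate text table and write the CSV straight into an ORC table so the conversion happens in a single pass:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("csv_to_orc")
         .enableHiveSupport()
         .getOrCreate())

# Reading with an explicit schema would avoid an extra inference pass over
# a 90 GB file; the header option and path here are placeholders.
df = spark.read.option("header", "true").csv("/data/input/big_file.csv")

# Write directly to an ORC-backed table instead of staging in a text table;
# repartition() controls both parallelism and the number of output files.
(df.repartition(200)
   .write
   .format("orc")
   .mode("overwrite")
   .saveAsTable("mydb.orc_table"))
```

The repartition count of 200 is only illustrative; it would normally be tuned to the cluster's cores and the desired output file size.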

Pyspark alter column with substring

Submitted by |▌冷眼眸甩不掉的悲伤 on 2020-04-29 12:13:32
Question: PySpark n00b... How do I replace a column with a substring of itself? I'm trying to remove a set number of characters from the start and end of a string.

    from pyspark.sql.functions import substring
    import pandas as pd

    pdf = pd.DataFrame({'COLUMN_NAME': ['_string_', '_another string_']})
    # this is what I'm looking for...
    pdf['COLUMN_NAME_fix'] = pdf['COLUMN_NAME'].str[1:-1]

    df = sqlContext.createDataFrame(pdf)
    # following not working... COLUMN_NAME_fix is blank
    df.withColumn('COLUMN_NAME_fix',
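A minimal sketch of one way this is usually handled, using the column names from the question: `substring` needs a fixed length, so an SQL expression that computes the length per row can drop the first and last characters instead.

```python
from pyspark.sql import functions as F

# Keep everything from the 2nd character up to the next-to-last one;
# Spark SQL's substring() is 1-based, hence the offsets below.
df = df.withColumn(
    "COLUMN_NAME_fix",
    F.expr("substring(COLUMN_NAME, 2, length(COLUMN_NAME) - 2)"),
)
df.show(truncate=False)
```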

Filtering a pyspark dataframe using isin by exclusion [duplicate]

Submitted by 可紊 on 2020-04-27 19:46:51
Question: This question already has answers here: Pyspark dataframe operator "IS NOT IN" (6 answers). Closed last year.

I am trying to get all rows of a dataframe where a column's value is not in a given list (that is, filtering by exclusion). As an example:

    df = sqlContext.createDataFrame([('1','a'),('2','b'),('3','b'),('4','c'),('5','d')], schema=('id','bar'))

I get the data frame:

    +---+---+
    | id|bar|
    +---+---+
    |  1|  a|
    |  2|  b|
    |  3|  b|
    |  4|  c|
    |  5|  d|
    +---+---+

I only want to exclude rows where bar is ('a
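A short sketch of the usual pattern: negate `isin()` with `~` so only the remaining rows are kept. The exclusion list below is an assumption, since the question is cut off before it names the values.

```python
from pyspark.sql import functions as F

# Keep rows whose "bar" value is NOT in the exclusion list.
excluded = ["a", "b"]  # assumed list; the original question is truncated here
filtered = df.filter(~F.col("bar").isin(excluded))
filtered.show()
```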

concatenating two columns in pyspark data frame according to alphabets order [duplicate]

Submitted by 萝らか妹 on 2020-04-17 22:54:55
Question: This question already has an answer here: how to sort value before concatenate text columns in pyspark (1 answer). Closed 11 days ago.

I have a PySpark data frame with 5M rows, and I am going to apply fuzzy logic (Levenshtein and Soundex functions) to find duplicates in the first name and last name columns. [Input data shown as an image in the original post.] Before that, I want to re-order the first name and last name values so that I get the correct Levenshtein distance.

    df = df.withColumn('full_name', f.concat(f.col('first'), f
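A sketch of one way to do this, assuming the two columns are named 'first' and 'last' (only 'first' appears in the truncated code): put both values in an array, sort it, and concatenate, so either ordering of the names produces the same full_name.

```python
from pyspark.sql import functions as f

# Sort the two name values alphabetically before concatenating, so that
# ("smith", "john") and ("john", "smith") yield the same full_name.
df = df.withColumn(
    "full_name",
    f.concat_ws(" ", f.sort_array(f.array(f.col("first"), f.col("last")))),
)
```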

get distinct count from an array of each rows using pyspark

Submitted by 孤人 on 2020-04-16 03:31:34
Question: I am looking for the distinct count of the array in each row of a PySpark dataframe.

Input:

    col1
    [1,1,1]
    [3,4,5]
    [1,2,1,2]

Expected output:

    1
    3
    2

I used the code below, but it gives me the length of each array instead:

    3
    3
    4

Please help me achieve this using a Python PySpark dataframe.

    slen = udf(lambda s: len(s), IntegerType())
    count = Df.withColumn("Count", slen(df.col1))
    count.show()

Thanks in advance!

Answer 1: For Spark 2.4+ you can use array_distinct and then just get the size of that, to get
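A short sketch of the approach named in the answer (Spark 2.4+), using the column name from the question: deduplicate each array with array_distinct and take its size, with no UDF needed.

```python
from pyspark.sql import functions as F

# size(array_distinct(col1)) counts the unique elements in each row's array.
counted = df.withColumn("Count", F.size(F.array_distinct(F.col("col1"))))
counted.show()
```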

Is there a way to control number of part files in hdfs created from spark dataframe? [duplicate]

Submitted by 和自甴很熟 on 2020-04-10 06:39:28
Question: This question already has answers here: Spark How to Specify Number of Resulting Files for DataFrame While/After Writing (1 answer); How to control the number of output part files created by Spark job upon writing? (2 answers). Closed 17 days ago.

When I save the DataFrame resulting from a Spark SQL query to HDFS, it generates a large number of part files, each around 1.4 KB. Is there a way to increase the file size, given that every part file contains only about 2 records?

    df_crimes_dates_formated = spark
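A minimal sketch of the standard remedy, using the DataFrame name from the question but an assumed output path, format, and target file count: collapse the partitions with coalesce() (or repartition()) before writing, so Spark emits fewer, larger part files — one per partition.

```python
# Collapse to a small number of partitions before writing; Spark produces
# one part file per partition. The target of 10 and the path are placeholders.
(df_crimes_dates_formated
    .coalesce(10)
    .write
    .mode("overwrite")
    .parquet("hdfs:///user/output/crimes_dates_formated"))
```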