pyspark-sql

change values of structure dataframe

风流意气都作罢 submitted on 2020-01-16 19:37:06
Question: I want to fill a field of one structure from another existing structure: A11 of my data1 should take the value of x1.f2. I have tried different approaches without success. Does anyone have an idea?

schema = StructType([
    StructField('data1', StructType([
        StructField('A1', StructType([
            StructField('A11', StringType(), True),
            StructField('A12', IntegerType(), True)
        ])),
        StructField('A2', IntegerType(), True)
    ]))
])
df = sqlCtx.createDataFrame([], schema)

# Creation of df1
schema1 = StructType([
    StructField('x1',
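The question is cut off in this listing, so the complete setup of df1 is not shown. As a rough sketch of one common approach, the nested struct can be rebuilt with pyspark.sql.functions.struct, keeping the fields you want and swapping in a new value for A11 (the literal below stands in for x1.f2, whose source dataframe is not available here):

import pyspark.sql.functions as F

# Rebuild data1 so that data1.A1.A11 gets a new value while A12 and A2
# are carried over unchanged.
df2 = df.withColumn(
    "data1",
    F.struct(
        F.struct(
            F.lit("value_from_x1_f2").alias("A11"),  # placeholder for x1.f2
            F.col("data1.A1.A12").alias("A12")
        ).alias("A1"),
        F.col("data1.A2").alias("A2")
    )
)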

Find number of rows in a given week in PySpark

孤者浪人 submitted on 2020-01-16 05:36:08
Question: I have a PySpark dataframe, a small portion of which is given below:

+------+-----+-------------------+-----+
|  name| type|          timestamp|score|
+------+-----+-------------------+-----+
| name1|type1|2012-01-10 00:00:00|   11|
| name1|type1|2012-01-10 00:00:10|   14|
| name1|type1|2012-01-10 00:00:20|    2|
| name1|type1|2012-01-10 00:00:30|    3|
| name1|type1|2012-01-10 00:00:40|   55|
| name1|type1|2012-01-10 00:00:50|   10|
| name5|type1|2012-01-10 00:01:00|    5|
| name2|type2|2012-01-10 00:01:10|    8|
| name5
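The rest of the question is cut off, but given the title, a common way to count rows per week is to derive year and week-of-year columns from the timestamp and group on them. A minimal sketch, assuming df is the dataframe shown above:

import pyspark.sql.functions as F

# One row per (year, ISO week) with the number of records in that week.
# Grouping by year as well keeps week 2 of 2012 separate from week 2 of 2013.
weekly_counts = (
    df.withColumn("year", F.year("timestamp"))
      .withColumn("week", F.weekofyear("timestamp"))
      .groupBy("year", "week")
      .count()
)
weekly_counts.show()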

pyspark change day in datetime column

穿精又带淫゛_ submitted on 2020-01-15 10:53:39
Question: What is wrong with this code, which tries to change the day of a datetime column?

import pyspark
import pyspark.sql.functions as sf
import pyspark.sql.types as sparktypes
import datetime

sc = pyspark.SparkContext(appName="test")
sqlcontext = pyspark.SQLContext(sc)
rdd = sc.parallelize([('a', datetime.datetime(2014, 1, 9, 0, 0)),
                      ('b', datetime.datetime(2014, 1, 27, 0, 0)),
                      ('c', datetime.datetime(2014, 1, 31, 0, 0))])
testdf = sqlcontext.createDataFrame(rdd, ["id", "date"])
print(testdf.show())
print
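The failing part of the code is cut off above, so the actual bug cannot be shown here. As a hedged sketch of one way to set the day of the month with built-in functions (avoiding Python UDFs), you can truncate to the first of the month and add an offset; note that trunc returns a DateType, so the time-of-day component is dropped:

import pyspark.sql.functions as sf

target_day = 15  # assumed target day of month, not taken from the question

testdf2 = testdf.withColumn(
    "date_new", sf.date_add(sf.trunc("date", "month"), target_day - 1)
)
testdf2.show()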

Remove duplicate rows, regardless of new information -PySpark

纵然是瞬间 submitted on 2020-01-15 10:15:39
Question: Say I have a dataframe like so:

ID  Media
1   imgix.com/20830dk
2   imgix.com/202398pwe
3   imgix.com/lvw0923dk
4   imgix.com/082kldcm
4   imgix.com/lks032m
4   imgix.com/903248

I'd like to end up with:

ID  Media
1   imgix.com/20830dk
2   imgix.com/202398pwe
3   imgix.com/lvw0923dk
4   imgix.com/082kldcm

Even though that causes me to lose 2 links for ID = 4, I don't care. Is there a simple way to do this in python/pyspark?

Answer 1:
Group by on col('ID')
Use collect_list with agg to aggregate the list
Call getItem(0)
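The answer above is truncated, but its steps translate directly into the following sketch (df is the dataframe from the question; dropDuplicates is shown as a simpler alternative, though neither approach guarantees which of the duplicate rows is kept):

from pyspark.sql import functions as F

# Keep one Media value per ID by collecting them and taking the first.
deduped = (
    df.groupBy("ID")
      .agg(F.collect_list("Media").getItem(0).alias("Media"))
)

# Simpler alternative: drop duplicate IDs directly.
deduped_alt = df.dropDuplicates(["ID"])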

How to optimize percentage check and cols drop in large pyspark dataframe?

删除回忆录丶 submitted on 2020-01-15 09:48:08
Question: I have a sample pandas dataframe as shown below, but my real data is 40 million rows and 5200 columns.

df = pd.DataFrame({
    'subject_id': [1,1,1,1,2,2,2,2,3,3,4,4,4,4,4],
    'readings': ['READ_1','READ_2','READ_1','READ_3',np.nan,'READ_5',np.nan,'READ_8','READ_10','READ_12','READ_11','READ_14','READ_09','READ_08','READ_07'],
    'val': [5,6,7,np.nan,np.nan,7,np.nan,12,13,56,32,13,45,43,46],
})

from pyspark.sql.types import *
from pyspark.sql.functions import isnan, when, count, col

mySchema =
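The question is cut off before the Spark part, so the exact bottleneck is unknown. A minimal sketch of the usual pattern, assuming spark_df is the Spark dataframe built from the pandas data and that the goal is to drop columns whose null fraction exceeds a threshold (the 0.8 cutoff is an assumption, not from the question):

from pyspark.sql.functions import col, count, when

threshold = 0.8  # assumed cutoff
total = spark_df.count()

# One aggregation pass computes the null fraction of every column at once,
# which avoids scanning the data separately for each of the 5200 columns.
# For float columns you could also OR in isnan(c) inside the when condition.
fractions = spark_df.select([
    (count(when(col(c).isNull(), c)) / total).alias(c)
    for c in spark_df.columns
]).first().asDict()

cols_to_drop = [c for c, frac in fractions.items() if frac > threshold]
result = spark_df.drop(*cols_to_drop)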

pyspark replace multiple values with null in dataframe

隐身守侯 submitted on 2020-01-15 06:43:09
Question: I have a dataframe (df), and within the dataframe I have a column user_id.

df = sc.parallelize([(1, "not_set"), (2, "user_001"), (3, "user_002"),
                     (4, "n/a"), (5, "N/A"), (6, "userid_not_set"),
                     (7, "user_003"), (8, "user_004")]).toDF(["key", "user_id"])

df:

+---+--------------+
|key|       user_id|
+---+--------------+
|  1|       not_set|
|  2|      user_003|
|  3|      user_004|
|  4|           n/a|
|  5|           N/A|
|  6|userid_not_set|
|  7|      user_003|
|  8|      user_004|
+---+--------------+

I would like to replace the following values: not
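The list of values to replace is cut off, but given the title and the sample data, a hedged sketch that maps the placeholder strings to null looks like this (the exact list below is a guess):

import pyspark.sql.functions as F

placeholder_values = ["not_set", "n/a", "N/A", "userid_not_set"]  # assumed list

df_clean = df.withColumn(
    "user_id",
    F.when(F.col("user_id").isin(placeholder_values), F.lit(None))
     .otherwise(F.col("user_id"))
)
df_clean.show()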

Write spark dataframe to single parquet file

穿精又带淫゛_ submitted on 2020-01-14 01:56:06
Question: I am trying to do something very simple and I'm having some very stupid struggles with it. I think it must come down to a fundamental misunderstanding of what Spark is doing, and I would greatly appreciate any help or explanation. I have a very large table (~3 TB, ~300MM rows, 25k partitions) saved as parquet in S3, and I would like to give someone a tiny sample of it as a single parquet file. Unfortunately, this is taking forever to finish and I don't understand why. I have tried the following:

tiny
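The attempted code is cut off, so the specific problem cannot be diagnosed here. As a sketch of one common way to produce a small single-file sample (paths and sizes are placeholders, not from the question), limit first and only then collapse to one partition, so that only the sampled rows pass through the single writing task:

sample = spark.read.parquet("s3://bucket/big_table/").limit(1000)

(sample
    .repartition(1)               # one partition -> one output parquet file
    .write
    .mode("overwrite")
    .parquet("s3://bucket/tiny_sample/"))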

Read in CSV in Pyspark with correct Datatypes

£可爱£侵袭症+ submitted on 2020-01-13 10:59:26
Question: When I try to import a local CSV with Spark, every column is read in as a string by default. However, my columns only include integers and a timestamp type. To be more specific, the CSV looks like this:

"Customer","TransDate","Quantity","PurchAmount","Cost","TransID","TransKey"
149332,"15.11.2005",1,199.95,107,127998739,100000

I have found code that should work in this question, but when I execute it, all the entries are returned as NULL. I use the following to create a custom schema:
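The schema code is cut off, but all-NULL rows from spark.read.csv usually mean the declared types or the date pattern do not match the file. A hedged sketch (column types are guesses from the sample row, and dateFormat is set to match the "15.11.2005" style shown above):

from pyspark.sql.types import (StructType, StructField, IntegerType,
                               DoubleType, DateType, LongType)

schema = StructType([
    StructField("Customer",    IntegerType(), True),
    StructField("TransDate",   DateType(),    True),
    StructField("Quantity",    IntegerType(), True),
    StructField("PurchAmount", DoubleType(),  True),
    StructField("Cost",        IntegerType(), True),
    StructField("TransID",     LongType(),    True),
    StructField("TransKey",    IntegerType(), True),
])

df = (spark.read
      .option("header", "true")
      .option("dateFormat", "dd.MM.yyyy")   # matches 15.11.2005
      .schema(schema)
      .csv("path/to/file.csv"))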