pyspark-sql

How to identify repeated occurrences of a string column in Hive?

谁说我不能喝 Submitted on 2019-12-24 07:24:06
Question: I have a view like this in Hive:

id         sequencenumber  appname
242539622  1               A
242539622  2               A
242539622  3               A
242539622  4               B
242539622  5               B
242539622  6               C
242539622  7               D
242539622  8               D
242539622  9               D
242539622  10              B
242539622  11              B
242539622  12              D
242539622  13              D
242539622  14              F

For each id, I'd like to have the following view:

id         sequencenumber  appname  appname_c
242539622  1               A        A
242539622  2               A        A
242539622  3               A        A
242539622  4               B        B_1
242539622  5               B        B_1
242539622  6               C        C
242539622  7               D        D_1
242539622  8               D        D_1
242539622  9               D
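The truncated example above is a gaps-and-islands problem. Below is a minimal, hedged sketch of one standard approach in Spark SQL; it assumes a SparkSession named spark and that the view is registered as my_view, and it only computes a per-app run number from which the appname_c labels (for example CONCAT(appname, '_', run_number) when an app has more than one run) could be derived.

# Flag the start of each consecutive run of appname, then number the runs
# per (id, appname) with a running sum over the flags.
result = spark.sql("""
    SELECT id,
           sequencenumber,
           appname,
           SUM(new_run) OVER (PARTITION BY id, appname
                              ORDER BY sequencenumber) AS run_number
    FROM (
        SELECT id,
               sequencenumber,
               appname,
               CASE WHEN appname = LAG(appname) OVER (PARTITION BY id
                                                      ORDER BY sequencenumber)
                    THEN 0 ELSE 1 END AS new_run
        FROM my_view
    ) t
""")
result.show()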

LEFT and RIGHT function in PySpark SQL

☆樱花仙子☆ Submitted on 2019-12-24 04:46:08
Question: I am new to PySpark. I pulled a csv file using pandas and created a temp table using the registerTempTable function.

from pyspark.sql import SQLContext
from pyspark.sql import Row
import pandas as pd

sqlc = SQLContext(sc)
aa1 = pd.read_csv("D:\mck1.csv")
aa2 = sqlc.createDataFrame(aa1)
aa2.show()

+--------+-------+----------+------------+---------+------------+-------------------+
|    City|     id|First_Name|Phone_Number| new_date|    new code|           New_date|
+--------+-------+----------+------------+---------
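One hedged way to emulate SQL-style LEFT/RIGHT in PySpark is substring from pyspark.sql.functions. A minimal sketch against the aa2 DataFrame from the question, assuming Phone_Number is a string column (cast it first otherwise); a negative start position counts from the end of the string.

from pyspark.sql import functions as F

aa2.select(
    F.substring('Phone_Number', 1, 3).alias('left_3'),    # first 3 characters
    F.substring('Phone_Number', -4, 4).alias('right_4')   # last 4 characters
).show()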

Selecting empty array values from a Spark DataFrame

一世执手 Submitted on 2019-12-24 00:59:56
Question: Given a DataFrame with the following rows:

rows = [
    Row(col1='abc', col2=[8], col3=[18], col4=[16]),
    Row(col1='def', col2=[18], col3=[18], col4=[]),
    Row(col1='ghi', col2=[], col3=[], col4=[])]

I'd like to remove rows with an empty array for each of col2, col3 and col4 (i.e. the 3rd row). For example, I might expect this code to work:

df.where(~df.col2.isEmpty(), ~df.col3.isEmpty(), ~df.col4.isEmpty()).collect()

I have two problems: how to combine where clauses with and, but more importantly...
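A minimal sketch of one way this could be written: size() returns the length of an array column, and the three conditions are combined with & (df and the column names come from the question).

from pyspark.sql import functions as F

df.where(
    (F.size('col2') > 0) & (F.size('col3') > 0) & (F.size('col4') > 0)
).collect()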

How to create a table as select in pyspark.sql

谁说胖子不能爱 Submitted on 2019-12-23 20:46:11
Question: Is it possible to create a table in Spark using a select statement? I do the following:

import findspark
findspark.init()
import pyspark
from pyspark.sql import SQLContext

sc = pyspark.SparkContext()
sqlCtx = SQLContext(sc)
spark_df = sqlCtx.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load("./data/documents_topics.csv")
spark_df.registerTempTable("my_table")
sqlCtx.sql("CREATE TABLE my_table_2 AS SELECT * from my_table")

but I get the error /Users/user
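A hedged sketch of two workarounds that are commonly used for a failing CTAS in this kind of setup, reusing the spark_df name from the question:

# Option 1: write the DataFrame out as a managed table directly.
spark_df.write.saveAsTable('my_table_2')

# Option 2 (assumption: a Hive metastore is available): build the session
# with Hive support enabled, then issue the original CTAS as-is.
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.enableHiveSupport().getOrCreate()
# spark.sql("CREATE TABLE my_table_2 AS SELECT * FROM my_table")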

pyspark program throwing name 'spark' is not defined

半城伤御伤魂 Submitted on 2019-12-23 15:59:13
Question: The program below throws the error name 'spark' is not defined:

Traceback (most recent call last):
  File "pgm_latest.py", line 232, in <module>
    sconf = SparkConf().set(spark.dynamicAllocation.enabled,true)
        .set(spark.dynamicAllocation.maxExecutors,300)
        .set(spark.shuffle.service.enabled,true)
        .set(spark.shuffle.spill.compress,true)
NameError: name 'spark' is not defined

spark-submit --driver-memory 12g --master yarn-cluster --executor-memory 6g --executor-cores 3 pgm_latest.py

Code

#!/usr/bin/python
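The traceback points at the unquoted configuration keys and values: Python tries to resolve spark and true as names. A minimal sketch of the likely fix, passing them as strings:

# SparkConf keys and values are plain strings, so they must be quoted.
from pyspark import SparkConf, SparkContext

sconf = (SparkConf()
         .set("spark.dynamicAllocation.enabled", "true")
         .set("spark.dynamicAllocation.maxExecutors", "300")
         .set("spark.shuffle.service.enabled", "true")
         .set("spark.shuffle.spill.compress", "true"))
sc = SparkContext(conf=sconf)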

pyspark replace all values in dataframe with another values

醉酒当歌 Submitted on 2019-12-23 14:59:12
Question: I have 500 columns in my pyspark data frame... Some are of string type, some int and some boolean (100 boolean columns). All the boolean columns have two distinct levels, Yes and No, and I want to convert those into 1/0. For the string columns I have three values: passed, failed and null. How do I replace those nulls with 0? fillna(0) works only with integers.

c1 | c2  | c3     | c4  | c5 ... | c500
yes| yes | passed | 45  ...
No | Yes | failed | 452 ...
Yes| No  | None   | 32  ...

When I do df.replace(yes,1) I get
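A hedged sketch of one way to handle both cases: map Yes/No to 1/0 with when/otherwise, and fill nulls in the string columns with a string value (a plain fillna(0) only touches numeric columns). The column lists are hypothetical placeholders for the real ones.

from pyspark.sql import functions as F

bool_cols = ['c1', 'c2']   # hypothetical: the 100 Yes/No columns
str_cols = ['c3']          # hypothetical: the passed/failed/null columns

for c in bool_cols:
    df = df.withColumn(c, F.when(F.lower(F.col(c)) == 'yes', 1).otherwise(0))

df = df.fillna('0', subset=str_cols)   # fills nulls in string columns only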

Pyspark dataframe how to drop rows with nulls in all columns?

◇◆丶佛笑我妖孽 Submitted on 2019-12-23 07:29:12
Question: For a dataframe, before it looks like:

+----+----+----+
|  ID|TYPE|CODE|
+----+----+----+
|   1|   B|  X1|
|null|null|null|
|null|   B|  X1|
+----+----+----+

Afterwards, I hope it looks like:

+----+----+----+
|  ID|TYPE|CODE|
+----+----+----+
|   1|   B|  X1|
|null|   B|  X1|
+----+----+----+

I prefer a general method such that it can apply when df.columns is very long. Thanks!

Answer 1: One option is to use functools.reduce to construct the conditions:

from functools import reduce
df.filter(~reduce(lambda x, y: x & y, [df[c]
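A hedged completion of the reduce-based filter the answer starts above, together with the simpler built-in alternative; both keep rows that have at least one non-null column.

from functools import reduce

# Keep a row unless every column in it is null.
df_filtered = df.filter(~reduce(lambda x, y: x & y,
                                [df[c].isNull() for c in df.columns]))

# Equivalent built-in shortcut:
# df_filtered = df.dropna(how='all')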

Delete azure sql database rows from azure databricks

梦想与她 Submitted on 2019-12-23 04:54:09
Question: I have a table in an Azure SQL database from which I want to delete either selected rows based on some criteria or the entire table, from Azure Databricks. Currently I am using the truncate property of JDBC to truncate the entire table without dropping it and then re-write it with a new dataframe.

df.write \
  .option('user', jdbcUsername) \
  .option('password', jdbcPassword) \
  .jdbc('<connection_string>', '<table_name>', mode = 'overwrite', properties = {'truncate' : 'true'})

But going forward I don't
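Spark's JDBC writer cannot issue DELETE statements, so selective deletes are usually done over a direct database connection. A hedged sketch using pyodbc (assumptions: pyodbc and the ODBC driver are installed on the cluster; the server, database, table and WHERE clause are placeholders):

import pyodbc

conn = pyodbc.connect(
    'DRIVER={ODBC Driver 17 for SQL Server};'
    'SERVER=<server>.database.windows.net;'
    'DATABASE=<database>;'
    'UID=' + jdbcUsername + ';PWD=' + jdbcPassword)   # credentials from above
cursor = conn.cursor()
cursor.execute('DELETE FROM <table_name> WHERE <condition>')  # selective delete
conn.commit()
conn.close()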

How to transform DataFrame per one column to create two new columns in pyspark?

谁说胖子不能爱 Submitted on 2019-12-23 03:39:19
Question: I have a dataframe "x" in which there are two columns, "x1" and "x2":

x1 (status)  x2
kv,true      45
bm,true      65
mp,true      75
kv,null      450
bm,null      550
mp,null      650

I want to convert this dataframe into a format in which the data is pivoted according to its status and value:

x1  true  null
kv  45    450
bm  65    550
mp  75    650

Is there a way to do this? I am using a pyspark dataframe.

Answer 1: Yes, there is a way. First split the first column by , using the split function, then split this dataframe into two dataframes (using
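A hedged sketch following the answer's outline: split x1 on the comma into a key and a status, then pivot the status values into columns (x, x1 and x2 are the names from the question; here "null" is the literal string stored in x1).

from pyspark.sql import functions as F

# Split "kv,true" into key = "kv" and status = "true".
parts = (x.withColumn('key', F.split('x1', ',')[0])
          .withColumn('status', F.split('x1', ',')[1]))

# Pivot the status values into columns, keeping the x2 value per cell.
result = (parts.groupBy('key')
               .pivot('status', ['true', 'null'])
               .agg(F.first('x2')))
result.show()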

Pyspark - how to backfill a DataFrame?

拟墨画扇 Submitted on 2019-12-22 13:50:24
Question: How can you do the same thing as df.fillna(method='bfill') for a pandas dataframe with a pyspark.sql.DataFrame? The pyspark dataframe has the pyspark.sql.DataFrame.fillna method, however there is no support for a method parameter. In pandas you can use the following to backfill a time series:

Create data

import pandas as pd

index = pd.date_range('2017-01-01', '2017-01-05')
data = [1, 2, 3, None, 5]
df = pd.DataFrame({'data': data}, index=index)

Giving

Out[1]:
            data
2017-01-01   1.0
2017-01-02   2
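A hedged sketch of a backfill in PySpark using a window that looks from the current row forward and takes the first non-null value; sdf is a hypothetical Spark DataFrame with a date column and a data column mirroring the pandas example (note that an un-partitioned window moves all rows onto a single partition).

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Window from the current row to the end of the (ordered) frame.
w = Window.orderBy('date').rowsBetween(0, Window.unboundedFollowing)

# For each row, take the first non-null data value at or after that row.
sdf_bfill = sdf.withColumn('data_bfill',
                           F.first('data', ignorenulls=True).over(w))
sdf_bfill.show()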