pyspark-sql

How to identify repeated occurrences of a string column in Hive?

谁说我不能喝 Submitted on 2019-12-24 07:24:06
Question: I have a view like this in Hive:

id         sequencenumber  appname
242539622  1               A
242539622  2               A
242539622  3               A
242539622  4               B
242539622  5               B
242539622  6               C
242539622  7               D
242539622  8               D
242539622  9               D
242539622  10              B
242539622  11              B
242539622  12              D
242539622  13              D
242539622  14              F

For each id, I'd like to have the following view:

id         sequencenumber  appname  appname_c
242539622  1               A        A
242539622  2               A        A
242539622  3               A        A
242539622  4               B        B_1
242539622  5               B        B_1
242539622  6               C        C
242539622  7               D        D_1
242539622  8               D        D_1
242539622  9               D
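The truncated example above is a gaps-and-islands problem. Below is a minimal, hedged sketch of one standard approach in Spark SQL; it assumes a SparkSession named spark and that the view is registered as my_view, and it only computes a per-app run number from which the appname_c labels (for example CONCAT(appname, '_', run_number) when an app has more than one run) could be derived.

# Flag the start of each consecutive run of appname, then number the runs
# per (id, appname) with a running sum over the flags.
result = spark.sql("""
    SELECT id,
           sequencenumber,
           appname,
           SUM(new_run) OVER (PARTITION BY id, appname
                              ORDER BY sequencenumber) AS run_number
    FROM (
        SELECT id,
               sequencenumber,
               appname,
               CASE WHEN appname = LAG(appname) OVER (PARTITION BY id
                                                      ORDER BY sequencenumber)
                    THEN 0 ELSE 1 END AS new_run
        FROM my_view
    ) t
""")
result.show()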

LEFT and RIGHT function in PySpark SQL

☆樱花仙子☆ Submitted on 2019-12-24 04:46:08
Question: I am new to PySpark. I pulled a csv file using pandas and created a temp table using the registerTempTable function.

from pyspark.sql import SQLContext
from pyspark.sql import Row
import pandas as pd

sqlc = SQLContext(sc)
aa1 = pd.read_csv("D:\mck1.csv")
aa2 = sqlc.createDataFrame(aa1)
aa2.show()

+--------+-------+----------+------------+---------+------------+-------------------+
|    City|     id|First_Name|Phone_Number| new_date|    new code|           New_date|
+--------+-------+----------+------------+---------
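One hedged way to emulate SQL-style LEFT/RIGHT in PySpark is substring from pyspark.sql.functions. A minimal sketch against the aa2 DataFrame from the question, assuming Phone_Number is a string column (cast it first otherwise); a negative start position counts from the end of the string.

from pyspark.sql import functions as F

aa2.select(
    F.substring('Phone_Number', 1, 3).alias('left_3'),    # first 3 characters
    F.substring('Phone_Number', -4, 4).alias('right_4')   # last 4 characters
).show()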

Selecting empty array values from a Spark DataFrame

一世执手 Submitted on 2019-12-24 00:59:56
Question: Given a DataFrame with the following rows:

rows = [
    Row(col1='abc', col2=[8], col3=[18], col4=[16]),
    Row(col1='def', col2=[18], col3=[18], col4=[]),
    Row(col1='ghi', col2=[], col3=[], col4=[])]

I'd like to remove rows with an empty array for each of col2, col3 and col4 (i.e. the 3rd row). For example, I might expect this code to work:

df.where(~df.col2.isEmpty(), ~df.col3.isEmpty(), ~df.col4.isEmpty()).collect()

I have two problems: how to combine where clauses with and, but more importantly...
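A minimal sketch of one way this could be written: size() returns the length of an array column, and the three conditions are combined with & (df and the column names come from the question).

from pyspark.sql import functions as F

df.where(
    (F.size('col2') > 0) & (F.size('col3') > 0) & (F.size('col4') > 0)
).collect()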

How to create a table as select in pyspark.sql

谁说胖子不能爱 Submitted on 2019-12-23 20:46:11
Question: Is it possible to create a table in Spark using a select statement? I do the following:

import findspark
findspark.init()
import pyspark
from pyspark.sql import SQLContext

sc = pyspark.SparkContext()
sqlCtx = SQLContext(sc)
spark_df = sqlCtx.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load("./data/documents_topics.csv")
spark_df.registerTempTable("my_table")
sqlCtx.sql("CREATE TABLE my_table_2 AS SELECT * from my_table")

but I get the error /Users/user
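A hedged sketch of two workarounds that are commonly used for a failing CTAS in this kind of setup, reusing the spark_df name from the question:

# Option 1: write the DataFrame out as a managed table directly.
spark_df.write.saveAsTable('my_table_2')

# Option 2 (assumption: a Hive metastore is available): build the session
# with Hive support enabled, then issue the original CTAS as-is.
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.enableHiveSupport().getOrCreate()
# spark.sql("CREATE TABLE my_table_2 AS SELECT * FROM my_table")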

pyspark program throwing name 'spark' is not defined

半城伤御伤魂 Submitted on 2019-12-23 15:59:13
Question: The program below throws the error name 'spark' is not defined:

Traceback (most recent call last):
  File "pgm_latest.py", line 232, in <module>
    sconf = SparkConf().set(spark.dynamicAllocation.enabled,true)
        .set(spark.dynamicAllocation.maxExecutors,300)
        .set(spark.shuffle.service.enabled,true)
        .set(spark.shuffle.spill.compress,true)
NameError: name 'spark' is not defined

spark-submit --driver-memory 12g --master yarn-cluster --executor-memory 6g --executor-cores 3 pgm_latest.py

Code

#!/usr/bin/python
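The traceback points at the unquoted configuration keys and values: Python tries to resolve spark and true as names. A minimal sketch of the likely fix, passing them as strings:

# SparkConf keys and values are plain strings, so they must be quoted.
from pyspark import SparkConf, SparkContext

sconf = (SparkConf()
         .set("spark.dynamicAllocation.enabled", "true")
         .set("spark.dynamicAllocation.maxExecutors", "300")
         .set("spark.shuffle.service.enabled", "true")
         .set("spark.shuffle.spill.compress", "true"))
sc = SparkContext(conf=sconf)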

pyspark replace all values in dataframe with another values

醉酒当歌 Submitted on 2019-12-23 14:59:12
Question: I have 500 columns in my pyspark data frame... Some are of string type, some int and some boolean (100 boolean columns). All the boolean columns have two distinct levels, Yes and No, and I want to convert those into 1/0. For the string columns I have three values: passed, failed and null. How do I replace those nulls with 0? fillna(0) works only with integers.

c1 | c2  | c3     | c4  | c5 ... | c500
yes| yes | passed | 45  ...
No | Yes | failed | 452 ...
Yes| No  | None   | 32  ...

When I do df.replace(yes,1) I get
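A hedged sketch of one way to handle both cases: map Yes/No to 1/0 with when/otherwise, and fill nulls in the string columns with a string value (a plain fillna(0) only touches numeric columns). The column lists are hypothetical placeholders for the real ones.

from pyspark.sql import functions as F

bool_cols = ['c1', 'c2']   # hypothetical: the 100 Yes/No columns
str_cols = ['c3']          # hypothetical: the passed/failed/null columns

for c in bool_cols:
    df = df.withColumn(c, F.when(F.lower(F.col(c)) == 'yes', 1).otherwise(0))

df = df.fillna('0', subset=str_cols)   # fills nulls in string columns only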

Pyspark dataframe how to drop rows with nulls in all columns?

◇◆丶佛笑我妖孽 Submitted on 2019-12-23 07:29:12
Question: For a dataframe, before it looks like:

+----+----+----+
|  ID|TYPE|CODE|
+----+----+----+
|   1|   B|  X1|
|null|null|null|
|null|   B|  X1|
+----+----+----+

Afterwards, I hope it looks like:

+----+----+----+
|  ID|TYPE|CODE|
+----+----+----+
|   1|   B|  X1|
|null|   B|  X1|
+----+----+----+

I prefer a general method such that it can apply when df.columns is very long. Thanks!

Answer 1: One option is to use functools.reduce to construct the conditions:

from functools import reduce
df.filter(~reduce(lambda x, y: x & y, [df[c]
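A hedged completion of the reduce-based filter the answer starts above, together with the simpler built-in alternative; both keep rows that have at least one non-null column.

from functools import reduce

# Keep a row unless every column in it is null.
df_filtered = df.filter(~reduce(lambda x, y: x & y,
                                [df[c].isNull() for c in df.columns]))

# Equivalent built-in shortcut:
# df_filtered = df.dropna(how='all')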

Delete azure sql database rows from azure databricks

梦想与她 Submitted on 2019-12-23 04:54:09
Question: I have a table in an Azure SQL database from which I want to delete either selected rows based on some criteria or the entire table, from Azure Databricks. Currently I am using the truncate property of JDBC to truncate the entire table without dropping it and then re-write it with a new dataframe.

df.write \
  .option('user', jdbcUsername) \
  .option('password', jdbcPassword) \
  .jdbc('<connection_string>', '<table_name>', mode = 'overwrite', properties = {'truncate' : 'true'})

But going forward I don't
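Spark's JDBC writer cannot issue DELETE statements, so selective deletes are usually done over a direct database connection. A hedged sketch using pyodbc (assumptions: pyodbc and the ODBC driver are installed on the cluster; the server, database, table and WHERE clause are placeholders):

import pyodbc

conn = pyodbc.connect(
    'DRIVER={ODBC Driver 17 for SQL Server};'
    'SERVER=<server>.database.windows.net;'
    'DATABASE=<database>;'
    'UID=' + jdbcUsername + ';PWD=' + jdbcPassword)   # credentials from above
cursor = conn.cursor()
cursor.execute('DELETE FROM <table_name> WHERE <condition>')  # selective delete
conn.commit()
conn.close()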

How to transform DataFrame per one column to create two new columns in pyspark?

谁说胖子不能爱 Submitted on 2019-12-23 03:39:19
Question: I have a dataframe "x" in which there are two columns, "x1" and "x2":

x1 (status)  x2
kv,true      45
bm,true      65
mp,true      75
kv,null      450
bm,null      550
mp,null      650

I want to convert this dataframe into a format in which the data is pivoted according to its status and value:

x1  true  null
kv  45    450
bm  65    550
mp  75    650

Is there a way to do this? I am using a pyspark dataframe.

Answer 1: Yes, there is a way. First split the first column by , using the split function, then split this dataframe into two dataframes (using
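A hedged sketch following the answer's outline: split x1 on the comma into a key and a status, then pivot the status values into columns (x, x1 and x2 are the names from the question; here "null" is the literal string stored in x1).

from pyspark.sql import functions as F

# Split "kv,true" into key = "kv" and status = "true".
parts = (x.withColumn('key', F.split('x1', ',')[0])
          .withColumn('status', F.split('x1', ',')[1]))

# Pivot the status values into columns, keeping the x2 value per cell.
result = (parts.groupBy('key')
               .pivot('status', ['true', 'null'])
               .agg(F.first('x2')))
result.show()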

Pyspark - how to backfill a DataFrame?

拟墨画扇 Submitted on 2019-12-22 13:50:24
Question: How can you do the same thing as df.fillna(method='bfill') for a pandas dataframe with a pyspark.sql.DataFrame? The pyspark dataframe has the pyspark.sql.DataFrame.fillna method, however there is no support for a method parameter. In pandas you can use the following to backfill a time series:

Create data

import pandas as pd

index = pd.date_range('2017-01-01', '2017-01-05')
data = [1, 2, 3, None, 5]
df = pd.DataFrame({'data': data}, index=index)

Giving

Out[1]:
            data
2017-01-01   1.0
2017-01-02   2
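A hedged sketch of a backfill in PySpark using a window that looks from the current row forward and takes the first non-null value; sdf is a hypothetical Spark DataFrame with a date column and a data column mirroring the pandas example (note that an un-partitioned window moves all rows onto a single partition).

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Window from the current row to the end of the (ordered) frame.
w = Window.orderBy('date').rowsBetween(0, Window.unboundedFollowing)

# For each row, take the first non-null data value at or after that row.
sdf_bfill = sdf.withColumn('data_bfill',
                           F.first('data', ignorenulls=True).over(w))
sdf_bfill.show()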