pyspark-sql

Is there a way to load multiple text files into a single dataframe using Databricks?

匆匆过客 submitted on 2019-12-11 14:32:09

Question: I am trying to test a few ideas for recursively looping through all files in a folder and its sub-folders, and loading everything into a single dataframe. I have 12 different kinds of files, and the differences are based on the file naming conventions: file names that start with 'ABC', file names that start with 'CN', file names that start with 'CZ', and so on. I tried the following 3 ideas.

import pyspark
import os.path
from pyspark.sql import SQLContext
from pyspark.sql.functions import
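A minimal sketch of one approach (the /mnt/data paths are made up for illustration): spark.read.text accepts a list of paths or glob patterns, and Spark 3.0+ can also walk sub-folders on its own.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One read call can take a list of glob patterns, one per naming convention.
paths = ["/mnt/data/ABC*", "/mnt/data/CN*", "/mnt/data/CZ*"]
df = spark.read.text(paths)

# On Spark 3.0+, recursiveFileLookup makes the reader descend into sub-folders.
df_all = spark.read.option("recursiveFileLookup", "true").text("/mnt/data/")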

how to convert html text into plain text using pyspark? Replacing html tags from string

空扰寡人 submitted on 2019-12-11 14:26:47

Question: I have a text file with a column 'descn' that contains text in HTML format. I want to convert the HTML text into plain text using pyspark. Please help me do this.

file name: mdcl_insigt.txt

input: PROTEUS <div><br></div><div>We are struggling with pathology. We don't control specimens of prostatectomy. The hospital pathology is not cooperating. I am reaching out to another hospital. You have pretty intense manual guidelines on pathology in the [PROTEUS]
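A minimal sketch, assuming the file is tab-delimited with a header row (an assumption; the question is cut off before the file format is described): regexp_replace with a tag-matching pattern strips the HTML markup from descn.

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace, col

spark = SparkSession.builder.getOrCreate()

# Assumed format: tab-separated with a header row containing 'descn'.
df = spark.read.option("header", True).option("sep", "\t").csv("mdcl_insigt.txt")

# Remove anything that looks like an HTML tag, e.g. <div> or <br>.
df_plain = df.withColumn("descn", regexp_replace(col("descn"), r"<[^>]+>", ""))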

Pyspark Dataframe: Get previous row that meets a condition

只愿长相守 submitted on 2019-12-11 09:24:27

Question: For every row in a PySpark DataFrame I am trying to get a value from the first preceding row that satisfied a certain condition. That is, if my dataframe looks like this:

X  | Flag
1  | 1
2  | 0
3  | 0
4  | 0
5  | 1
6  | 0
7  | 0
8  | 0
9  | 1
10 | 0

I want output that looks like this:

X  | Lag_X | Flag
1  | NULL  | 1
2  | 1     | 0
3  | 1     | 0
4  | 1     | 0
5  | 1     | 1
6  | 5     | 0
7  | 5     | 0
8  | 5     | 0
9  | 5     | 1
10 | 9     | 0

I thought I could do this with the lag function and a WindowSpec; unfortunately WindowSpec doesn't
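One way to express this, shown as a sketch rather than the asker's own solution: keep X only where Flag is 1, then take the last non-null value over a window covering all strictly preceding rows.

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, when, last

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 1), (2, 0), (3, 0), (4, 0), (5, 1), (6, 0), (7, 0), (8, 0), (9, 1), (10, 0)],
    ["X", "Flag"],
)

# All rows strictly before the current one, ordered by X.
w = Window.orderBy("X").rowsBetween(Window.unboundedPreceding, -1)

# Last preceding X whose Flag was 1; NULL when no such row exists.
df_out = df.withColumn(
    "Lag_X", last(when(col("Flag") == 1, col("X")), ignorenulls=True).over(w)
)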

pySpark: java.lang.UnsupportedOperationException: Unimplemented type: StringType

泄露秘密 submitted on 2019-12-11 09:05:47

Question: While reading a group of parquet files written with inconsistent schemas, we have an issue with schema merging. On switching to manually specifying the schema I get the following error. Any pointer will be helpful.

java.lang.UnsupportedOperationException: Unimplemented type: StringType
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readDoubleBatch(VectorizedColumnReader.java:389)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch
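The readDoubleBatch frame suggests the manual schema declares a column as DoubleType while some files store it as a string, which the vectorized Parquet reader cannot reconcile. A hedged workaround sketch (the settings are standard Spark configuration, but whether they resolve this particular mismatch is an assumption; the path is a placeholder):

# Fall back to the non-vectorized Parquet reader and let Spark merge schemas
# across files instead of forcing a single hand-written schema.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
df = spark.read.option("mergeSchema", "true").parquet("/path/to/parquet")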

Break down a table to pivot in columns (SQL,PYSPARK)

ε祈祈猫儿з submitted on 2019-12-11 08:55:06

Question: I'm working in a pyspark environment with python3.6 in AWS Glue. I have this table:

+----+-----+-----+-----+
|year|month|total| loop|
+----+-----+-----+-----+
|2012|    1|   20|loop1|
|2012|    2|   30|loop1|
|2012|    1|   10|loop2|
|2012|    2|    5|loop2|
|2012|    1|   50|loop3|
|2012|    2|   60|loop3|
+----+-----+-----+-----+

And I need to get an output like:

year month total_loop1 total_loop2 total_loop3
2012     1          20          10          50
2012     2          30           5          60

The closest I have gotten is with the SQL code:

select a.year,a.month, a.total,b
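The same reshaping can be done without self-joins by using groupBy().pivot(); a sketch against the sample rows:

from pyspark.sql import SparkSession
from pyspark.sql.functions import first

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(2012, 1, 20, "loop1"), (2012, 2, 30, "loop1"), (2012, 1, 10, "loop2"),
     (2012, 2, 5, "loop2"), (2012, 1, 50, "loop3"), (2012, 2, 60, "loop3")],
    ["year", "month", "total", "loop"],
)

# pivot() turns each distinct loop value into its own column.
pivoted = df.groupBy("year", "month").pivot("loop").agg(first("total"))

# Rename loop1 -> total_loop1 and so on to match the desired header.
for c in ["loop1", "loop2", "loop3"]:
    pivoted = pivoted.withColumnRenamed(c, "total_" + c)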

How to change column name of a dataframe with respect to other dataframe

假如想象 submitted on 2019-12-11 08:54:47

Question: I have a requirement to change the column names of a dataframe df based on another dataframe df_col, using pyspark.

df
+----+---+----+----+
|code| id|name|work|
+----+---+----+----+
| ASD|101|John| DEV|
| klj|102| ben|prod|
+----+---+----+----+

df_col
+-----------+-----------+
|col_current|col_updated|
+-----------+-----------+
|         id|     Row_id|
|       name|       Name|
|       code|   Row_code|
|       Work|  Work_Code|
+-----------+-----------+

If a column of df matches a col_current value, that column should be renamed to the corresponding col_updated value.
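Since df_col is small it can be collected to the driver and applied with withColumnRenamed; the lookup is made case-insensitive here because df has 'work' while df_col lists 'Work'. A sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("ASD", 101, "John", "DEV"), ("klj", 102, "ben", "prod")],
    ["code", "id", "name", "work"],
)
df_col = spark.createDataFrame(
    [("id", "Row_id"), ("name", "Name"), ("code", "Row_code"), ("Work", "Work_Code")],
    ["col_current", "col_updated"],
)

# Collect the small lookup table into a dict keyed by lower-cased column name.
mapping = {r["col_current"].lower(): r["col_updated"] for r in df_col.collect()}

renamed = df
for old in df.columns:
    if old.lower() in mapping:
        renamed = renamed.withColumnRenamed(old, mapping[old.lower()])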

converting user-item-rating list to user-item-matrix with pyspark

痞子三分冷 submitted on 2019-12-11 08:47:14

Question: This is how the user-item-rating list looks as a pandas dataframe:

  item_id  rating user_id
0 aaaaaaa       5       X
1 bbbbbbb       2       Y
2 ccccccc       5       Z
3 ddddddd       1       T

This is how I create the user-item matrix in pandas, and it only takes a couple of seconds with the real dataset (about 500k rows):

user_item_matrix = df.pivot(index = 'user_id', columns ='item_id', values = 'rating')

item_id  aaaaaaa  bbbbbbb  ccccccc  ddddddd
user_id
T            NaN      NaN      NaN      1.0
X            5.0      NaN      NaN      NaN
Y            NaN      2.0      NaN      NaN
Z            NaN      NaN      5.0      NaN

I am trying this approach
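The pandas pivot has a direct Spark counterpart in groupBy().pivot(); a sketch using the sample rows:

from pyspark.sql import SparkSession
from pyspark.sql.functions import first

spark = SparkSession.builder.getOrCreate()

ratings = spark.createDataFrame(
    [("aaaaaaa", 5, "X"), ("bbbbbbb", 2, "Y"), ("ccccccc", 5, "Z"), ("ddddddd", 1, "T")],
    ["item_id", "rating", "user_id"],
)

# One row per user, one column per item; missing ratings come back as null.
user_item_matrix = ratings.groupBy("user_id").pivot("item_id").agg(first("rating"))

With a real catalogue, the number of distinct item_id values may exceed spark.sql.pivotMaxValues (10000 by default); passing the explicit list of items to pivot() avoids that limit and an extra distinct scan.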

How to extract array element from PySpark dataframe conditioned on different column?

你说的曾经没有我的故事 submitted on 2019-12-11 08:13:39

Question: I have the following PySpark input Dataframe:

+-------+------------+
| index | valuelist  |
+-------+------------+
| 1.0   | [10,20,30] |
| 2.0   | [11,21,31] |
| 0.0   | [14,12,15] |
+-------+------------+

Where:
index: type Double
valuelist: type Vector (it's NOT Array)

From the above input Dataframe, I want to get the following output Dataframe in PySpark:

+-------+-------+
| index | value |
+-------+-------+
| 1.0   | 20    |
| 2.0   | 31    |
| 0.0   | 14    |
+-------+-------+

Logic: for each row: value =
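The Logic line is cut off, but the sample output suggests each row's value is the valuelist element at position index. Assuming that, a small UDF does the job, since ML Vector columns cannot be indexed with getItem(). A sketch:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import DoubleType
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1.0, Vectors.dense([10, 20, 30])),
     (2.0, Vectors.dense([11, 21, 31])),
     (0.0, Vectors.dense([14, 12, 15]))],
    ["index", "valuelist"],
)

# Pull out the vector element whose position equals the row's index value.
pick = udf(lambda idx, vec: float(vec[int(idx)]), DoubleType())
out = df.select("index", pick(col("index"), col("valuelist")).alias("value"))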

How to remove automatically added back ticks while using explode() in pyspark?

拥有回忆 submitted on 2019-12-11 07:27:56

Question: I want to add a new column with an expression as defined here (https://www.mien.in/2018/03/25/reshaping-dataframe-using-pivot-and-melt-in-apache-spark-and-pandas/#pivot-in-spark). While doing so, my explode() function changes the column names it looks up by adding back ticks (" ` ") at the beginning and at the end of each column, which then gives the error:

Cannot resolve column name `Column_name` from [Column_name, Column_name2]

I tried reading the documentation and a few other questions on SO
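For context, a minimal melt() sketch in the spirit of the linked recipe (not the asker's own code): it refers to columns through col() and the intermediate DataFrame rather than interpolating names into SQL strings, which is one way the back-tick resolution error can be avoided, assuming the column names contain no characters Spark needs to escape.

from pyspark.sql.functions import array, col, explode, lit, struct

def melt(df, id_vars, value_vars, var_name="variable", value_name="value"):
    # One struct per value column: (column name, column value).
    vars_and_vals = array(*(
        struct(lit(c).alias(var_name), col(c).alias(value_name))
        for c in value_vars
    ))
    # Explode the array so each (name, value) pair becomes its own row.
    tmp = df.withColumn("_vars_and_vals", explode(vars_and_vals))
    cols = list(id_vars) + [
        tmp["_vars_and_vals"][x].alias(x) for x in (var_name, value_name)
    ]
    return tmp.select(*cols)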

PySpark trying to apply previous field's schema to next field

眉间皱痕 submitted on 2019-12-11 07:05:54

Question: Having a weird issue with PySpark. It seems to be trying to apply the schema for the previous field to the next field as it's processing. Simplest test case I could come up with:

%pyspark
from pyspark.sql.types import (
    DateType,
    StructType,
    StructField,
    StringType,
)
from datetime import date
from pyspark.sql import Row

schema = StructType(
    [
        StructField("date", DateType(), True),
        StructField("country", StringType(), True),
    ]
)
test = spark.createDataFrame(
    [
        Row(
            date=date(2019, 1, 1),
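The test case is cut off, but a likely cause (an assumption, not stated in the question) is that in PySpark 2.x Row(**kwargs) sorts its fields alphabetically, so the data arrives as (country, date) while the schema declares (date, country), and each value is validated against the neighbouring field's type. Passing plain tuples in schema order sidesteps this; "US" below is a made-up value.

# Tuples keep the positional order that the schema declares.
test = spark.createDataFrame(
    [(date(2019, 1, 1), "US")],
    schema=schema,
)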