pyspark-sql

Is there a way to load multiple text files into a single dataframe using Databricks?

匆匆过客 submitted on 2019-12-11 14:32:09

Question: I am trying to test a few ideas for recursively looping through all files in a folder and its sub-folders, and loading everything into a single dataframe. I have 12 different kinds of files, and the differences are based on the file naming conventions: file names that start with 'ABC', file names that start with 'CN', file names that start with 'CZ', and so on. I tried the following 3 ideas.

import pyspark
import os.path
from pyspark.sql import SQLContext
from pyspark.sql.functions import
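A minimal sketch of one approach (the /mnt/data paths are made up for illustration): spark.read.text accepts a list of paths or glob patterns, and Spark 3.0+ can also walk sub-folders on its own.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One read call can take a list of glob patterns, one per naming convention.
paths = ["/mnt/data/ABC*", "/mnt/data/CN*", "/mnt/data/CZ*"]
df = spark.read.text(paths)

# On Spark 3.0+, recursiveFileLookup makes the reader descend into sub-folders.
df_all = spark.read.option("recursiveFileLookup", "true").text("/mnt/data/")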

how to convert html text into plain text using pyspark? Replacing html tags from string

空扰寡人 submitted on 2019-12-11 14:26:47

Question: I have a text file with a column 'descn' that contains text in HTML format. I want to convert the HTML text into plain text using pyspark. Please help me do this.

file name: mdcl_insigt.txt

input: PROTEUS <div><br></div><div>We are struggling with pathology. We don't control specimens of prostatectomy. The hospital pathology is not cooperating. I am reaching out to another hospital. You have pretty intense manual guidelines on pathology in the [PROTEUS]
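A minimal sketch, assuming the file is tab-delimited with a header row (an assumption; the question is cut off before the file format is described): regexp_replace with a tag-matching pattern strips the HTML markup from descn.

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace, col

spark = SparkSession.builder.getOrCreate()

# Assumed format: tab-separated with a header row containing 'descn'.
df = spark.read.option("header", True).option("sep", "\t").csv("mdcl_insigt.txt")

# Remove anything that looks like an HTML tag, e.g. <div> or <br>.
df_plain = df.withColumn("descn", regexp_replace(col("descn"), r"<[^>]+>", ""))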

Pyspark Dataframe: Get previous row that meets a condition

只愿长相守 submitted on 2019-12-11 09:24:27

Question: For every row in a PySpark DataFrame I am trying to get a value from the first preceding row that satisfied a certain condition. That is, if my dataframe looks like this:

X  | Flag
1  | 1
2  | 0
3  | 0
4  | 0
5  | 1
6  | 0
7  | 0
8  | 0
9  | 1
10 | 0

I want output that looks like this:

X  | Lag_X | Flag
1  | NULL  | 1
2  | 1     | 0
3  | 1     | 0
4  | 1     | 0
5  | 1     | 1
6  | 5     | 0
7  | 5     | 0
8  | 5     | 0
9  | 5     | 1
10 | 9     | 0

I thought I could do this with the lag function and a WindowSpec; unfortunately WindowSpec doesn't
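One way to express this, shown as a sketch rather than the asker's own solution: keep X only where Flag is 1, then take the last non-null value over a window covering all strictly preceding rows.

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, when, last

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 1), (2, 0), (3, 0), (4, 0), (5, 1), (6, 0), (7, 0), (8, 0), (9, 1), (10, 0)],
    ["X", "Flag"],
)

# All rows strictly before the current one, ordered by X.
w = Window.orderBy("X").rowsBetween(Window.unboundedPreceding, -1)

# Last preceding X whose Flag was 1; NULL when no such row exists.
df_out = df.withColumn(
    "Lag_X", last(when(col("Flag") == 1, col("X")), ignorenulls=True).over(w)
)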

pySpark: java.lang.UnsupportedOperationException: Unimplemented type: StringType

泄露秘密 submitted on 2019-12-11 09:05:47

Question: While reading a group of parquet files written with inconsistent schemas, we have an issue with schema merging. On switching to manually specifying the schema I get the following error. Any pointer will be helpful.

java.lang.UnsupportedOperationException: Unimplemented type: StringType
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readDoubleBatch(VectorizedColumnReader.java:389)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch
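The readDoubleBatch frame suggests the manual schema declares a column as DoubleType while some files store it as a string, which the vectorized Parquet reader cannot reconcile. A hedged workaround sketch (the settings are standard Spark configuration, but whether they resolve this particular mismatch is an assumption; the path is a placeholder):

# Fall back to the non-vectorized Parquet reader and let Spark merge schemas
# across files instead of forcing a single hand-written schema.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
df = spark.read.option("mergeSchema", "true").parquet("/path/to/parquet")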

Break down a table to pivot in columns (SQL,PYSPARK)

ε祈祈猫儿з submitted on 2019-12-11 08:55:06

Question: I'm working in a pyspark environment with python3.6 in AWS Glue. I have this table:

+----+-----+-----+-----+
|year|month|total| loop|
+----+-----+-----+-----+
|2012|    1|   20|loop1|
|2012|    2|   30|loop1|
|2012|    1|   10|loop2|
|2012|    2|    5|loop2|
|2012|    1|   50|loop3|
|2012|    2|   60|loop3|
+----+-----+-----+-----+

And I need to get an output like:

year month total_loop1 total_loop2 total_loop3
2012     1          20          10          50
2012     2          30           5          60

The closest I have gotten is with the SQL code:

select a.year,a.month, a.total,b
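The same reshaping can be done without self-joins by using groupBy().pivot(); a sketch against the sample rows:

from pyspark.sql import SparkSession
from pyspark.sql.functions import first

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(2012, 1, 20, "loop1"), (2012, 2, 30, "loop1"), (2012, 1, 10, "loop2"),
     (2012, 2, 5, "loop2"), (2012, 1, 50, "loop3"), (2012, 2, 60, "loop3")],
    ["year", "month", "total", "loop"],
)

# pivot() turns each distinct loop value into its own column.
pivoted = df.groupBy("year", "month").pivot("loop").agg(first("total"))

# Rename loop1 -> total_loop1 and so on to match the desired header.
for c in ["loop1", "loop2", "loop3"]:
    pivoted = pivoted.withColumnRenamed(c, "total_" + c)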

How to change column name of a dataframe with respect to other dataframe

假如想象 submitted on 2019-12-11 08:54:47

Question: I have a requirement to change the column names of a dataframe df based on another dataframe df_col, using pyspark.

df
+----+---+----+----+
|code| id|name|work|
+----+---+----+----+
| ASD|101|John| DEV|
| klj|102| ben|prod|
+----+---+----+----+

df_col
+-----------+-----------+
|col_current|col_updated|
+-----------+-----------+
|         id|     Row_id|
|       name|       Name|
|       code|   Row_code|
|       Work|  Work_Code|
+-----------+-----------+

If a column of df matches a col_current value, that column should be renamed to the corresponding col_updated value.
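Since df_col is small it can be collected to the driver and applied with withColumnRenamed; the lookup is made case-insensitive here because df has 'work' while df_col lists 'Work'. A sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("ASD", 101, "John", "DEV"), ("klj", 102, "ben", "prod")],
    ["code", "id", "name", "work"],
)
df_col = spark.createDataFrame(
    [("id", "Row_id"), ("name", "Name"), ("code", "Row_code"), ("Work", "Work_Code")],
    ["col_current", "col_updated"],
)

# Collect the small lookup table into a dict keyed by lower-cased column name.
mapping = {r["col_current"].lower(): r["col_updated"] for r in df_col.collect()}

renamed = df
for old in df.columns:
    if old.lower() in mapping:
        renamed = renamed.withColumnRenamed(old, mapping[old.lower()])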

converting user-item-rating list to user-item-matrix with pyspark

痞子三分冷 submitted on 2019-12-11 08:47:14

Question: This is how the user-item-rating list looks as a pandas dataframe:

  item_id  rating user_id
0 aaaaaaa       5       X
1 bbbbbbb       2       Y
2 ccccccc       5       Z
3 ddddddd       1       T

This is how I create the user-item matrix in pandas, and it only takes a couple of seconds with the real dataset (about 500k rows):

user_item_matrix = df.pivot(index = 'user_id', columns ='item_id', values = 'rating')

item_id  aaaaaaa  bbbbbbb  ccccccc  ddddddd
user_id
T            NaN      NaN      NaN      1.0
X            5.0      NaN      NaN      NaN
Y            NaN      2.0      NaN      NaN
Z            NaN      NaN      5.0      NaN

I am trying this approach
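The pandas pivot has a direct Spark counterpart in groupBy().pivot(); a sketch using the sample rows:

from pyspark.sql import SparkSession
from pyspark.sql.functions import first

spark = SparkSession.builder.getOrCreate()

ratings = spark.createDataFrame(
    [("aaaaaaa", 5, "X"), ("bbbbbbb", 2, "Y"), ("ccccccc", 5, "Z"), ("ddddddd", 1, "T")],
    ["item_id", "rating", "user_id"],
)

# One row per user, one column per item; missing ratings come back as null.
user_item_matrix = ratings.groupBy("user_id").pivot("item_id").agg(first("rating"))

With a real catalogue, the number of distinct item_id values may exceed spark.sql.pivotMaxValues (10000 by default); passing the explicit list of items to pivot() avoids that limit and an extra distinct scan.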

How to extract array element from PySpark dataframe conditioned on different column?

你说的曾经没有我的故事 submitted on 2019-12-11 08:13:39

Question: I have the following PySpark input Dataframe:

+-------+------------+
| index | valuelist  |
+-------+------------+
| 1.0   | [10,20,30] |
| 2.0   | [11,21,31] |
| 0.0   | [14,12,15] |
+-------+------------+

Where:
index: type Double
valuelist: type Vector (it's NOT Array)

From the above input Dataframe, I want to get the following output Dataframe in PySpark:

+-------+-------+
| index | value |
+-------+-------+
| 1.0   | 20    |
| 2.0   | 31    |
| 0.0   | 14    |
+-------+-------+

Logic: for each row: value =
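The Logic line is cut off, but the sample output suggests each row's value is the valuelist element at position index. Assuming that, a small UDF does the job, since ML Vector columns cannot be indexed with getItem(). A sketch:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import DoubleType
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1.0, Vectors.dense([10, 20, 30])),
     (2.0, Vectors.dense([11, 21, 31])),
     (0.0, Vectors.dense([14, 12, 15]))],
    ["index", "valuelist"],
)

# Pull out the vector element whose position equals the row's index value.
pick = udf(lambda idx, vec: float(vec[int(idx)]), DoubleType())
out = df.select("index", pick(col("index"), col("valuelist")).alias("value"))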

How to remove automatically added back ticks while using explode() in pyspark?

拥有回忆 submitted on 2019-12-11 07:27:56

Question: I want to add a new column with an expression as defined here (https://www.mien.in/2018/03/25/reshaping-dataframe-using-pivot-and-melt-in-apache-spark-and-pandas/#pivot-in-spark). While doing so, my explode() function changes the column names it looks up by adding back ticks (" ` ") at the beginning and at the end of each column, which then gives the error:

Cannot resolve column name `Column_name` from [Column_name, Column_name2]

I tried reading the documentation and a few other questions on SO
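For context, a minimal melt() sketch in the spirit of the linked recipe (not the asker's own code): it refers to columns through col() and the intermediate DataFrame rather than interpolating names into SQL strings, which is one way the back-tick resolution error can be avoided, assuming the column names contain no characters Spark needs to escape.

from pyspark.sql.functions import array, col, explode, lit, struct

def melt(df, id_vars, value_vars, var_name="variable", value_name="value"):
    # One struct per value column: (column name, column value).
    vars_and_vals = array(*(
        struct(lit(c).alias(var_name), col(c).alias(value_name))
        for c in value_vars
    ))
    # Explode the array so each (name, value) pair becomes its own row.
    tmp = df.withColumn("_vars_and_vals", explode(vars_and_vals))
    cols = list(id_vars) + [
        tmp["_vars_and_vals"][x].alias(x) for x in (var_name, value_name)
    ]
    return tmp.select(*cols)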

PySpark trying to apply previous field's schema to next field

眉间皱痕 submitted on 2019-12-11 07:05:54

Question: Having a weird issue with PySpark. It seems to be trying to apply the schema for the previous field to the next field as it's processing. Simplest test case I could come up with:

%pyspark
from pyspark.sql.types import (
    DateType,
    StructType,
    StructField,
    StringType,
)
from datetime import date
from pyspark.sql import Row

schema = StructType(
    [
        StructField("date", DateType(), True),
        StructField("country", StringType(), True),
    ]
)
test = spark.createDataFrame(
    [
        Row(
            date=date(2019, 1, 1),
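The test case is cut off, but a likely cause (an assumption, not stated in the question) is that in PySpark 2.x Row(**kwargs) sorts its fields alphabetically, so the data arrives as (country, date) while the schema declares (date, country), and each value is validated against the neighbouring field's type. Passing plain tuples in schema order sidesteps this; "US" below is a made-up value.

# Tuples keep the positional order that the schema declares.
test = spark.createDataFrame(
    [(date(2019, 1, 1), "US")],
    schema=schema,
)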