spark-dataframe

Get latest records in a data frame based on timestamp with condition

Submitted by 左心房为你撑大大i on 2020-01-07 04:55:27
Question: My question heading might not be accurate, but I hope I can explain my problem. I have a data frame like the one below (fields are separated by |^| and each row ends with |!|):

DataPartition_1|^|PartitionYear_1|^|TimeStamp|^|OrganizationId|^|AnnualPeriodId|^|InterimPeriodId|^|InterimNumber_1|^|FFAction_1
SelfSourcedPublic|^|2001|^|1510044629598|^|4295858941|^|5|^|21|^|2|^|I|!|
SelfSourcedPublic|^|2002|^|1510044629599|^|4295858941|^|1|^|22|^|2|^|I|!|
SelfSourcedPublic|^|2002|^|1510044629600|^|4295858941|^|1|^|23|^|2|^|I|!|
SelfSourcedPublic|^
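
The post is truncated here. For the "latest record per key" part of the title, a minimal PySpark sketch (not from the original thread) uses row_number over a window ordered by TimeStamp descending; the choice of OrganizationId and InterimPeriodId as the key below is an assumption:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Sample rows shaped like the data shown in the question.
df = spark.createDataFrame(
    [("SelfSourcedPublic", 2001, 1510044629598, 4295858941, 5, 21, 2, "I"),
     ("SelfSourcedPublic", 2002, 1510044629599, 4295858941, 1, 22, 2, "I"),
     ("SelfSourcedPublic", 2002, 1510044629600, 4295858941, 1, 23, 2, "I")],
    ["DataPartition_1", "PartitionYear_1", "TimeStamp", "OrganizationId",
     "AnnualPeriodId", "InterimPeriodId", "InterimNumber_1", "FFAction_1"])

# Rank records within each key by TimeStamp, newest first, and keep only the newest.
w = Window.partitionBy("OrganizationId", "InterimPeriodId").orderBy(F.col("TimeStamp").desc())
latest = (df.withColumn("rn", F.row_number().over(w))
            .filter(F.col("rn") == 1)
            .drop("rn"))
latest.show(truncate=False)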

How to convert json to pyspark dataframe (faster implementation) [duplicate]

Submitted by 99封情书 on 2020-01-07 03:47:06
Question: This question already has answers here: reading json file in pyspark (3 answers). Closed 2 years ago. I have JSON data of the form {'abc':1, 'def':2, 'ghi':3}. How do I convert it into a PySpark DataFrame in Python?

Answer 1:

import json

j = {'abc': 1, 'def': 2, 'ghi': 3}
a = [json.dumps(j)]
jsonRDD = sc.parallelize(a)
df = spark.read.json(jsonRDD)

>>> df.show()
+---+---+---+
|abc|def|ghi|
+---+---+---+
|  1|  2|  3|
+---+---+---+

Source: https://stackoverflow.com/questions/44456076/how-to-convert-json-to-pyspark
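
Since the question asks for a faster implementation, one alternative worth sketching (not part of the original answer) is to skip the JSON round-trip entirely and build the DataFrame with createDataFrame, which avoids the extra serialization and schema-inference pass of spark.read.json for a small dict:

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

j = {'abc': 1, 'def': 2, 'ghi': 3}

# Build the DataFrame directly from a Row; no JSON serialization needed.
df = spark.createDataFrame([Row(**j)])
df.show()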

Calculate links between nodes using Spark

Submitted by a 夏天 on 2020-01-06 05:56:44
Question: I have the following two DataFrames in Spark 2.2 and Scala 2.11. The DataFrame edges defines the edges of a directed graph, while the DataFrame types defines the type of each node.

edges =
+-----+-----+----+
|from |to   |attr|
+-----+-----+----+
|    1|    0|   1|
|    1|    4|   1|
|    2|    2|   1|
|    4|    3|   1|
|    4|    5|   1|
+-----+-----+----+

types =
+------+---------+
|nodeId|type     |
+------+---------+
|     0|        0|
|     1|        0|
|     2|        2|
|     3|        4|
|     4|        4|
|     5|        4|
+------+---------+

For each node, I want to know the number of edges
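
The sentence above is cut off. Assuming the goal is to count, for each source node, its outgoing edges broken down by the type of the destination node (an assumption, since the original text is truncated), the usual pattern is a join followed by a groupBy. The thread itself uses Scala; the sketch below shows the same logic in PySpark:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

edges = spark.createDataFrame(
    [(1, 0, 1), (1, 4, 1), (2, 2, 1), (4, 3, 1), (4, 5, 1)],
    ["from", "to", "attr"])
types = spark.createDataFrame(
    [(0, 0), (1, 0), (2, 2), (3, 4), (4, 4), (5, 4)],
    ["nodeId", "type"])

# Attach the type of each edge's destination node, then count edges per source node and destination type.
counts = (edges
          .join(types, edges["to"] == types["nodeId"])
          .groupBy(F.col("from").alias("node"), F.col("type").alias("toType"))
          .agg(F.count("*").alias("numEdges")))
counts.show()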

Regarding Spark DataFrameReader jdbc

Submitted by ☆樱花仙子☆ on 2020-01-06 04:02:57
Question: I have a question regarding the mechanics of Spark DataFrameReader, and I would appreciate any help. Let me explain the scenario: I am creating a DataFrame from a DStream like this. This is the input data:

var config = new HashMap[String,String]();
config += ("zookeeper.connect" -> zookeeper);
config += ("partition.assignment.strategy" -> "roundrobin");
config += ("bootstrap.servers" -> broker);
config += ("serializer.class" -> "kafka.serializer.DefaultEncoder");
config += ("group.id" ->
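
The excerpt stops before the jdbc part the title refers to. For context only, here is a minimal PySpark sketch of DataFrameReader's jdbc source with a partitioned read; every connection detail below is a placeholder, not taken from the original post:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical connection details; replace with a real URL, table, and credentials,
# and make sure the JDBC driver jar is on the classpath.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb")
      .option("dbtable", "public.events")
      .option("user", "spark")
      .option("password", "secret")
      # Partitioned read: Spark issues one query per partition over the id range.
      .option("partitionColumn", "id")
      .option("lowerBound", "1")
      .option("upperBound", "1000000")
      .option("numPartitions", "8")
      .load())
df.show(5)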

Spark Hadoop Failed to get broadcast

Submitted by 别等时光非礼了梦想. on 2020-01-04 06:15:16
Question: I am running a spark-submit job and receiving a "Failed to get broadcast_58_piece0..." error. I'm really not sure what I'm doing wrong. Am I overusing UDFs? Is the function too complicated? As a summary of my objective: I am parsing text from PDFs, which are stored as base64-encoded strings in JSON objects. I'm using Apache Tika to get the text, and I am trying to make copious use of data frames to make things easier. I had written a piece of code that ran the text extraction through Tika as a function
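
The post is cut off before the code. Purely for illustration, a UDF of the kind being described (base64-encoded PDF in, extracted text out) might be structured as below; this sketch uses the tika-python package, which is an assumption on our part and not necessarily what the author used:

import base64

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

def pdf_to_text(b64_pdf):
    # Import inside the function so the dependency is resolved on the executors.
    from tika import parser
    raw = base64.b64decode(b64_pdf)      # base64 string -> raw PDF bytes
    parsed = parser.from_buffer(raw)     # hand the bytes to a local Tika server
    return parsed.get("content")

extract_text = F.udf(pdf_to_text, StringType())

# Hypothetical input: JSON objects with one base64-encoded PDF per row in column "pdf_b64".
docs = spark.read.json("docs.json")
docs.withColumn("text", extract_text(F.col("pdf_b64"))).show(truncate=False)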

How to add suffix and prefix to all columns in python/pyspark dataframe

Submitted by 匆匆过客 on 2020-01-04 05:14:49
Question: I have a data frame in PySpark with more than 100 columns. For all the column names, I would like to add backticks (`) at the start and end of each column name. For example, if the column name is testing user, I want `testing user`. Is there a method to do this in PySpark/Python? When we apply the code, it should return a data frame.

Answer 1: You can use the withColumnRenamed method of the dataframe in combination with na to create a new dataframe: df.na.withColumnRenamed('testing
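
The answer is cut off above, and the visible df.na.withColumnRenamed fragment looks doubtful, since df.na exposes only null-handling methods. A simpler sketch (not from the original thread) that wraps every column name in backticks in one pass uses toDF:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical DataFrame with space-containing column names.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["testing user", "order id"])

# Rebuild the DataFrame with every column name wrapped in backticks.
wrapped = df.toDF(*["`{}`".format(c) for c in df.columns])
wrapped.printSchema()

The same pattern works for any prefix or suffix, which is what the question title asks about.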

How to add a map column in Spark based on another column?

Submitted by 自古美人都是妖i on 2020-01-03 19:34:10
Question: I have this table:

|Name|Val|
|----|---|
|Bob |1  |
|Marl|3  |

And I want to transform it to a map with a single element, like this:

|Name|Val|MapVal|
|----|---|------|
|Bob |1  |(0->1)|
|Marl|3  |(0->3)|

Any idea how to do it in Scala? I couldn't find any way to build a map in a withColumn statement...

Answer 1: Found it - you just need to import the Spark SQL functions:

import org.apache.spark.sql.functions._

And then use the map function:

df.withColumn("MapVal", map(lit(0), col("Val")))

Source: https://stackoverflow
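
For reference only (not part of the original thread), the equivalent in PySpark is create_map from pyspark.sql.functions:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("Bob", 1), ("Marl", 3)], ["Name", "Val"])

# create_map builds a MapType column from alternating key/value expressions.
df.withColumn("MapVal", F.create_map(F.lit(0), F.col("Val"))).show()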

Spark Structured Streaming - using different windows for different GroupBy keys

Submitted by 空扰寡人 on 2020-01-02 12:02:53
Question: Currently I have the following table after reading from a Kafka topic via Spark Structured Streaming:

key,timestamp,value
-----------------------------------
key1,2017-11-14 07:50:00+0000,10
key1,2017-11-14 07:50:10+0000,10
key1,2017-11-14 07:51:00+0000,10
key1,2017-11-14 07:51:10+0000,10
key1,2017-11-14 07:52:00+0000,10
key1,2017-11-14 07:52:10+0000,10
key2,2017-11-14 07:50:00+0000,10
key2,2017-11-14 07:51:00+0000,10
key2,2017-11-14 07:52:10+0000,10
key2,2017-11-14 07:53:00+0000,10

I would like
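
The question is truncated, but the title asks for different window lengths per grouping key. A single groupBy(window(...)) uses one fixed window spec, so one workaround, sketched here on our own rather than taken from the thread, is to filter the stream per key and run each branch as its own query with its own window; the window lengths and the rate source used to keep the example self-contained are assumptions:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Simulate the stream with the rate source; in the real job this would come from Kafka.
events = (spark.readStream.format("rate").option("rowsPerSecond", 5).load()
          .select(F.when(F.col("value") % 2 == 0, "key1").otherwise("key2").alias("key"),
                  F.col("timestamp"),
                  F.lit(10).alias("value")))

# One-minute windows for key1, two-minute windows for key2.
key1_agg = (events.filter(F.col("key") == "key1")
            .groupBy("key", F.window("timestamp", "1 minute"))
            .agg(F.sum("value").alias("total")))
key2_agg = (events.filter(F.col("key") == "key2")
            .groupBy("key", F.window("timestamp", "2 minutes"))
            .agg(F.sum("value").alias("total")))

# Each branch runs as its own streaming query, so each keeps its own window length.
q1 = key1_agg.writeStream.outputMode("complete").format("console").start()
q2 = key2_agg.writeStream.outputMode("complete").format("console").start()
spark.streams.awaitAnyTermination()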

Split column of list into multiple columns in the same PySpark dataframe

Submitted by 只愿长相守 on 2020-01-02 07:19:47
Question: I have the following dataframe which contains 2 columns: the 1st column has column names, and the 2nd column has a list of values.

+--------------------+--------------------+
|              Column|            Quantile|
+--------------------+--------------------+
|                rent|[4000.0, 4500.0, ...|
|     is_rent_changed|[0.0, 0.0, 0.0, 0...|
|               phone|[7.022372888E9, 7...|
|          Area_house|[1000.0, 1000.0, ...|
|       bedroom_count|[1.0, 1.0, 1.0, 1...|
|      bathroom_count|[1.0, 1.0, 1.0, 1...|
|    maintenance_cost|[0.0, 0.0, 0.0, 0...|
|            latitude|[12
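
The excerpt ends here, but the title asks how to split the list column into separate columns. The usual approach is to select each array element by index with getItem; the element count and output column names below are made up for illustration:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data shaped like the table above: a name column and an array-of-doubles column.
df = spark.createDataFrame(
    [("rent", [4000.0, 4500.0, 5000.0]),
     ("bedroom_count", [1.0, 1.0, 2.0])],
    ["Column", "Quantile"])

n = 3  # number of quantiles per row (assumed)
split_df = df.select(
    F.col("Column"),
    *[F.col("Quantile").getItem(i).alias("Quantile_{}".format(i)) for i in range(n)])
split_df.show(truncate=False)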