apache-spark

In Spark, is there any alternative to the union() function when appending new rows?

混江龙づ霸主 submitted on 2021-02-18 08:40:35
Question: In my code, table_df has some columns on which I am doing calculations like min, max, mean, etc., and I want to create new_df with a specified schema new_df_schema. In my logic, I have written Spark SQL for the calculations and append each newly generated row to an initially empty new_df; at the end this results in new_df holding the calculated values for all columns. The problem is that when there are many columns, this leads to performance issues. Can this be done without using the union() function?
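A minimal PySpark sketch of one union-free alternative (the sample data and column names below are assumptions, not the asker's table_df): build every aggregate expression up front and evaluate them in a single agg() call, which produces one wide row instead of unioning one row per calculation.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for the asker's table_df
table_df = spark.createDataFrame([(1, 10.0), (2, 20.0), (3, 30.0)], ["a", "b"])

numeric_cols = ["a", "b"]
aggs = []
for c in numeric_cols:
    # Collect all expressions first, so Spark computes them in one pass
    aggs += [F.min(c).alias("min_" + c),
             F.max(c).alias("max_" + c),
             F.mean(c).alias("mean_" + c)]

stats_df = table_df.agg(*aggs)  # one wide row, no union() needed
stats_df.show()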

How to avoid multiple window functions in an expression in PySpark

拥有回忆 submitted on 2021-02-18 07:55:48
Question: I want Spark to avoid creating two separate window stages for the same window object used twice in my code. How can I use it once in my code in the following example, and tell Spark to do the sum and the division under a single window? df = df.withColumn("colum_c", f.sum(f.col("colum_a")).over(window) / f.sum(f.col("colum_b")).over(window)) Example: days = lambda i: (i - 1) * 86400 window = ( Window() .partitionBy(f.col("account_id")) .orderBy(f.col("event_date").cast("timestamp").cast("long"))
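A hedged sketch of how this is usually checked (the sample data and the simplified window spec are my assumptions, not the asker's DataFrame): when both aggregates share the identical Window spec, Catalyst can generally evaluate them in one Window operator, and df.explain() shows whether a single Window node carries both sums.

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data standing in for the asker's DataFrame
df = spark.createDataFrame(
    [("a1", "2021-01-01", 1.0, 2.0), ("a1", "2021-01-02", 3.0, 4.0)],
    ["account_id", "event_date", "colum_a", "colum_b"],
)

window = (
    Window()
    .partitionBy(f.col("account_id"))
    .orderBy(f.col("event_date").cast("timestamp").cast("long"))
)

df = df.withColumn(
    "colum_c",
    f.sum(f.col("colum_a")).over(window) / f.sum(f.col("colum_b")).over(window),
)
df.explain()  # inspect the plan: both sums should appear under one Window node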

Calculate a grouped median in pyspark

倖福魔咒の submitted on 2021-02-18 07:55:36
Question: When using PySpark, I'd like to be able to calculate the difference between grouped values and their median for the group. Is this possible? Here is some code I hacked up that does what I want, except that it calculates the grouped diff from the mean. Also, please feel free to comment on how I could make this better if you feel like being helpful :) from pyspark import SparkContext from pyspark.sql import SparkSession from pyspark.sql.types import ( StringType, LongType, DoubleType, StructField,
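A hedged sketch of one way to do this (the group and value column names are assumptions, not the asker's schema): compute each group's median with the SQL percentile_approx function, then join it back so the per-row difference from the group median falls out directly.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data with a group key and a numeric value
df = spark.createDataFrame(
    [("a", 1.0), ("a", 3.0), ("a", 10.0), ("b", 2.0), ("b", 4.0)],
    ["group", "value"],
)

# Approximate median per group
medians = df.groupBy("group").agg(
    F.expr("percentile_approx(value, 0.5)").alias("group_median")
)

# Join the median back and compute each row's difference from it
result = df.join(medians, on="group").withColumn(
    "diff_from_median", F.col("value") - F.col("group_median")
)
result.show()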

Spark shuffle spill metrics

故事扮演 submitted on 2021-02-18 06:54:09
Question: Running jobs on a Spark 2.3 cluster, I noted in the Spark web UI that spill occurs for some tasks. I understand that on the reduce side, the reducer fetches the needed partitions (shuffle read), then performs the reduce computation using the execution memory of the executor. As there was not enough execution memory, some data was spilled. My questions: Am I correct? Where is the data spilled? The Spark web UI states some data is spilled to memory (shuffle spilled (memory)), but nothing is spilled
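A hedged sketch for inspecting the same numbers outside the UI (the application id is hypothetical, and the field names follow the Spark monitoring REST API as I understand it): each task's memoryBytesSpilled and diskBytesSpilled can be read from the /api/v1 endpoints that back the web UI.

import requests

base = "http://localhost:4040/api/v1"    # driver UI of a running application
app_id = "app-20210218000000-0000"       # hypothetical application id

for stage in requests.get(f"{base}/applications/{app_id}/stages").json():
    sid, attempt = stage["stageId"], stage["attemptId"]
    tasks = requests.get(
        f"{base}/applications/{app_id}/stages/{sid}/{attempt}/taskList"
    ).json()
    for t in tasks:
        m = t.get("taskMetrics", {})
        # Spill accounted in memory vs. actually written to disk, per task
        print(sid, t["taskId"],
              m.get("memoryBytesSpilled"), m.get("diskBytesSpilled"))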

How to use Plotly with Zeppelin

主宰稳场 submitted on 2021-02-18 00:53:51
Question: I've seen zeppelin-plotly but it seems too complicated. The other thing that worries me is that it involves modifying Zeppelin's .war file, and I don't want to break things by mistake. Is there another way to use Plotly with Zeppelin? Answer 1: Figured it out using the %angular interpreter feature. Here are the full steps to get it working. 1: Install plotly (if you haven't): %sh pip install plotly (you can also do this in the terminal if you have access to it). 2: Define a plot function def plot(plot
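A hedged sketch of a plot function in the spirit of that answer (the body below is my assumption, not the answer's exact code, and include_plotlyjs behavior varies across Plotly versions): Plotly renders the figure to an HTML fragment, and Zeppelin displays it when the printed output starts with %angular.

import plotly.offline as py
import plotly.graph_objs as go

def plot(figure):
    # output_type="div" returns the chart as an HTML fragment instead of writing a file
    html = py.plot(figure, output_type="div", include_plotlyjs=True)
    # A leading "%angular" tells Zeppelin's display system to render the rest as HTML
    print("%angular " + html)

fig = go.Figure(data=[go.Scatter(x=[1, 2, 3], y=[4, 1, 7])])
plot(fig)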

Spark tasks block randomly on standalone cluster

拟墨画扇 submitted on 2021-02-17 20:58:32
Question: We have a quite complex application that runs on Spark Standalone. In some cases the tasks from one of the workers block randomly for an infinite amount of time in the RUNNING state. Extra info: there aren't any errors in the logs; I ran with the logger in debug and didn't see any relevant messages (I see when the task starts, but then there is no activity for it); the jobs work fine if I have only 1 worker; the same job may execute a second time without any issues, in a proper

Should we parallelize a DataFrame like we parallelize a Seq before training

不羁的心 submitted on 2021-02-17 15:36:40
Question: Consider the code given here, https://spark.apache.org/docs/1.2.0/ml-guide.html import org.apache.spark.ml.classification.LogisticRegression val training = sparkContext.parallelize(Seq( LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)), LabeledPoint(0.0, Vectors.dense(2.0, 1.0, -1.0)), LabeledPoint(0.0, Vectors.dense(2.0, 1.3, 1.0)), LabeledPoint(1.0, Vectors.dense(0.0, 1.2, -0.5)))) val lr = new LogisticRegression() lr.setMaxIter(10).setRegParam(0.01) val model1 = lr.fit(training) Assuming we
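For contrast, a hedged PySpark sketch of the DataFrame path (an analogue I'm supplying, not code from the question): a DataFrame built through the SparkSession is already distributed across the cluster, so it can be passed to the estimator directly with no extra parallelize step.

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# Same toy data as the Scala snippet, expressed as a DataFrame
training = spark.createDataFrame(
    [
        (1.0, Vectors.dense(0.0, 1.1, 0.1)),
        (0.0, Vectors.dense(2.0, 1.0, -1.0)),
        (0.0, Vectors.dense(2.0, 1.3, 1.0)),
        (1.0, Vectors.dense(0.0, 1.2, -0.5)),
    ],
    ["label", "features"],
)

lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(training)  # the DataFrame is already partitioned across the cluster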

Spark-HBase - GCP template (1/3) - How to locally package the Hortonworks connector?

此生再无相见时 submitted on 2021-02-17 06:30:36
Question: I'm trying to test the Spark-HBase connector in the GCP context and tried to follow [1], which asks to locally package the connector [2] using Maven (I tried Maven 3.6.3) for Spark 2.4, which leads to the following issue. Error "branch-2.4": [ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on project shc-core: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.: NullPointerException -> [Help 1]

In Spark Scala, how to check between adjacent rows in a DataFrame

只谈情不闲聊 submitted on 2021-02-17 05:52:12
Question: How can I check the dates from the adjacent rows (preceding and next) in a DataFrame? This should happen at a key level. I have the following data after sorting on key and dates: source_Df.show()
+-----+--------+------------+------------+
| key | code   | begin_dt   | end_dt     |
+-----+--------+------------+------------+
| 10  | ABC    | 2018-01-01 | 2018-01-08 |
| 10  | BAC    | 2018-01-03 | 2018-01-15 |
| 10  | CAS    | 2018-01-03 | 2018-01-21 |
| 20  | AAA    | 2017-11-12 | 2018-01-03 |
| 20  | DAS    | 2018-01-01 |
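A hedged sketch of the usual approach (shown in PySpark rather than the asker's Scala, with the data retyped from the excerpt): lag and lead over a window partitioned by key and ordered by begin_dt expose the preceding and next rows' dates for comparison.

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
source_df = spark.createDataFrame(
    [(10, "ABC", "2018-01-01", "2018-01-08"),
     (10, "BAC", "2018-01-03", "2018-01-15"),
     (10, "CAS", "2018-01-03", "2018-01-21"),
     (20, "AAA", "2017-11-12", "2018-01-03")],
    ["key", "code", "begin_dt", "end_dt"],
)

w = Window.partitionBy("key").orderBy("begin_dt")
result = (source_df
          .withColumn("prev_end_dt", F.lag("end_dt").over(w))
          .withColumn("next_begin_dt", F.lead("begin_dt").over(w))
          # example check: does this row start before the previous row ends?
          .withColumn("overlaps_prev", F.col("begin_dt") <= F.col("prev_end_dt")))
result.show()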

PySpark - How to get basic stats (mean, min, max) along with quantiles (25%, 50%) for numerical cols in a single DataFrame

此生再无相见时 submitted on 2021-02-17 05:37:26
Question: I have a Spark df: spark_df = spark.createDataFrame( [(1, 7, 'foo'), (2, 6, 'bar'), (3, 4, 'foo'), (4, 8, 'bar'), (5, 1, 'bar') ], ['v1', 'v2', 'id'] ) Expected output:
    id   avg(v1)   avg(v2)  min(v1)  min(v2)  0.25(v1)    0.25(v2)    0.5(v1)     0.5(v2)
0   bar  3.666667  5.0      2        1        some-value  some-value  some-value  some-value
1   foo  2.000000  5.5      1        4.       some-value  some-value  some-value  some-value
Until now, I can achieve the basic stats like avg, min, and max, but I am not able to get the quantiles. I know this can be achieved
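A hedged sketch of one way to get both in a single result (my own combination, not necessarily the asker's eventual answer): build the avg/min aggregates and approximate quantiles as expressions over the same spark_df and run them in one groupBy/agg.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame(
    [(1, 7, 'foo'), (2, 6, 'bar'), (3, 4, 'foo'), (4, 8, 'bar'), (5, 1, 'bar')],
    ['v1', 'v2', 'id'],
)

exprs = []
for c in ['v1', 'v2']:
    exprs += [
        F.avg(c).alias(f"avg({c})"),
        F.min(c).alias(f"min({c})"),
        # percentile_approx gives the approximate 25th and 50th percentiles
        F.expr(f"percentile_approx({c}, 0.25)").alias(f"0.25({c})"),
        F.expr(f"percentile_approx({c}, 0.5)").alias(f"0.5({c})"),
    ]

spark_df.groupBy("id").agg(*exprs).show()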