apache-spark

In Spark, is there any alternative to the union() function when appending new rows?

混江龙づ霸主 submitted on 2021-02-18 08:40:35
Question: In my code, table_df has some columns on which I am doing calculations like min, max, mean, etc., and I want to create new_df with a specified schema new_df_schema. In my logic, I have written Spark SQL for the calculations and append each newly generated row to an initially empty new_df; at the end this results in new_df holding the calculated values for all columns. The problem is that when there are many columns, this leads to performance issues. Can this be done without using the union() function?
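A minimal PySpark sketch of one union-free alternative (the sample data and column names below are assumptions, not the asker's table_df): build every aggregate expression up front and evaluate them in a single agg() call, which produces one wide row instead of unioning one row per calculation.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for the asker's table_df
table_df = spark.createDataFrame([(1, 10.0), (2, 20.0), (3, 30.0)], ["a", "b"])

numeric_cols = ["a", "b"]
aggs = []
for c in numeric_cols:
    # Collect all expressions first, so Spark computes them in one pass
    aggs += [F.min(c).alias("min_" + c),
             F.max(c).alias("max_" + c),
             F.mean(c).alias("mean_" + c)]

stats_df = table_df.agg(*aggs)  # one wide row, no union() needed
stats_df.show()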

How to avoid multiple window functions in an expression in PySpark

拥有回忆 submitted on 2021-02-18 07:55:48
Question: I want Spark to avoid creating two separate window stages for the same window object used twice in my code. How can I use it once in my code in the following example, and tell Spark to do the sum and the division under a single window? df = df.withColumn("colum_c", f.sum(f.col("colum_a")).over(window) / f.sum(f.col("colum_b")).over(window)) Example: days = lambda i: (i - 1) * 86400 window = ( Window() .partitionBy(f.col("account_id")) .orderBy(f.col("event_date").cast("timestamp").cast("long"))
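A hedged sketch of how this is usually checked (the sample data and the simplified window spec are my assumptions, not the asker's DataFrame): when both aggregates share the identical Window spec, Catalyst can generally evaluate them in one Window operator, and df.explain() shows whether a single Window node carries both sums.

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data standing in for the asker's DataFrame
df = spark.createDataFrame(
    [("a1", "2021-01-01", 1.0, 2.0), ("a1", "2021-01-02", 3.0, 4.0)],
    ["account_id", "event_date", "colum_a", "colum_b"],
)

window = (
    Window()
    .partitionBy(f.col("account_id"))
    .orderBy(f.col("event_date").cast("timestamp").cast("long"))
)

df = df.withColumn(
    "colum_c",
    f.sum(f.col("colum_a")).over(window) / f.sum(f.col("colum_b")).over(window),
)
df.explain()  # inspect the plan: both sums should appear under one Window node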

Calculate a grouped median in pyspark

倖福魔咒の submitted on 2021-02-18 07:55:36
Question: When using PySpark, I'd like to be able to calculate the difference between grouped values and their median for the group. Is this possible? Here is some code I hacked up that does what I want, except that it calculates the grouped diff from the mean. Also, please feel free to comment on how I could make this better if you feel like being helpful :) from pyspark import SparkContext from pyspark.sql import SparkSession from pyspark.sql.types import ( StringType, LongType, DoubleType, StructField,
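A hedged sketch of one way to do this (the group and value column names are assumptions, not the asker's schema): compute each group's median with the SQL percentile_approx function, then join it back so the per-row difference from the group median falls out directly.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data with a group key and a numeric value
df = spark.createDataFrame(
    [("a", 1.0), ("a", 3.0), ("a", 10.0), ("b", 2.0), ("b", 4.0)],
    ["group", "value"],
)

# Approximate median per group
medians = df.groupBy("group").agg(
    F.expr("percentile_approx(value, 0.5)").alias("group_median")
)

# Join the median back and compute each row's difference from it
result = df.join(medians, on="group").withColumn(
    "diff_from_median", F.col("value") - F.col("group_median")
)
result.show()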

Spark shuffle spill metrics

故事扮演 submitted on 2021-02-18 06:54:09
Question: Running jobs on a Spark 2.3 cluster, I noted in the Spark web UI that spill occurs for some tasks. I understand that on the reduce side, the reducer fetches the needed partitions (shuffle read), then performs the reduce computation using the execution memory of the executor. As there was not enough execution memory, some data was spilled. My questions: Am I correct? Where is the data spilled? The Spark web UI states some data is spilled to memory (shuffle spilled (memory)), but nothing is spilled
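A hedged sketch for inspecting the same numbers outside the UI (the application id is hypothetical, and the field names follow the Spark monitoring REST API as I understand it): each task's memoryBytesSpilled and diskBytesSpilled can be read from the /api/v1 endpoints that back the web UI.

import requests

base = "http://localhost:4040/api/v1"    # driver UI of a running application
app_id = "app-20210218000000-0000"       # hypothetical application id

for stage in requests.get(f"{base}/applications/{app_id}/stages").json():
    sid, attempt = stage["stageId"], stage["attemptId"]
    tasks = requests.get(
        f"{base}/applications/{app_id}/stages/{sid}/{attempt}/taskList"
    ).json()
    for t in tasks:
        m = t.get("taskMetrics", {})
        # Spill accounted in memory vs. actually written to disk, per task
        print(sid, t["taskId"],
              m.get("memoryBytesSpilled"), m.get("diskBytesSpilled"))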

How to use Plotly with Zeppelin

主宰稳场 submitted on 2021-02-18 00:53:51
Question: I've seen zeppelin-plotly but it seems too complicated. The other thing that worries me is that it involves modifying Zeppelin's .war file, and I don't want to break things by mistake. Is there another way to use Plotly with Zeppelin? Answer 1: Figured it out using the %angular interpreter feature. Here are the full steps to get it working. 1: Install plotly (if you haven't): %sh pip install plotly (you can also do this in the terminal if you have access to it). 2: Define a plot function def plot(plot
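A hedged sketch of a plot function in the spirit of that answer (the body below is my assumption, not the answer's exact code, and include_plotlyjs behavior varies across Plotly versions): Plotly renders the figure to an HTML fragment, and Zeppelin displays it when the printed output starts with %angular.

import plotly.offline as py
import plotly.graph_objs as go

def plot(figure):
    # output_type="div" returns the chart as an HTML fragment instead of writing a file
    html = py.plot(figure, output_type="div", include_plotlyjs=True)
    # A leading "%angular" tells Zeppelin's display system to render the rest as HTML
    print("%angular " + html)

fig = go.Figure(data=[go.Scatter(x=[1, 2, 3], y=[4, 1, 7])])
plot(fig)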

Spark tasks block randomly on standalone cluster

拟墨画扇 submitted on 2021-02-17 20:58:32
Question: We have a quite complex application that runs on Spark Standalone. In some cases the tasks from one of the workers block randomly for an infinite amount of time in the RUNNING state. Extra info: there aren't any errors in the logs; I ran with the logger in debug and didn't see any relevant messages (I see when the task starts, but then there is no activity for it); the jobs work fine if I have only 1 worker; the same job may execute a second time without any issues, in a proper

Should we parallelize a DataFrame like we parallelize a Seq before training

不羁的心 submitted on 2021-02-17 15:36:40
Question: Consider the code given here, https://spark.apache.org/docs/1.2.0/ml-guide.html import org.apache.spark.ml.classification.LogisticRegression val training = sparkContext.parallelize(Seq( LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)), LabeledPoint(0.0, Vectors.dense(2.0, 1.0, -1.0)), LabeledPoint(0.0, Vectors.dense(2.0, 1.3, 1.0)), LabeledPoint(1.0, Vectors.dense(0.0, 1.2, -0.5)))) val lr = new LogisticRegression() lr.setMaxIter(10).setRegParam(0.01) val model1 = lr.fit(training) Assuming we
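For contrast, a hedged PySpark sketch of the DataFrame path (an analogue I'm supplying, not code from the question): a DataFrame built through the SparkSession is already distributed across the cluster, so it can be passed to the estimator directly with no extra parallelize step.

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# Same toy data as the Scala snippet, expressed as a DataFrame
training = spark.createDataFrame(
    [
        (1.0, Vectors.dense(0.0, 1.1, 0.1)),
        (0.0, Vectors.dense(2.0, 1.0, -1.0)),
        (0.0, Vectors.dense(2.0, 1.3, 1.0)),
        (1.0, Vectors.dense(0.0, 1.2, -0.5)),
    ],
    ["label", "features"],
)

lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(training)  # the DataFrame is already partitioned across the cluster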

Spark-HBase - GCP template (1/3) - How to locally package the Hortonworks connector?

此生再无相见时 submitted on 2021-02-17 06:30:36
Question: I'm trying to test the Spark-HBase connector in the GCP context and tried to follow [1], which asks to locally package the connector [2] using Maven (I tried Maven 3.6.3) for Spark 2.4, which leads to the following issue. Error "branch-2.4": [ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on project shc-core: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.: NullPointerException -> [Help 1]

In Spark Scala, how to check between adjacent rows in a DataFrame

只谈情不闲聊 submitted on 2021-02-17 05:52:12
Question: How can I check the dates from the adjacent rows (preceding and next) in a DataFrame? This should happen at a key level. I have the following data after sorting on key and dates: source_Df.show()
+-----+--------+------------+------------+
| key | code   | begin_dt   | end_dt     |
+-----+--------+------------+------------+
| 10  | ABC    | 2018-01-01 | 2018-01-08 |
| 10  | BAC    | 2018-01-03 | 2018-01-15 |
| 10  | CAS    | 2018-01-03 | 2018-01-21 |
| 20  | AAA    | 2017-11-12 | 2018-01-03 |
| 20  | DAS    | 2018-01-01 |
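A hedged sketch of the usual approach (shown in PySpark rather than the asker's Scala, with the data retyped from the excerpt): lag and lead over a window partitioned by key and ordered by begin_dt expose the preceding and next rows' dates for comparison.

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
source_df = spark.createDataFrame(
    [(10, "ABC", "2018-01-01", "2018-01-08"),
     (10, "BAC", "2018-01-03", "2018-01-15"),
     (10, "CAS", "2018-01-03", "2018-01-21"),
     (20, "AAA", "2017-11-12", "2018-01-03")],
    ["key", "code", "begin_dt", "end_dt"],
)

w = Window.partitionBy("key").orderBy("begin_dt")
result = (source_df
          .withColumn("prev_end_dt", F.lag("end_dt").over(w))
          .withColumn("next_begin_dt", F.lead("begin_dt").over(w))
          # example check: does this row start before the previous row ends?
          .withColumn("overlaps_prev", F.col("begin_dt") <= F.col("prev_end_dt")))
result.show()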

PySpark - How to get basic stats (mean, min, max) along with quantiles (25%, 50%) for numerical cols in a single DataFrame

此生再无相见时 submitted on 2021-02-17 05:37:26
Question: I have a Spark df: spark_df = spark.createDataFrame( [(1, 7, 'foo'), (2, 6, 'bar'), (3, 4, 'foo'), (4, 8, 'bar'), (5, 1, 'bar') ], ['v1', 'v2', 'id'] ) Expected output:
    id   avg(v1)   avg(v2)  min(v1)  min(v2)  0.25(v1)    0.25(v2)    0.5(v1)     0.5(v2)
0   bar  3.666667  5.0      2        1        some-value  some-value  some-value  some-value
1   foo  2.000000  5.5      1        4.       some-value  some-value  some-value  some-value
Until now, I can achieve the basic stats like avg, min, and max, but I am not able to get the quantiles. I know this can be achieved
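A hedged sketch of one way to get both in a single result (my own combination, not necessarily the asker's eventual answer): build the avg/min aggregates and approximate quantiles as expressions over the same spark_df and run them in one groupBy/agg.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame(
    [(1, 7, 'foo'), (2, 6, 'bar'), (3, 4, 'foo'), (4, 8, 'bar'), (5, 1, 'bar')],
    ['v1', 'v2', 'id'],
)

exprs = []
for c in ['v1', 'v2']:
    exprs += [
        F.avg(c).alias(f"avg({c})"),
        F.min(c).alias(f"min({c})"),
        # percentile_approx gives the approximate 25th and 50th percentiles
        F.expr(f"percentile_approx({c}, 0.25)").alias(f"0.25({c})"),
        F.expr(f"percentile_approx({c}, 0.5)").alias(f"0.5({c})"),
    ]

spark_df.groupBy("id").agg(*exprs).show()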