apache-spark-standalone

Continuously INFO JobScheduler:59 - Added jobs for time *** ms in my Spark Standalone Cluster

帅比萌擦擦* submitted on 2019-12-19 09:58:15
Question: We are working with a Spark Standalone cluster of 3 nodes, each with 8 cores and 32 GB RAM, all with the same configuration. Sometimes a streaming batch completes in less than 1 second; sometimes it takes more than 10 seconds, and when that happens the log below appears in the console.

2016-03-29 11:35:25,044 INFO TaskSchedulerImpl:59 - Removed TaskSet 18.0, whose tasks have all completed, from pool
2016-03-29 11:35:25,044 INFO DAGScheduler:59 - Job 18 finished: foreachRDD at EventProcessor.java:87, took 1.128755 s
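
The asker's EventProcessor.java is not shown; below is a rough Scala sketch of the kind of streaming job that produces these messages (the socket source, host, port, and 1-second batch interval are assumptions, not the asker's setup). JobScheduler prints "Added jobs for time ... ms" once per batch interval, so batches whose processing time exceeds the interval simply queue up behind it:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object EventProcessor {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("EventProcessor")
        // One batch is scheduled every second, whether or not the
        // previous batch has finished processing.
        val ssc = new StreamingContext(conf, Seconds(1))

        val events = ssc.socketTextStream("localhost", 9999) // placeholder source
        events.foreachRDD { rdd =>
          // Work done here determines the batch's processing time; when it
          // exceeds the batch interval, scheduling delay builds up.
          println(s"events in batch: ${rdd.count()}")
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }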

Spark web UI unreachable

萝らか妹 submitted on 2019-12-11 11:46:31
Question: I have installed Spark 2.0.0 on 12 nodes (in standalone cluster mode). When I launch it I get this:

./sbin/start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /home/mName/fer/spark-2.0.0-bin-hadoop2.7/logs/spark-mName-org.apache.spark.deploy.master.Master-1-ibnb25.out
localhost192.17.0.17: ssh: Could not resolve hostname localhost192.17.0.17: Name or service not known
192.17.0.20: starting org.apache.spark.deploy.worker.Worker, logging to /home/mbala/fer/spark-2.0.0-bin
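
A plausible cause, assuming the default setup: start-all.sh reads conf/slaves and ssh-es into each host listed there, and the run-together name "localhost192.17.0.17" suggests two entries ended up on one line. The file needs exactly one hostname or IP per line:

    # conf/slaves -- read by sbin/start-all.sh; one worker host per line.
    # A missing line break yields run-together names like "localhost192.17.0.17".
    localhost
    192.17.0.17
    192.17.0.20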

Spark workers stopped after driver commanded a shutdown

社会主义新天地 submitted on 2019-12-10 06:39:55
Question: Basically, the master node also acts as one of the slaves. Once the slave on the master completed, it called SparkContext.stop(), and this command propagated to all the slaves, which stopped execution in the middle of processing. Error log on one of the workers:

INFO SparkHadoopMapRedUtil: attempt_201612061001_0008_m_000005_18112: Committed
INFO Executor: Finished task 5.0 in stage 8.0 (TID 18112). 2536 bytes result sent to driver
INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
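
The worker log above is the expected effect of the driver stopping the SparkContext: stop() tears down the whole application, not just the local node. A minimal Scala sketch of the pitfall (the job and names are illustrative, not the asker's code):

    import org.apache.spark.{SparkConf, SparkContext}

    object ShutdownDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("ShutdownDemo"))

        // Some distributed work; tasks run on executors across all workers.
        val result = sc.parallelize(1 to 1000000).map(_ * 2).sum()
        println(s"sum = $result")

        // stop() shuts down ALL executors on ALL workers, and each of them
        // logs "Driver commanded a shutdown". Call it only once every job the
        // application needs has finished, never from per-node completion logic.
        sc.stop()
      }
    }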

Spark working faster in Standalone rather than YARN

£可爱£侵袭症+ submitted on 2019-12-01 13:58:17
Question: Wanted some insights on Spark execution in standalone and YARN modes. We have a 4-node Cloudera cluster, and currently the performance of our application when running in YARN mode is less than half of what we get when executing in standalone mode. Does anyone have an idea of the factors which might be contributing to this?

Answer 1: Basically, your data and cluster are too small. Big Data technologies are really meant to handle data that cannot fit on a single system. Given your cluster has 4 nodes, it might be fine for POC work but you should not consider this acceptable for benchmarking your
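
One way to make such a comparison fair is to pin both runs to the same resources through spark-submit; the class name, jar, and sizes below are placeholders:

    # Standalone: executors are sized against the standalone master.
    ./bin/spark-submit --master spark://master-host:7077 \
      --executor-memory 4g --total-executor-cores 8 \
      --class com.example.MyApp myapp.jar

    # YARN: resources are requested from the ResourceManager, which adds
    # container allocation and scheduling overhead that small clusters
    # and short jobs make very visible.
    ./bin/spark-submit --master yarn --deploy-mode cluster \
      --executor-memory 4g --num-executors 2 --executor-cores 4 \
      --class com.example.MyApp myapp.jar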

Spark standalone connection driver to worker

北慕城南 submitted on 2019-11-29 12:50:55
I'm trying to host a Spark standalone cluster locally. I have two heterogeneous machines connected on a LAN. Each piece of the architecture listed below is running on Docker. I have the following configuration:

master on machine 1 (port 7077 exposed)
worker on machine 1
driver on machine 2

I use a test application that opens a file and counts its lines. The application works when the file is replicated on all workers and I use SparkContext.textFile(). But when the file is only present on the worker while I'm using SparkContext.parallelize() to access it on the workers, I have the following display:
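
For reference, a minimal Scala sketch of such a line counter (the path and master URL are placeholders). The key point: textFile() is read by the executors, so the file must exist at the same path on every worker node or on shared storage; parallelize(), by contrast, only distributes a collection that already lives in the driver's memory and cannot reach a file that exists only on a worker:

    import org.apache.spark.{SparkConf, SparkContext}

    object LineCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("LineCount")
          .setMaster("spark://machine1:7077") // placeholder master URL
        val sc = new SparkContext(conf)

        // Executors open this path themselves: it must be readable at the
        // same location on every worker (or live on NFS/HDFS/etc.).
        val count = sc.textFile("/data/input.txt").count()
        println(s"lines: $count")

        sc.stop()
      }
    }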

winutils spark windows installation env_variable

混江龙づ霸主 submitted on 2019-11-28 09:47:45
Question: I am trying to install Spark 1.6.1 on Windows 10, and so far I have done the following:

Downloaded Spark 1.6.1, unpacked it to some directory, and then set SPARK_HOME
Downloaded Scala 2.11.8, unpacked it to some directory, and then set SCALA_HOME
Set the _JAVA_OPTION env variable
Downloaded the winutils from https://github.com/steveloughran/winutils.git by just downloading the zip directory and then set the HADOOP_HOME env variable. (Not sure if this was incorrect, I could not clone the directory
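
For comparison, a typical winutils setup on Windows looks roughly like this (the install paths are placeholders; the important detail is that HADOOP_HOME points at the directory that contains bin\winutils.exe, not at bin itself):

    rem Placeholder install locations -- adjust to the actual unpack directories.
    setx SPARK_HOME "C:\spark-1.6.1-bin-hadoop2.6"
    setx SCALA_HOME "C:\scala-2.11.8"
    rem winutils.exe must end up at %HADOOP_HOME%\bin\winutils.exe
    setx HADOOP_HOME "C:\hadoop"
    rem Then add %SPARK_HOME%\bin and %HADOOP_HOME%\bin to PATH, e.g. via
    rem System Properties > Environment Variables, and open a fresh console.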

What is the relationship between workers, worker instances, and executors?

北城以北 submitted on 2019-11-28 03:03:42
In Spark Standalone mode, there are master and worker nodes. Here are a few questions:

Do 2 worker instances mean one worker node with 2 worker processes?
Does every worker instance hold an executor for a specific application (which manages storage and tasks), or does one worker node hold one executor?
Is there a flow chart explaining the Spark runtime, for example for word count?

I suggest reading the Spark cluster docs first, but even more so this Cloudera blog post explaining these modes. Your first question depends on what you mean by 'instances'. A node is a machine, and there's not a good reason to run more
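
On the first question: in standalone mode the number of worker processes per machine is set in conf/spark-env.sh; the sizes below are illustrative only:

    # conf/spark-env.sh on a worker machine
    export SPARK_WORKER_INSTANCES=2   # two worker processes on this node
    export SPARK_WORKER_CORES=4       # cores offered by EACH worker process
    export SPARK_WORKER_MEMORY=8g     # memory offered by EACH worker process

Executors are a different layer: each application asks the master for executors, and a worker can host executors belonging to several applications at once.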

Which cluster type should I choose for Spark?

余生长醉 submitted on 2019-11-27 16:56:57
I am new to Apache Spark, and I just learned that Spark supports three types of clusters:

Standalone - meaning Spark will manage its own cluster
YARN - using Hadoop's YARN resource manager
Mesos - Apache's dedicated resource manager project

Since I am new to Spark, I think I should try Standalone first. But I wonder which one is recommended. Say, in the future I need to build a large cluster (hundreds of instances); which cluster type should I go with? I think the best people to answer that are those who work on Spark. So, from Learning Spark: Start with a standalone cluster if this is a new
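
In day-to-day use, the choice mostly shows up in the --master URL passed to spark-submit (host names, class, and jar are placeholders; 7077 and 5050 are the usual default ports):

    ./bin/spark-submit --master spark://master-host:7077 --class com.example.App app.jar  # Standalone
    ./bin/spark-submit --master yarn --class com.example.App app.jar                      # YARN, located via HADOOP_CONF_DIR
    ./bin/spark-submit --master mesos://master-host:5050 --class com.example.App app.jar  # Mesos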

What happens when Spark master fails?

∥☆過路亽.° submitted on 2019-11-27 16:19:32
Question: Does the driver need constant access to the master node? Or is it only required to get the initial resource allocation? What happens if the master is not available after the Spark context has been created? Does it mean the application will fail?

Answer 1: The first, and probably the most serious, consequence of a master failure or a network partition is that your cluster won't be able to accept new applications. This is why the Master is considered a single point of failure when the cluster is used
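
Standalone mode can remove this single point of failure with standby masters coordinated through ZooKeeper; a minimal sketch, assuming a ZooKeeper ensemble at the placeholder addresses below:

    # conf/spark-env.sh on every master node
    export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
      -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
      -Dspark.deploy.zookeeper.dir=/spark"

Applications then use a multi-master URL such as spark://host1:7077,host2:7077 so the driver can fail over to whichever master is elected leader.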