Hadoop 2.7, Spark, Hive, JasperReports, Sqoop - Architecture


Question


First of all, this is not a question asking for step-by-step help deploying the components below. What I'm asking for is advice on how the architecture should be designed. What I'm planning to do is develop a reporting platform using existing data. The following is what I have gathered so far through research.

I have an existing RDBMS which has a large number of records. So I'm planning to use:

  • Sqoop - extract data from the RDBMS into Hadoop
  • Hadoop - storage platform
  • Hive - data warehouse
  • Spark - since Hive is geared towards batch processing, running Spark on Hive should speed things up
  • JasperReports - to generate reports

What I have done up to now is deploy a Hadoop 2 cluster as follows:

  • 192.168.X.A - Namenode
  • 192.168.X.B - 2nd Namenode
  • 192.168.X.C - Slave1
  • 192.168.X.D - Slave2
  • 192.168.X.E - Slave3

My problems are

  • On which node should I deploy Spark, A or B, given that I want to support fail-over? That's why I have a separate namenode configured on B.
  • Should I deploy Spark on each and every instance? Which nodes should be the workers?
  • On which node should I deploy Hive? Is there a better alternative to Hive?
  • How should I connect JasperReports? And to where? To Hive or Spark?

Please suggest a suitable way to design the architecture, and provide an elaborated answer.

Note that if you can provide any technical guides or case studies of a similar nature, it would be really helpful.


Answer 1:


You've mostly figured it out already! All my answers are merely general opinions and might change drastically depending on the data and the kinds of operations to be performed. The question also implies that the data and the results of such operations are mission critical, so I have assumed as much.

Spark on Hive will speed things up

Not necessarily correct. As anecdotal evidence, this post (by Cloudera) suggests quite the opposite. There is actually a move in the other direction, i.e. Hive on Spark.

On which node should I deploy Spark, A or B, given that I want to support fail-over? That's why I have a separate namenode configured on B. Should I deploy Spark on each and every instance? Which nodes should be the workers?

Definitely deploy it on every node, in most cases anyway. Set A or B as the master; all of the rest can be worker nodes. If you don't want a SPOF in your architecture, see the high-availability section of the Spark documentation; it requires a bit of extra work.
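As a minimal sketch of the fail-over part (assuming the standalone cluster manager with ZooKeeper-based master recovery is already configured on A and B; hostnames and ports are placeholders taken from the cluster above, and the Spark 2.x SparkSession API is used), a driver can list both masters so it can fail over if the active one goes down:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch, assuming standalone mode with ZooKeeper-based master
// recovery (spark.deploy.recoveryMode=ZOOKEEPER) already set up on A and B.
// Hostnames and the port are placeholders.
val spark = SparkSession.builder()
  .appName("reporting-job")
  // Listing both masters lets the driver fail over to the standby master.
  .master("spark://192.168.X.A:7077,192.168.X.B:7077")
  .getOrCreate()
```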

Is there a better alternative to Hive?

This one is both subjective and task-specific. If SQL querying feels natural and fits the task, there is also Impala, promoted by Cloudera, which claims to perform an order of magnitude faster than Hive, but it is something of a stranger in the Apache Hadoop ecosystem. With Spark, and if you are fine typing a bit of Python or Scala, you can do SQL-like querying while still enjoying the expressive power these languages provide.
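For example, here is a minimal Scala sketch of that style of work (table and column names are hypothetical; it uses the Spark 2.x SparkSession API, where older releases would use HiveContext instead): an SQL query over a Hive table, followed by ordinary DataFrame operations.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-query-example")
  .enableHiveSupport()   // read tables registered in the existing Hive metastore
  .getOrCreate()
import spark.implicits._

// SQL-like querying over a Hive table (table/column names are hypothetical)
val totals = spark.sql(
  "SELECT region, SUM(amount) AS total FROM sales GROUP BY region")

// ...continued with ordinary Scala/DataFrame operations
totals.filter($"total" > 10000).orderBy($"total".desc).show()
```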

How should I connect JasperReports? And to where? To Hive or Spark?

Don't know about this one.



Source: https://stackoverflow.com/questions/33635234/hadoop-2-7-spark-hive-jasperreports-scoop-architecuture
