Hadoop 2.7, Spark, Hive, JasperReports, Sqoop - Architecture


Question


First of all, this is not a question asking for step-by-step help deploying the components below. What I'm asking for is advice on how the architecture should be designed. What I'm planning to do is develop a reporting platform using existing data. The following is what I have gathered so far through research.

I have an existing RDBMS which has a large number of records. So I'm planning to use:

  • Sqoop - extract data from the RDBMS into Hadoop
  • Hadoop - storage platform
  • Hive - data warehouse
  • Spark - since Hive is geared towards batch processing, running Spark on Hive should speed things up
  • JasperReports - to generate reports

What I have done up to now is deploy a Hadoop 2 cluster as follows:

  • 192.168.X.A - Namenode
  • 192.168.X.B - 2nd Namenode
  • 192.168.X.C - Slave1
  • 192.168.X.D - Slave2
  • 192.168.X.E - Slave3

My problems are

  • On which node should I deploy Spark, A or B, given that I want to support fail-over? That's why I have a separate namenode configured on B.
  • Should I deploy Spark on each and every instance? Which nodes should be the workers?
  • On which node should I deploy Hive? Is there a better alternative to Hive?
  • How should I connect JasperReports? And to where? To Hive or Spark?

Please suggest a suitable way to design the architecture, and provide an elaborated answer.

Note that if you can provide any technical guides or case studies of a similar nature, it would be really helpful.


Answer 1:


You've mostly figured it out already! All my answers are merely general opinions and might change drastically depending on the data and the kinds of operations to be performed. The question also implies that the data and the results of such operations are mission critical, so I have assumed as much.

Spark on Hive will speed things up

Not necessarily correct. As anecdotal evidence, this post (by Cloudera) suggests quite the opposite. There is actually a move in the other direction, i.e. Hive on Spark.

On which node should I deploy Spark, A or B, given that I want to support fail-over? That's why I have a separate namenode configured on B. Should I deploy Spark on each and every instance? Which nodes should be the workers?

Definitely deploy it on every node, in most cases anyway. Set A or B as the master; all of the rest can be worker nodes. If you don't want a SPOF in your architecture, see the high-availability section of the Spark documentation; it requires a bit of extra work.
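As a minimal sketch of the fail-over part (assuming the standalone cluster manager with ZooKeeper-based master recovery is already configured on A and B; hostnames and ports are placeholders taken from the cluster above, and the Spark 2.x SparkSession API is used), a driver can list both masters so it can fail over if the active one goes down:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch, assuming standalone mode with ZooKeeper-based master
// recovery (spark.deploy.recoveryMode=ZOOKEEPER) already set up on A and B.
// Hostnames and the port are placeholders.
val spark = SparkSession.builder()
  .appName("reporting-job")
  // Listing both masters lets the driver fail over to the standby master.
  .master("spark://192.168.X.A:7077,192.168.X.B:7077")
  .getOrCreate()
```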

Is there a better alternative to Hive?

This one is both subjective and task-specific. If SQL querying feels natural and fits the task, there is also Impala, promoted by Cloudera, which claims to perform an order of magnitude faster than Hive, but it is something of a stranger in the Apache Hadoop ecosystem. With Spark, and if you are fine typing a bit of Python or Scala, you can do SQL-like querying while still enjoying the expressive power these languages provide.
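For example, here is a minimal Scala sketch of that style of work (table and column names are hypothetical; it uses the Spark 2.x SparkSession API, where older releases would use HiveContext instead): an SQL query over a Hive table, followed by ordinary DataFrame operations.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-query-example")
  .enableHiveSupport()   // read tables registered in the existing Hive metastore
  .getOrCreate()
import spark.implicits._

// SQL-like querying over a Hive table (table/column names are hypothetical)
val totals = spark.sql(
  "SELECT region, SUM(amount) AS total FROM sales GROUP BY region")

// ...continued with ordinary Scala/DataFrame operations
totals.filter($"total" > 10000).orderBy($"total".desc).show()
```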

How should I connect JasperReports? And to where? To Hive or Spark?

Don't know about this one.



Source: https://stackoverflow.com/questions/33635234/hadoop-2-7-spark-hive-jasperreports-scoop-architecuture
