What is the difference between Google Cloud Dataflow and Google Cloud Dataproc?

孤街浪徒 2021-01-31 01:43

I am using Google Cloud Dataflow to implement an ETL data warehouse solution.

Looking into Google Cloud's offerings, it seems Dataproc can also do the same thing.

It

5 answers
  • 2021-01-31 02:19

    Yes, Cloud Dataflow and Cloud Dataproc can both be used to implement ETL data warehousing solutions.

    An overview of why each of these products exists can be found in the Google Cloud Platform Big Data Solutions articles.

    Quick takeaways:

    • Cloud Dataproc provides you with a Hadoop cluster, on GCP, and access to Hadoop-ecosystem tools (e.g. Apache Pig, Hive, and Spark); this has strong appeal if you are already familiar with Hadoop tools and have Hadoop jobs
    • Cloud Dataflow provides you with a place to run Apache Beam based jobs, on GCP, and you do not need to address common aspects of running jobs on a cluster (e.g. balancing work or scaling the number of workers for a job; by default, this is managed automatically for you, and applies to both batch and streaming) -- this can be very time consuming on other systems
      • Apache Beam is an important consideration; Beam jobs are intended to be portable across "runners," which include Cloud Dataflow, and enable you to focus on your logical computation, rather than how a "runner" works -- In comparison, when authoring a Spark job, your code is bound to the runner, Spark, and how that runner works
      • Cloud Dataflow also offers the ability to create jobs based on "templates," which can help simplify common tasks where the differences are parameter values
  • 2021-01-31 02:23

    Another important difference is:

    Cloud Dataproc:

    Data mining and analysis in datasets of known size

    Cloud Dataflow:

    Manage datasets of unpredictable size

    see

  • 2021-01-31 02:29

    Cloud Dataproc and Cloud Dataflow can both be used for data processing, and there’s overlap in their batch and streaming capabilities. You can decide which product is a better fit for your environment.

    Cloud Dataproc is good for environments dependent on specific Apache big data components:

    • Tools/packages
    • Pipelines
    • Skill sets of existing resources

    Cloud Dataflow is typically the preferred option for green field environments:

    • Less operational overhead
    • Unified approach to development of batch or streaming pipelines
    • Uses Apache Beam
    • Supports pipeline portability across Cloud Dataflow, Apache Spark, and Apache Flink as runtimes

    See more details here https://cloud.google.com/dataproc/

    Pricing comparison:

    • Dataproc

    • Dataflow

    If you want to calculate and compare the cost of more GCP resources, please refer to the pricing calculator at https://cloud.google.com/products/calculator/

  • 2021-01-31 02:32

    Same reason as why Dataproc offers both Hadoop and Spark: sometimes one programming model is the best fit for the job, sometimes the other. Likewise, in some cases the best fit for the job is the Apache Beam programming model, offered by Dataflow.

    In many cases, a big consideration is that one already has a codebase written against a particular framework and just wants to deploy it on Google Cloud. So even if, say, the Beam programming model is superior to Hadoop, someone with a lot of Hadoop code might still choose Dataproc for the time being, rather than rewriting their code on Beam to run on Dataflow.

    The differences between Spark and Beam programming models are quite large, and there are a lot of use cases where each one has a big advantage over the other. See https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison .

  • 2021-01-31 02:41

    Here are three main points to consider when choosing between Dataproc and Dataflow:

    • Provisioning
      Dataproc - Manual provisioning of clusters
      Dataflow - Serverless; worker resources are provisioned automatically

    • Hadoop Dependencies
      Dataproc should be used if the processing has any dependencies on tools in the Hadoop ecosystem.

    • Portability
      Dataflow/Beam provides a clear separation between processing logic and the underlying execution engine. This helps with portability across execution engines that have a Beam runner, i.e. the same pipeline code can run on Dataflow, Spark or Flink.

    This flowchart from the Google website explains how to choose one over the other.

    https://cloud.google.com/dataflow/images/flow-vs-proc-flowchart.svg

    Further details are available at the link below:
    https://cloud.google.com/dataproc/#fast--scalable-data-processing
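The three points above can be sketched as a small decision helper. This is only a simplification of this answer's reasoning, not a reproduction of Google's flowchart; the `Workload` fields and the `choose_service` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    # Hypothetical flags mirroring the three points in this answer.
    depends_on_hadoop_ecosystem: bool   # e.g. existing Spark, Hive or Pig jobs
    wants_serverless_provisioning: bool  # no manual cluster management
    needs_runner_portability: bool       # same code on Dataflow, Spark or Flink


def choose_service(workload: Workload) -> str:
    """Return a suggestion based only on the three points above."""
    # Hadoop-ecosystem dependencies are treated as the deciding factor.
    if workload.depends_on_hadoop_ecosystem:
        return "Dataproc"
    # Otherwise, serverless operation or Beam portability favour Dataflow.
    if workload.wants_serverless_provisioning or workload.needs_runner_portability:
        return "Dataflow"
    # Nothing in these three points separates the two services.
    return "either"


if __name__ == "__main__":
    existing_spark_shop = Workload(True, False, False)
    green_field_etl = Workload(False, True, True)
    print(choose_service(existing_spark_shop))
    print(choose_service(green_field_etl))
```

Checking the Hadoop dependency first is a design choice made here, reflecting the answer's statement that Dataproc should be used whenever such dependencies exist. Real decisions would also weigh cost and team skills, which this sketch leaves out.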
