Question
I use UIMA in a process that analyzes text and extracts information from it. The pipeline fails when 6 processes run simultaneously.
I think I need a scale-out tool such as UIMA-DUCC or UIMA-AS, but it is not clear to me which one.
When should each one be used? What are their differences?
Answer 1:
UIMA-AS provides mechanisms for deploying a UIMA pipeline. Essentially, UIMA-AS allows users to put a queue in front of a UIMA component so that it can run in a different thread or in a different process. UIMA-AS handles threading and the interprocess transport of CASes. Other than some simple bash scripts, UIMA-AS does not provide life-cycle management for user processes.
DUCC is a cluster controller that, among other things, provides life-cycle management for UIMA-AS services. DUCC also provides a mechanism for scaling out a UIMA pipeline with multiple threads and multiple processes and for feeding work to the pipeline instances; this is called a DUCC Job. DUCC jobs are created from core UIMA components; no knowledge of UIMA-AS is required.
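To make the "queue in front of a component" idea concrete, here is a minimal conceptual sketch. It uses only java.util.concurrent and is not the UIMA-AS API: a plain String stands in for a CAS, the `annotator` function stands in for the UIMA component, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

/** Conceptual sketch only: plain java.util.concurrent, NOT the UIMA-AS API. */
final class QueuedComponentSketch {

    /**
     * Feeds every document through one shared queue to `threads` workers.
     * A String stands in for a CAS; `annotator` stands in for the UIMA component.
     * An empty Optional is the end-of-input marker (one is queued per worker).
     */
    static List<String> process(List<String> docs, int threads,
                                Function<String, String> annotator) {
        BlockingQueue<Optional<String>> queue = new LinkedBlockingQueue<>();
        ConcurrentLinkedQueue<String> results = new ConcurrentLinkedQueue<>();

        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            Thread worker = new Thread(() -> {
                try {
                    while (true) {
                        Optional<String> item = queue.take();
                        if (!item.isPresent()) {
                            return; // end-of-input marker: this worker is done
                        }
                        results.add(annotator.apply(item.get()));
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            worker.start();
            workers.add(worker);
        }

        // The producer only enqueues work; it never calls the component itself.
        for (String doc : docs) {
            queue.add(Optional.of(doc));
        }
        for (int i = 0; i < threads; i++) {
            queue.add(Optional.<String>empty());
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        List<String> sorted = new ArrayList<>(results);
        Collections.sort(sorted); // completion order is nondeterministic, so sort
        return sorted;
    }

    public static void main(String[] args) {
        List<String> out = process(Arrays.asList("b doc", "a doc", "c doc"), 2,
                                   String::toUpperCase);
        System.out.println(out); // [A DOC, B DOC, C DOC]
    }
}
```

In this picture, UIMA-AS roughly corresponds to the queue plus the worker threads, while starting, stopping and placing the worker processes on machines is the part that DUCC adds.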
Answer 2:
As the UIMA DUCC Book (duccbook) puts it:
UIMA-AS provides a scale-out mechanism to distribute UIMA pipelines over a cluster of computing resources, but does not provide job or cluster management of the resources. DUCC defines a formal job model that closely maps to a standard UIMA pipeline. Around this job model DUCC provides cluster management services to automate the scale-out of UIMA pipelines over computing clusters.
So if you want the framework to take care of job and cluster management, use UIMA-DUCC; otherwise, go with UIMA-AS.
As for how the two differ, the DUCC Book adds:
DUCC provides other facilities in support of scale-out:
1. The ability to reserve all or part of a node in the cluster.
2. Automated management of services required in support of jobs.
3. The ability to schedule and execute arbitrary processes on nodes in the cluster.
4. Debugging tools and support.
5. A web server to display and manage work and cluster status.
6. A CLI and a Java API to support the above.
Answer 3:
The question should probably be "What are the advantages of using DUCC on top of UIMA-AS?", because DUCC is a management layer on top of UIMA-AS.
If you just want to quickly deploy UIMA-AS pipelines, the basic UIMA-AS infrastructure (essentially UIMA on top of ActiveMQ, http://activemq.apache.org/) is all you need. Note, however, that the examples in the UIMA-AS documentation only show how to parallelize processing, not reading. Reading in the data may therefore become a bottleneck (unless you fully implement storing your data on different nodes as well as reading it from different nodes).
This is one of the things that DUCC solves for you. If you follow DUCC best practices, your data reads can be distributed using the WorkItem type (which is put on top of a CAS). DUCC more or less forces you to do this, which is good: if you follow its approach of a CollectionReader (which partitions the input data into blocks) plus a CASMultiplier (which does the actual distributed read), you can get a huge performance increase. Additionally, DUCC gives you a Hadoop-like web-based monitoring interface and some other nice features, such as memory allocation per compute node.
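The CollectionReader + CASMultiplier split described above can be sketched roughly as follows. This is a conceptual illustration in plain Java, not the DUCC WorkItem API: `Block`, `partition` and `readBlock` are hypothetical names, and `parallelStream` merely stands in for DUCC worker processes pulling work items.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;
import java.util.stream.Collectors;

/** Conceptual sketch only: hypothetical names, NOT the DUCC WorkItem API. */
final class WorkItemSketch {

    /** Hypothetical work item: a half-open range [start, end) of document indices. */
    static final class Block {
        final int start;
        final int end;

        Block(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** CollectionReader role: emit small block descriptors without reading any data. */
    static List<Block> partition(int totalDocs, int blockSize) {
        if (totalDocs < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("totalDocs must be >= 0 and blockSize > 0");
        }
        List<Block> blocks = new ArrayList<>();
        for (int start = 0; start < totalDocs; start += blockSize) {
            blocks.add(new Block(start, Math.min(start + blockSize, totalDocs)));
        }
        return blocks;
    }

    /** CAS multiplier role: the worker that receives a block reads its documents itself. */
    static List<String> readBlock(Block block, IntFunction<String> loader) {
        List<String> docs = new ArrayList<>();
        for (int i = block.start; i < block.end; i++) {
            docs.add(loader.apply(i));
        }
        return docs;
    }

    public static void main(String[] args) {
        List<Block> blocks = partition(10, 4); // [0,4), [4,8), [8,10)

        // Each block is read where it is processed, so reading is parallel too.
        List<String> all = blocks.parallelStream()
                .flatMap(b -> readBlock(b, i -> "doc-" + i).stream())
                .collect(Collectors.toList());

        System.out.println(blocks.size() + " blocks, " + all.size() + " documents");
        // prints "3 blocks, 10 documents"
    }
}
```

The point of the design is that the single reader only hands out cheap descriptors, so the expensive I/O happens in the scaled-out workers rather than in one place.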
If you are planning to run lots of pipelines and are willing to dig into DUCC, I'd definitely recommend DUCC. Of course, you'll have to learn UIMA-AS as well.
Answer 4:
In very simple terms, the answer to your question is: "DUCC is the answer to all the bottlenecks you may face in UIMA or UIMA-AS".
In DUCC, monitoring is easy, and you can set the memory size of a process and the number of threads per process just by changing a job description file.
Another advantage of DUCC over UIMA-AS is that the CR (CollectionReader) can now also be scaled using the Job Driver.
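As a rough illustration of the job description file mentioned above, the sketch below embeds a small key/value job specification and reads it with java.util.Properties. The property names (driver_descriptor_CR, process_descriptor_AE, process_memory_size, process_thread_count) are as I recall them from the DUCC example jobs and may differ between DUCC versions, so please check the DUCC Book for yours. The com.example descriptor names are hypothetical, and the memory size is assumed to be in GB.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Sketch only: property names are assumptions, verify them for your DUCC version. */
final class JobSpecSketch {

    // Illustrative job description; the com.example descriptors are hypothetical.
    static final String JOB_SPEC = String.join("\n",
            "description           Text extraction job (illustrative)",
            "driver_descriptor_CR  com.example.MyCollectionReader",
            "process_descriptor_AE com.example.MyAnalysisEngine",
            "process_memory_size   4",
            "process_thread_count  2");

    /** Parses whitespace-separated key/value lines using the standard Properties format. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties spec = parse(JOB_SPEC);
        int memoryGb = Integer.parseInt(spec.getProperty("process_memory_size"));
        int threads = Integer.parseInt(spec.getProperty("process_thread_count"));
        System.out.println(memoryGb + " GB per process, " + threads + " threads per process");
    }
}
```

Changing the memory or thread settings then means editing two lines in the file, with no change to the pipeline code.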
Source: https://stackoverflow.com/questions/29693732/uima-ducc-vs-uima-as