Question
I'm trying to write a Cascading (v1.2) cascade (http://docs.cascading.org/cascading/1.2/userguide/htmlsingle/#N20844) consisting of two flows:
1) The first flow outputs urls to a db table (in which they are automatically assigned ids via an auto-incrementing id value). This flow also outputs pairs of urls into a SequenceFile with field names "urlTo", "urlFrom".
2) The second flow reads from both these sources and does a CoGroup on "urlTo" (from the SequenceFile) and "url" (from the db source) to get the db record "id" for each "urlTo". It then does a CoGroup on "urlFrom" and "url" to get the db record "id" for each "urlFrom".
The two flows work individually if I call flow.complete() on the first before running the second flow. But if I put the two flows in a Cascade object, I get the error
cascading.cascade.CascadeException: no loops allowed in cascade, flow: urlLink*url*url, source: JDBCTap{connectionUrl='jdbc:mysql://localhost:3306/mydb', driverClassName='com.mysql.jdbc.Driver', tableDesc=TableDesc{tableName='urls', columnNames=null, columnDefs=null, primaryKeys=null}}, sink: JDBCTap{connectionUrl='jdbc:mysql://localhost:3306/mydb', driverClassName='com.mysql.jdbc.Driver', tableDesc=TableDesc{tableName='url_link', columnNames=[urlLinkFrom, urlLinkTo], columnDefs=[bigint(20), bigint(20)], primaryKeys=[urlLinkFrom, urlLinkTo]}}
when trying to configure the cascade.
I can see it's coming from the addEdgeFor function of the CascadeConnector, but I'm not clear on how to resolve this problem. I've never used Cascade / CascadeConnector before. Is there something I'm missing?
Answer 1:
It seems that some of your source and sink paths are the same.
A Cascade uses a directed graph to order its flows, and that graph must not contain loops. If a flow's source and its sink point to the same location, that in essence creates a loop, because the flow does not go from:
Source Location A to Sink Location B
but instead goes from:
Source Location A to Sink Location A.
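The idea can be illustrated with a small plain-Java sketch. It does not use the Cascading API; the `FlowEdge` class and `findSelfLoops` method are hypothetical names made up for this illustration, and each flow is reduced to one source identifier and one sink identifier.

```java
import java.util.ArrayList;
import java.util.List;

public class CascadeLoopSketch {

    /** Hypothetical model of a flow: one source identifier and one sink identifier. */
    static final class FlowEdge {
        final String sourceId;
        final String sinkId;

        FlowEdge(String sourceId, String sinkId) {
            this.sourceId = sourceId;
            this.sinkId = sinkId;
        }
    }

    /** Returns every edge whose source and sink identifiers are equal (a self-loop). */
    static List<FlowEdge> findSelfLoops(List<FlowEdge> edges) {
        List<FlowEdge> loops = new ArrayList<>();
        for (FlowEdge edge : edges) {
            if (edge.sourceId.equals(edge.sinkId)) {
                loops.add(edge);
            }
        }
        return loops;
    }

    public static void main(String[] args) {
        List<FlowEdge> edges = new ArrayList<>();
        // Source and sink at different locations: a normal edge.
        edges.add(new FlowEdge("Location A", "Location B"));
        // Source and sink at the same location: a loop.
        edges.add(new FlowEdge("Location A", "Location A"));

        for (FlowEdge loop : findSelfLoops(edges)) {
            System.out.println("loop at: " + loop.sourceId);
        }
    }
}
```

Only the second edge is reported, since its two endpoints are the same vertex in the graph.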
Answer 2:
"A Tap is not given an explicit name by design. This is so a given Tap instance can be re-used in different {@link Flow}s that may expect a source or sink by a different logical name, but are the same physical resource."
"In general, two instances of the same Tap class must have differing Identifiers (and different #equals)."
It turns out that JDBCTaps generate their identifier from the connection url alone (and do not include the table name). Because I was reading from one table and writing to a different table in the same database, the source and sink looked like the same Tap, which was reported as a loop.
As a work-around, I'm going to subclass the JDBCTap and override the getIdentifier() method to include the table name.
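A minimal plain-Java sketch of why adding the table name helps. This is not the real JDBCTap code: `urlOnlyIdentifier` and `urlAndTableIdentifier` are hypothetical helpers, and the `/` separator is an arbitrary choice for illustration. In an actual workaround, the second form would be what the overridden getIdentifier() returns.

```java
public class TapIdentifierSketch {

    /** Hypothetical: identifier built from the connection url only, ignoring the table. */
    static String urlOnlyIdentifier(String connectionUrl, String tableName) {
        return connectionUrl;
    }

    /** Hypothetical: identifier that also includes the table name. */
    static String urlAndTableIdentifier(String connectionUrl, String tableName) {
        return connectionUrl + "/" + tableName;
    }

    public static void main(String[] args) {
        String url = "jdbc:mysql://localhost:3306/mydb";

        // Same database, different tables: the url-only identifiers collide.
        boolean collides = urlOnlyIdentifier(url, "urls")
                .equals(urlOnlyIdentifier(url, "url_link"));

        // With the table name included, the two taps are distinguishable.
        boolean distinct = !urlAndTableIdentifier(url, "urls")
                .equals(urlAndTableIdentifier(url, "url_link"));

        System.out.println("url-only identifiers collide: " + collides);
        System.out.println("url+table identifiers differ: " + distinct);
    }
}
```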
Source: https://stackoverflow.com/questions/17679363/hadoop-cascading-cascadeexception-no-loops-allowed-in-cascade-when-cogroup-p