Hadoop Cascading : CascadeException “no loops allowed in cascade” when cogroup pipes twice

Submitted by 旧巷老猫 on 2019-12-11 03:27:59

Question


I'm trying to write a Cascading (v1.2) cascade (http://docs.cascading.org/cascading/1.2/userguide/htmlsingle/#N20844) consisting of two flows:

1) The first flow outputs URLs to a db table (in which they are automatically assigned ids via an auto-incrementing id column). This flow also outputs pairs of URLs into a SequenceFile with field names "urlTo", "urlFrom".

2) The second flow reads from both these sources and tries to do a CoGroup on "urlTo" (from the SequenceFile) and "url" (from the db source) to get the db record "id" for each "urlTo".

It then does a CoGroup on "urlFrom" and "url" to get the db record "id" for each "urlFrom".

The two flows work individually if I call flow.complete() on the first before running the second. But if I put the two flows in a Cascade object, I get the error

cascading.cascade.CascadeException: no loops allowed in cascade, flow: urlLink*url*url, source: JDBCTap{connectionUrl='jdbc:mysql://localhost:3306/mydb', driverClassName='com.mysql.jdbc.Driver', tableDesc=TableDesc{tableName='urls', columnNames=null, columnDefs=null, primaryKeys=null}}, sink: JDBCTap{connectionUrl='jdbc:mysql://localhost:3306/mydb', driverClassName='com.mysql.jdbc.Driver', tableDesc=TableDesc{tableName='url_link', columnNames=[urlLinkFrom, urlLinkTo], columnDefs=[bigint(20), bigint(20)], primaryKeys=[urlLinkFrom, urlLinkTo]}}

on trying to configure the cascade.

I can see it's coming from the addEdgeFor function of the CascadeConnector but I'm not clear on how to resolve this problem.

I've never used Cascade / CascadeConnector before. Is there something I'm missing?


Answer 1:


It seems that some of your source and sink paths are the same.

A Cascade is built as a directed graph, so if a flow has a source and a sink pointing to the same location, that in essence creates a loop, which is not allowed in the graph, since

it does not go from:

  • Source Location A to Sink Location B

but instead goes from:

  • Source Location A to Sink Location A.
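
The loop described above can be illustrated with a small, self-contained sketch. It does not use the Cascading API: `FlowEndpoints` and `CascadeLoopSketch` are hypothetical stand-ins, `links.seq` is a made-up SequenceFile path, and the check only models the simplest case, where one flow reads from and writes to taps that report the same identifier.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a flow: only the identifiers of its source and
// sink taps are modeled. This is not the Cascading Flow class.
final class FlowEndpoints {
    final String name;
    final List<String> sourceIds;
    final List<String> sinkIds;

    FlowEndpoints(String name, List<String> sourceIds, List<String> sinkIds) {
        this.name = name;
        this.sourceIds = sourceIds;
        this.sinkIds = sinkIds;
    }
}

final class CascadeLoopSketch {
    // Returns the name of the first flow whose source and sink share an
    // identifier (an edge from a node back to itself), or null if none does.
    static String findSelfLoop(List<FlowEndpoints> flows) {
        for (FlowEndpoints flow : flows) {
            for (String source : flow.sourceIds) {
                if (flow.sinkIds.contains(source)) {
                    return flow.name;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Both JDBC taps are represented by the same string, as they would be
        // if the identifier ignored the table name.
        String db = "jdbc:mysql://localhost:3306/mydb";
        FlowEndpoints second = new FlowEndpoints("urlLink*url*url",
                Arrays.asList("links.seq", db), Arrays.asList(db));
        System.out.println("loop in flow: " + findSelfLoop(Arrays.asList(second)));
    }
}
```

If the source and sink identifiers differ, `findSelfLoop` returns null, which corresponds to the "Source Location A to Sink Location B" case.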



Answer 2:


"A Tap is not given an explicit name by design. This is so a given Tap instance can be re-used in different {@link Flow}s that may expect a source or sink by a different logical name, but are the same physical resource."

"In general, two instances of the same Tap class must have differing Identifiers (and different #equals)."

It turns out that JDBCTaps generate their identifier from the connection URL alone (and do not include the table name). So, because I was reading from one table and writing to a different table in the same database, it looked as if I was reading from and writing to the same Tap, which caused the loop.

As a workaround, I'm going to subclass JDBCTap and override the getIdentifier() method to include the table name.
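
A minimal sketch of that idea, using stand-in classes rather than the real JDBCTap (whose constructors and internals are not shown in this thread). `UrlOnlyTap`, `TableAwareTap`, and the `/` separator are all assumptions made for illustration; only the idea of overriding `getIdentifier()` to include the table name comes from the answer.

```java
// Stand-in for a tap whose identifier is derived from the connection URL
// alone. This is NOT the real JDBCTap class.
class UrlOnlyTap {
    protected final String connectionUrl;
    protected final String tableName;

    UrlOnlyTap(String connectionUrl, String tableName) {
        this.connectionUrl = connectionUrl;
        this.tableName = tableName;
    }

    String getIdentifier() {
        return connectionUrl;
    }
}

// Hypothetical subclass: the identifier also carries the table name, so two
// tables in the same database no longer look like the same resource.
class TableAwareTap extends UrlOnlyTap {
    TableAwareTap(String connectionUrl, String tableName) {
        super(connectionUrl, tableName);
    }

    @Override
    String getIdentifier() {
        // The "/" separator is an arbitrary choice for this sketch.
        return connectionUrl + "/" + tableName;
    }
}

final class IdentifierSketch {
    public static void main(String[] args) {
        String db = "jdbc:mysql://localhost:3306/mydb";
        UrlOnlyTap source = new TableAwareTap(db, "urls");
        UrlOnlyTap sink = new TableAwareTap(db, "url_link");
        System.out.println(source.getIdentifier());
        System.out.println(sink.getIdentifier());
        System.out.println("same identifier: "
                + source.getIdentifier().equals(sink.getIdentifier()));
    }
}
```

With the real library, the subclass would extend the actual JDBCTap and build the identifier from whatever table information that class exposes.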



Source: https://stackoverflow.com/questions/17679363/hadoop-cascading-cascadeexception-no-loops-allowed-in-cascade-when-cogroup-p
