tool to extract data structures from unclean data

Submitted by 坚强是说给别人听的谎言 on 2019-12-13 05:43:44

Question


I have unstructured, generally unclean data in a database field. Some structures recur consistently in the data,

namely:

field:

name:value 

fieldset: 

nombre <FieldSet>
field,
  .
  .
  .
field(n)

table

nombre <table>
head(1)... head(n)
val(1)...  val(n)
      .
      .
      .

I was wondering whether there is a tool (preferably in Java) that could learn or understand these data structures, parse the file, and convert the result to a Map or object that I could run validation checks on.

I am aware of ANTLR, but my understanding is that it is geared more towards tree construction than towards independent bits of data (am I wrong about this?).

Does anyone have any suggestions for the problem as a whole?


Answer 1:


I recommend Talend. It is a very versatile, open-source data integration tool based on Java. You can use its built-in tools/components to extract data from unstructured data sources, and you can also write complex custom Java code to do what you want.

I used Talend in a couple of my scientific proof-of-concept projects, and it worked for me. The good part is that it is free!




Answer 2:


We ended up using ANTLR for this. It required us to write multiple lexers, where each lexer manipulated the input for the next one.

Another project is PADS, which is written in C.
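
The staged idea (one pass cleaning up the input for the next) can be illustrated without ANTLR. The following is only a minimal plain-Java sketch, not the setup we actually used: the class name `StagedExtractor`, the method names, and the concrete input rules are all assumptions made for illustration. It assumes that a block starts with a header line such as `someName <FieldSet>` or `someName <table>`, that a blank line ends a block, that fields look like `name:value` (possibly with a trailing comma), and that table cells are separated by whitespace and contain no spaces themselves.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical two-stage extractor; all names and input rules are assumptions.
class StagedExtractor {

    // Assumed header shape: "<name> <FieldSet>" or "<name> <table>" (case-insensitive).
    private static final Pattern BLOCK_HEADER =
            Pattern.compile("(?i)(.+?) ?<(fieldset|table)>");

    // Stage 1: clean every line so that stage 2 sees a predictable shape.
    static List<String> normalize(String raw) {
        List<String> out = new ArrayList<>();
        for (String line : raw.split("\\R", -1)) {
            String s = line.trim().replaceAll("\\s+", " "); // collapse whitespace runs
            if (s.endsWith(",")) {                          // drop a trailing comma
                s = s.substring(0, s.length() - 1).trim();
            }
            out.add(s);
        }
        return out;
    }

    // Stage 2: turn the normalized lines into a Map of fields, fieldsets and tables.
    static Map<String, Object> parse(String raw) {
        Map<String, Object> result = new LinkedHashMap<>();
        List<String> lines = normalize(raw);
        int i = 0;
        while (i < lines.size()) {
            String line = lines.get(i++);
            if (line.isEmpty()) {
                continue;
            }
            Matcher m = BLOCK_HEADER.matcher(line);
            if (!m.matches()) {
                putField(result, line);                     // stand-alone name:value
            } else if (m.group(2).equalsIgnoreCase("fieldset")) {
                Map<String, String> fields = new LinkedHashMap<>();
                while (i < lines.size() && !lines.get(i).isEmpty()) {
                    putField(fields, lines.get(i++));
                }
                result.put(m.group(1), fields);
            } else {
                result.put(m.group(1), readTable(lines, i));
                while (i < lines.size() && !lines.get(i).isEmpty()) {
                    i++;                                    // skip the consumed table lines
                }
            }
        }
        return result;
    }

    // Reads a header row and the value rows that follow, up to the next blank line.
    private static List<Map<String, String>> readTable(List<String> lines, int start) {
        List<Map<String, String>> rows = new ArrayList<>();
        if (start >= lines.size() || lines.get(start).isEmpty()) {
            return rows;
        }
        String[] heads = lines.get(start).split(" ");
        for (int i = start + 1; i < lines.size() && !lines.get(i).isEmpty(); i++) {
            String[] vals = lines.get(i).split(" ");
            Map<String, String> row = new LinkedHashMap<>();
            for (int c = 0; c < heads.length && c < vals.length; c++) {
                row.put(heads[c], vals[c]);
            }
            rows.add(row);
        }
        return rows;
    }

    // Stores "name:value"; lines without a usable colon are silently skipped.
    private static void putField(Map<String, ? super String> target, String line) {
        int colon = line.indexOf(':');
        if (colon > 0) {
            target.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
        }
    }
}
```

Calling `StagedExtractor.parse(text)` on such input returns a `Map` in which plain fields map to strings, fieldsets to nested maps, and tables to lists of row maps, which can then be fed to validation checks. Real unclean data will likely need more normalization rules in the first stage; the point of the split is that those rules can change without touching the second stage.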




Answer 3:


You should use "bnflite": https://github.com/r35382/bnflite. With this template library, you develop a BNF-like grammar for your text by means of classes and overloaded operators directly in C++ code. The benefit is that such a grammar is easily adjusted to your source.



Source: https://stackoverflow.com/questions/5465374/tool-to-extract-data-structures-from-unclean-data
