Html Annotator, Html Converter in UIMA Ruta

僤鯓⒐⒋嵵緔 submitted on 2019-12-05 21:31:20

Note that your example script does not contain the TEIViewWriter you mention. The problem is the same, however.

Unfortunately, the example script has an error:

The line

Document{ -> CONFIGURE(ViewWriter, "inputView" = "plain",...

should read

Document{ -> CONFIGURE(HtmlViewWriter, "inputView" = "plain",

... then the NPE is gone. Another exception can occur if the input text cannot be parsed by the HtmlParser, which results in a missing Sofa in the XMI file. Wrapping the text in enclosing HTML tags could help here.

The files HtmlConverter.ruta and TEIConverter.ruta are indeed good examples for these components.

The HtmlAnnotator creates annotations for HTML and XML tags/elements. The HtmlConverter removes all HTML/XML tags, stores the resulting text in a new view, and recalculates the offsets of the annotations. The TEIViewWriter is just a ViewWriter with a specific type system; it copies a specific view to a new CAS and stores it. Together, these components can convert a TEI/HTML/XML text to plain text with annotations for the XML markup.
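To make the idea of "remove the tags and recalculate the offsets" more concrete, here is a minimal, language-neutral sketch in Python. It is not UIMA Ruta code and does not reflect how the HtmlConverter is actually implemented; the function names `strip_tags_with_offsets` and `convert_annotation` are hypothetical, and the tag pattern is a deliberately naive assumption that ignores entities, comments and CDATA sections.

```python
import re

# Naive tag pattern (assumption for illustration only).
TAG = re.compile(r"<[^>]+>")

def strip_tags_with_offsets(markup):
    """Return (plain_text, offset_map).

    offset_map[i] is the position in plain_text that corresponds to
    position i in markup; it has one extra entry for the end offset.
    """
    plain = []
    offset_map = []
    pos = 0
    for match in TAG.finditer(markup):
        # Characters before the tag are copied and mapped one to one.
        for ch in markup[pos:match.start()]:
            offset_map.append(len(plain))
            plain.append(ch)
        # Every character of the tag collapses to the current plain position.
        offset_map.extend([len(plain)] * (match.end() - match.start()))
        pos = match.end()
    # Copy any text after the last tag.
    for ch in markup[pos:]:
        offset_map.append(len(plain))
        plain.append(ch)
    offset_map.append(len(plain))  # entry for the end-of-document offset
    return "".join(plain), offset_map

def convert_annotation(begin, end, offset_map):
    """Translate an annotation span from markup offsets to plain-text offsets."""
    return offset_map[begin], offset_map[end]

markup = "<p>Hello <b>world</b></p>"
plain, omap = strip_tags_with_offsets(markup)

# An annotation covering the whole <b>...</b> element in the markup view.
b_begin = markup.index("<b>")
b_end = markup.index("</b>") + len("</b>")
new_begin, new_end = convert_annotation(b_begin, b_end, omap)
```

In this sketch the annotation for the `<b>` element ends up covering exactly the word "world" in the tag-free text, which is the effect described above: the markup is gone from the text, but the annotations that describe it survive with adjusted offsets.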

The documentation contains more information, e.g., about the configuration parameters.

DISCLAIMER: I am a developer of UIMA Ruta
