Can anyone briefly explain the HtmlAnnotator, HtmlConverter and TEIViewWriter with some examples? I want to create annotations in the initial view.
Awaiting an answer.
Main Script:
PACKAGE uima.ruta.example;
SCRIPT uima.ruta.example.Html;
Document{-> EXEC(Html)};
WORDLIST JOURNALNAMELIST='JournalName.txt';
WORDLIST CITYPUBLIST='CITYPUB.txt';
DECLARE JOURNALNAME;
DECLARE CITYPUB;
Document{ -> MARKFAST(JOURNALNAME, JOURNALNAMELIST)};
Document{ -> MARKFAST(CITYPUB, CITYPUBLIST)};
DECLARE Reference;
"<a name=para(.+?)>(.+?)</a>"-> 2=Reference;
DECLARE FirstToken, LastToken;
BLOCK(InRef) Reference{}
{
ANY{POSITION(Reference,1) -> MARK(FirstToken)};
Document{-> MARKLAST(LastToken)};
}
DECLARE FIRSTWORD;
FirstToken PERIOD CW {->MARK(FIRSTWORD)};
Html Script:
PACKAGE uima.ruta.example;
ENGINE utils.HtmlAnnotator;
ENGINE utils.HtmlConverter;
ENGINE utils.HtmlViewWriter;
TYPESYSTEM utils.HtmlTypeSystem;
TYPESYSTEM utils.SourceDocumentInformation;
Document{-> EXEC(HtmlAnnotator)};
Document { -> CONFIGURE(HtmlConverter, "inputView" = "_InitialView","outputView" = "plain"),
EXEC(HtmlConverter)};
Document{ -> CONFIGURE(HtmlViewWriter, "inputView" = "plain","outputView" = "_InitialView", "output" = "E:/ruta-2.4.0-source-release/ruta-2.4.0/example-projects/TextRulerExample/output"),
EXEC(HtmlViewWriter)};
Sample HTML input file (manually converted to HTML by changing the extension):
<html>
<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=Generator content="Microsoft Word 14 (filtered)">
<style>
<!--
/* Font Definitions */
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin-top:0in;
margin-right:0in;
margin-bottom:10.0pt;
margin-left:0in;
line-height:115%;
font-size:11.0pt;
font-family:"Calibri","sans-serif";}
span.DAZZLEFN
{mso-style-name:DAZZLEFN;}
span.DAZZLELN
{mso-style-name:DAZZLELN;
color:#92D050;}
.MsoChpDefault
{font-family:"Calibri","sans-serif";}
.MsoPapDefault
{margin-bottom:10.0pt;
line-height:115%;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
-->
</style>
</head>
<body lang=EN-US>
<div class=WordSection1>
<p class=MsoNormal><a name=para0>REFERENCES</a></p>
<p class=MsoNormal><a name=para1>1. Lawrence RA. A review of the
medical benefits and contraindications to breastfeeding in the United States
[Internet] . Arlington (VA): National Center for Education in Maternal and
Child Health; 1997 Oct [cited 2000 Apr 24]. p. 40. Available from:
www.ncemch.org/pubs/PDFs/Welcometojungle.pdf.</a></p>
<p class=MsoNormal><a name=para2>2. Shishido A. Retraction notice:
Effect of platinum compounds on murine lymphocyte mitogenesis [Retraction of
Alsabti EA, Ghalib ON, Salem MH. In: Jpn J Med Biol 1979 Apr; 32(2):53-65]. Jpn
J Med Sci Biol 1980 Aug;33(4):235-237.</a></p>
<p class=MsoNormal><a name=para3>3. Leist TP, Zinkernagel RM.
Effects of treatment with IL-2 receptor specific monoclonal antibody in mice
[letter] [Retraction of Leist TP, Kohler M, Eppler M, Zinkernagel RM. In: J
Immunol 1989 Jul 15; 143(2): 628-32]. J Immunol 1990 Apr 1;144(7):2847.</a> </p>
<p class=MsoNormal><a name=para4>4. Alsabti EA, Ghalib ON, Salem MH.
Effect of platinum compounds on murine lymphocyte mitogenesis [Retracted by
Shishido A. In: Jpn J Med Sci Biol 1980 Aug; 33(4):235-7]. Jpn J Med Sci Biol
1979 Apr;32(2):53-65.</a></p>
<p class=MsoNormal><a name=para5>5. Tidy JA, Parry GC, Ward P,
Coleman DV, Peto J, Malcolm AD, Farrell PJ. High rate of papillomavirus type 16
infection in cytologically normal cervices [letter] [Retracted by Tidy J,
Farrell PJ. In: Lancet 1989 Dec 23-30:2(8678-8679):1535]. Lancet 1989 Feb 25;1(8635):434.</a></p>
<p class=MsoNormal><a name=para6>6. Magni F, Rossoni G, Berti F.
BN-52021 protects guinea-pig from heard anaphylaxis. Pharm Res Commun 1988
Dec;20 Suppl 5:75-78.</a></p>
<p class=MsoNormal><a name=para7>7. Garvia EE, DeHaven ED. An
experimental analysis of response acquisition and elimination with positive
reinforcers. Behav Neuropsychiatry 1975 a April-1976 May;7(1-12):71-78.</a> </p>
<p class=MsoNormal><a name=para8>8. Mueller FO, Schindler RD. Annual
survey of football injury research 1931-1985. [place unknown]: American
Football Coaches Assn; 1986. 24 p.</a></p>
<p class=MsoNormal><a name=para9>9. Stern, Michael P. National
Institute of Arthritis, Diabetes, and Digestive and Kidney Diseases. Diabetes
in America: diabetes data compiled 1984.. [Bethesda (MD)]: The Institute; 1985
Aug. Diabetes in Hispanic Americans. Chapter 9. (NIH publication; no. 86- 1468).</a></p>
<p class=MsoNormal><a name=para10>10. Vivian, Valerie L, editor. Child
abuse and neglect: a medical community response. 1st AMA National Conference on
Child Abuse and Neglect; 1984 March 30-June 31; Chicago. Chicago: American
Medical Association; 1985. 256 p.</a></p>
<p class=MsoNormal><a name=para11>11. Popper, Hans, et al., editors.
Structural carbohydrates in the liver: proceedings of the 34th Falk Symposium;
1982 oct 12-19; Basil, Switzerland.Boston: MTB Press; 1983. 701 p.</a></p>
<p class=MsoNormal><a name=para12></a> </p>
</div>
</body>
</html>
Note that your example script does not contain the mentioned TEIViewWriter. The problem is the same, however.
Unfortunately, the exemplary script has an error:
The line
Document{ -> CONFIGURE(ViewWriter, "inputView" = "plain",...
should read
Document{ -> CONFIGURE(HtmlViewWriter, "inputView" = "plain",
... then the NPE is gone. There could be another exception if the input text is not parseable by the HtmlParser, resulting in a missing Sofa in the XMI file. Wrapping the text in an enclosing HTML element could help here.
The files HtmlConverter.ruta and TEIConverter.ruta are indeed good examples for these components. The HtmlAnnotator creates annotations for HTML and XML tags/elements. The HtmlConverter removes all HTML/XML tags, stores the resulting text in a new view and recalculates the offsets of the annotations. The TEIViewWriter is just a ViewWriter with a specific type system, which copies a specific view to a new CAS and stores it. Together, these components are able to convert a TEI/HTML/XML text to plain text with annotations for the XML markup.
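As a rough illustration of the HtmlAnnotator on its own, the following minimal sketch creates annotations directly in the initial view and reuses them instead of a regular expression rule. It assumes that the utils descriptors are available on the descriptor path, as in the Html script of the question, and that the HtmlTypeSystem provides a type named A for anchor elements; both points should be checked against the installed Ruta version.

```ruta
PACKAGE uima.ruta.example;

// Assumption: the utils descriptors are resolvable, as in the question's Html script.
ENGINE utils.HtmlAnnotator;
TYPESYSTEM utils.HtmlTypeSystem;

DECLARE Reference;

// Create annotations for the HTML elements in the current (initial) view.
Document{-> EXEC(HtmlAnnotator)};

// Assumption: HtmlTypeSystem declares a type A for <a> elements.
// Each anchor element is then marked as a Reference without a regex rule.
A{-> MARK(Reference)};
```

Because no HtmlConverter or view writer is involved here, the Reference annotations stay in the initial view, and further rules such as the MARKFAST lines of the main script can operate on the same view.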
The documentation contains more information, e.g., about the configuration parameters.
DISCLAIMER: I am a developer of UIMA Ruta
Source: https://stackoverflow.com/questions/37154527/html-annotator-html-converter-in-uima-ruta