Question
CPU usage is too high while running a Ruta script, so I plan to use a GPU. Do I need to do any additional processing to run the script on a GPU machine? Or is there an alternative solution to reduce the CPU usage?
Sample Script:
PACKAGE uima.ruta.example;
ENGINE utils.PlainTextAnnotator;
TYPESYSTEM utils.PlainTextTypeSystem;
WORDLIST EditorMarkerList = 'EditorMarker.txt';
WORDLIST EnglishStopWordList = 'EnglishStopWords.txt';
WORDLIST FirstNameList = 'FirstNames.txt';
WORDLIST JournalVolumeMarkerList = 'JournalVolumeMarker.txt';
WORDLIST MonthList = 'Months.txt';
WORDLIST PagesMarkerList = 'PagesMarker.txt';
WORDLIST PublisherList = 'Publishers.txt';
DECLARE EditorMarker, EnglishStopWord, FirstName, JournalVolumeMarker, Month, PagesMarker, PublisherInd;
Document{ -> MARKFAST(EditorMarker, EditorMarkerList)};
Document{ -> MARKFAST(EnglishStopWord,EnglishStopWordList)};
Document{ -> MARKFAST(FirstName, FirstNameList)};
Document{ -> MARKFAST(JournalVolumeMarker, JournalVolumeMarkerList)};
Document{ -> MARKFAST(Month, MonthList)};
Document{ -> MARKFAST(PagesMarker, PagesMarkerList)};
Document{ -> MARKFAST(PublisherInd, PublisherList)};
DECLARE Reference;
Document{-> EXEC(PlainTextAnnotator, {Line, Paragraph})};
Document{-> RETAINTYPE(SPACE, BREAK)};
Line{-REGEXP("CORA:.*") -> MARK(Reference)};
Reference{-> TRIM(SPACE, BREAK)};
Document{-> RETAINTYPE};
DECLARE LParen, RParen;
SPECIAL{REGEXP("[(]") -> MARK(LParen)};
SPECIAL{REGEXP("[)]") -> MARK(RParen)};
DECLARE YearInd;
NUM{REGEXP("19..|20..") -> MARK(YearInd, 1, 2)} SW?{REGEXP("a|b|c|d", true)};
Document{-> RETAINTYPE(SPACE)};
CAP YearInd{-> UNMARK(YearInd)};
Document{-> RETAINTYPE};
DECLARE NameLinker;
W{-PARTOF(NameLinker), REGEXP("and", true) -> MARK(NameLinker)};
COMMA{-PARTOF(NameLinker) -> MARK(NameLinker)};
SEMICOLON{-PARTOF(NameLinker) -> MARK(NameLinker)};
SPECIAL{-PARTOF(NameLinker), REGEXP("&") -> MARK(NameLinker)};
DECLARE FirstNameInd, FirstNameInitial, SingleChar;
CW{-PARTOF(FirstNameInitial), REGEXP(".")} SPECIAL{- PARTOF(FirstNameInitial), REGEXP("-")} CW{REGEXP(".") -> MARK(FirstNameInitial,1,2,3,4)} PERIOD;
SPECIAL{-PARTOF(FirstNameInitial), REGEXP("-")} CW{REGEXP(".") -> MARK(FirstNameInitial,1,2,3)} PERIOD;
CW{-PARTOF(FirstNameInitial), REGEXP(".") -> MARK(FirstNameInitial,1,2)} PERIOD;
CW{-PARTOF(FirstNameInitial), REGEXP(".") -> MARK(FirstNameInitial)} COMMA;
CW{-PARTOF(FirstNameInitial), REGEXP(".") -> MARK(SingleChar)};
DECLARE Quote, QuotedStuff;
SPECIAL[1,2]{REGEXP("[\"'´`‘’“]"), -PARTOF(Quote) -> MARK(Quote)};
Document{-> RETAINTYPE(SPACE)};
W Quote{-> UNMARK(Quote)} W;
Document{-> RETAINTYPE};
BLOCK(InRef) Reference{}{
Quote ANY+{-PARTOF(Quote) -> MARK(QuotedStuff, 1, 2, 3)} Quote;
}
DECLARE InInd;
W{REGEXP("In", true)-> MARK(InInd)};
DECLARE FirstToken, LastToken;
BLOCK(InRef) Reference{}{
ANY{POSITION(Reference,1) -> MARK(FirstToken)};
Document{-> MARKLAST(LastToken)};
}
DECLARE NumPeriod, NumComma, NumColon;
Document{-> RETAINTYPE(SPACE, BREAK)};
NUM PERIOD{-> MARKONCE(NumPeriod)} NUM;
NUM COMMA{-> MARKONCE(NumComma)} NUM;
NUM COLON{-> MARKONCE(NumColon)} NUM;
Document{-> RETAINTYPE};
DECLARE PeriodSep, CommaSep, ColonSep;
PERIOD{-PARTOF(FirstNameInitial), -PARTOF(NumPeriod), -PARTOF(FirstToken) -> MARKONCE (PeriodSep)};
COMMA{-PARTOF(FirstNameInitial), -PARTOF(NumComma), - PARTOF(FirstToken) -> MARKONCE (CommaSep)};
COLON{-PARTOF(FirstNameInitial), -PARTOF(NumColon), -PARTOF(FirstToken) -> MARKONCE (ColonSep)};
Answer 1:
I have no experience running Ruta on a GPU, and I do not know whether this brings any advantage compared to a parallelized process with multiple CPUs.
Ruta has become more and more imperative, with the consequence that you can write fast rules but also slow ones, depending on how much care you take.
Loosely speaking, each rule is an iterator over a specific type of annotation. If you have many iterators over general types, you have many index operations in UIMA. Index operations are where most of the time is spent, so they should be reduced, e.g., by reducing the number of annotations or by selecting better iterators/rules.
Your rule example contains many potential options to improve the runtime (only a first iteration of optimization):
Each MARKFAST causes two nested iterators over RutaBasic (all atomic text spans), so your script iterates seven times over the complete document. Instead, compile the word lists into an mtwl (multi tree word list) and use the TRIE action. Here's an example of how to do that: ruta-german-novel-with-dkpro
You have several duplicate starting anchors in consecutive rules, e.g., line 32+33 (the two SPECIAL rules that mark LParen and RParen). You can iterate once over SPECIAL with a BLOCK or with inlined rules:
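A minimal sketch of the TRIE variant, assuming the seven word lists have already been compiled into a single multi tree word list. The file name 'Dictionaries.mtwl' and the list name Dictionaries are hypothetical, and the trailing arguments simply switch off case-insensitive matching, edit distance and ignored characters; check them against the TRIE description in the Ruta guide before relying on them.

```ruta
// Assumption: 'Dictionaries.mtwl' is a hypothetical compiled word list
// that contains the entries of all seven text files.
WORDLIST Dictionaries = 'Dictionaries.mtwl';
DECLARE EditorMarker, EnglishStopWord, FirstName, JournalVolumeMarker, Month, PagesMarker, PublisherInd;

// One dictionary lookup pass instead of seven MARKFAST passes.
// Each source file name is mapped to the type that should be created.
Document{-> TRIE("EditorMarker.txt" = EditorMarker,
    "EnglishStopWords.txt" = EnglishStopWord,
    "FirstNames.txt" = FirstName,
    "JournalVolumeMarker.txt" = JournalVolumeMarker,
    "Months.txt" = Month,
    "PagesMarker.txt" = PagesMarker,
    "Publishers.txt" = PublisherInd,
    Dictionaries, false, 0, false, 0, "")};
```

This block would replace the seven WORDLIST declarations and the seven MARKFAST rules of the sample script.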
SPECIAL->{Document{REGEXP("[(]") -> MARK(LParen)};Document{REGEXP("[)]") -> MARK(RParen)};};
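The same idea can be sketched with a BLOCK instead of inlined rules. The block name Parens is made up for illustration; inside the block, Document refers to the SPECIAL annotation currently being processed.

```ruta
DECLARE LParen, RParen;

// Iterate once over SPECIAL; both rules are applied within each match.
BLOCK(Parens) SPECIAL{} {
    Document{REGEXP("[(]") -> MARK(LParen)};
    Document{REGEXP("[)]") -> MARK(RParen)};
}
```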
You can even combine these with the other similar rules by iterating once over ANY and classifying all of them in a single pass.
Your rules do not apply dynamic anchoring, i.e., you do not specify the starting anchor of the rule match. The rule in line 58 (W Quote{-> UNMARK(Quote)} W;), for example, needs to iterate over all words. This is not necessary, as you can instead iterate only over the Quote annotations with
W @Quote{-> UNMARK(Quote)} W;
which is much faster. Several other rules can be optimized this way.
If you have the same iterators but additionally a sequential dependency, as in lines 49-53 (the CW rules that mark FirstNameInitial and SingleChar), you should use the FOREACH block. Here, you can iterate over the CWs once and apply several rules anchored on each CW.
Some conditions are really slow. For example, you should avoid POSITION (line 69, the rule that marks FirstToken) and replace it with the MARKFIRST action.
As Renaud mentioned, the Ruta Workbench provides profiling functionality. It displays how long each part of your script (rule, block) took, and also which language elements (conditions, actions) required the most time. This gives you a good indication of which parts are worth optimizing.
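A rough sketch of the FOREACH idea, limited to the three simplest CW rules of the sample script. The variable name cw is arbitrary. Note the assumption that applying all three rules to one CW before moving on to the next yields the same annotations as the original rule-by-rule order; this has not been verified against the original output.

```ruta
DECLARE FirstNameInitial, SingleChar;

// Iterate once over all CW annotations; every rule in the block
// is anchored on the current annotation stored in cw.
FOREACH(cw) CW{} {
    cw{-PARTOF(FirstNameInitial), REGEXP(".") -> MARK(FirstNameInitial, 1, 2)} PERIOD;
    cw{-PARTOF(FirstNameInitial), REGEXP(".") -> MARK(FirstNameInitial)} COMMA;
    cw{-PARTOF(FirstNameInitial), REGEXP(".") -> MARK(SingleChar)};
}
```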
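Applied to the block of the sample script, the replacement might look like the following sketch. It assumes that the first token of each Reference under the current filtering is the same token that the POSITION rule would have selected.

```ruta
DECLARE FirstToken, LastToken;

BLOCK(InRef) Reference{} {
    // MARKFIRST annotates the first token of the matched Reference
    // directly, so no POSITION check on every ANY is needed.
    Document{-> MARKFIRST(FirstToken)};
    Document{-> MARKLAST(LastToken)};
}
```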
DISCLAIMER: I am a developer of UIMA Ruta
Source: https://stackoverflow.com/questions/40218061/cpu-usage-too-high-while-running-ruta-script