Counting all unique words in an unstructured document using index data

Submitted by 北慕城南 on 2020-01-04 04:46:18

Question


I've loaded unstructured HTML documents into MarkLogic and, for any given document URI, I need a way to use indexes/lexicons to provide a word count for all unique words.

For example, say I have the file below, saved under the URI "/html/example.html":

<html>
<head><title>EXAMPLE</title></head>
<body>
<h1>This is a header</h1>
<div class="highlight">This word is highlighted</div>
<p> And these words are inside a paragraph tag</p>
</body>
</html>

In XQuery, I'd call my function by passing in the URI and get the following results:

EXAMPLE 1
This 2
is 2
a 2
header 1
word 1
highlighted 1
And 1
these 1
words 1
are 1
inside 1
paragraph 1
tag 1

Note that I only need a word count on words inside of tags, not on the tags themselves.

Is there any way to do this efficiently (using index or lexicon data)?

Thanks,

grifster


Answer 1:


You're asking for word counts "for any given document URI". But you are assuming that the solution involves indexes or lexicons, and that's not necessarily a good assumption. If you want something document-specific from a document-oriented database, it's often best to work on the document directly.

So let's focus on an efficient word-count solution for a single document, and go from there. OK?

Here's how we could get word counts for a single element, including any children. This could be the root of your document: doc($uri)/*.

declare function local:word-count($root as element())
as map:map
{
  let $m := map:map()
  (: Tokenize every text node under $root, keep only the
     word tokens, and increment each word's count in the map. :)
  let $_ := cts:tokenize(
    $root//text())[. instance of cts:word]
    ! map:put($m, ., 1 + (map:get($m, .), 0)[1])
  return $m
};

This produces a map, which I find more flexible than flat text. Each key is a word, and the value is the count. In the examples below, the variable $doc is assumed to already hold the root element of your sample XML.

let $m := local:word-count($doc)
for $k in map:keys($m)
return text { $k, map:get($m, $k) }

inside 1
This 2
is 2
paragraph 1
highlighted 1
EXAMPLE 1
header 1
are 1
word 1
words 1
these 1
tag 1
And 1
a 2

Note that the order of the map keys is indeterminate. Add an order by clause if you like.

let $m := local:word-count($doc)
for $k in map:keys($m)
let $v := map:get($m, $k)
order by $v descending
return text { $k, $v }

If you want to query the entire database, Geert's solution using cts:words might look pretty good. It uses a lexicon for the word list, and some index lookups for word matching. But it will end up walking the XML for every matching document for every word-lexicon word: O(nm). To do that properly the code will have to do work similar to what local:word-count does, but for one word at a time. Many words will match the same documents: 'the' might be in A and B, and 'then' might also be in A and B. Despite using lexicons and indexes, usually this approach will be slower than simply applying local:word-count to the whole database.

If you want to query the entire database and are willing to change the XML, you could wrap every word in a word element (or whatever element name you prefer). Then create an element range index of type string on word. Now you can use cts:values and cts:frequency to pull the answer directly from the range index. This will be O(n) with a much lower cost than the cts:words approach, and probably faster than local:word-count, because it won't visit any documents at all. But the resulting XML is pretty clumsy.
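That range-index approach might look something like the sketch below, assuming every word has already been wrapped in a word element with a string range index defined on it. The element name and the "item-frequency" option are assumptions about your setup; without "item-frequency", cts:frequency reports the default fragment frequency (how many fragments contain the word) rather than total occurrences.

for $w in cts:element-values(
  xs:QName("word"), (), "item-frequency")
return text { $w, cts:frequency($w) }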

Let's go back and apply local:word-count to the whole database. Start by tweaking the code so that the caller supplies the map. That way we can build up a single map that has word counts for the whole database, and we only look at each document once.

declare function local:word-count(
  $m as map:map,
  $root as element()*)
as map:map
{
  (: Same counting logic as before, but the caller supplies the
     map, so counts accumulate across calls. The $root parameter
     now accepts element()* so we can pass collection()/*. :)
  let $_ := cts:tokenize(
    $root//text())[. instance of cts:word]
    ! map:put($m, ., 1 + (map:get($m, .), 0)[1])
  return $m
};

let $m := map:map()
let $_ := local:word-count($m, collection()/*)
for $k in map:keys($m)
let $v := map:get($m, $k)
order by $v descending
return text { $k, $v }

On my laptop this processed 151 documents in less than 100 ms. There were about 8100 words and 925 distinct words. Getting the same results from cts:words and cts:search took just under 1 sec. So local:word-count is more efficient, and probably efficient enough for this job.

Now that you can build a word-count map efficiently, what if you could save it? In essence, you'd build your own "index" of word counts. This is easy, because maps have an XML serialization.

(: Construct a map. :)
map:map()
(: The document constructor creates a document-node with XML inside. :)
! document { . }
(: Construct a map from the XML root element. :)
! map:map(*)

So you could call local:word-count on each new XML document as it's inserted or updated. Then store the word-count map in the document's properties. Do this using a CPF pipeline, or using your own code via RecordLoader, or in a REST upload endpoint, etc.
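A minimal sketch of that update step, using the one-argument local:word-count above; the word-counts property name is just an illustration, and the map serializes to its XML form inside the element constructor:

declare function local:save-word-counts($uri as xs:string)
as empty-sequence()
{
  (: Count words in the document, then store the serialized
     map as a property of that document. :)
  xdmp:document-set-property(
    $uri,
    <word-counts>{ local:word-count(doc($uri)/*) }</word-counts>)
};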

When you want word counts for a single document, that's just a call to xdmp:document-properties or xdmp:document-get-properties, then call the map:map constructor on the right XML. If you want word counts for multiple documents, you can easily write XQuery to merge those maps into a single result.
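Such a merge might be sketched like this, summing the counts for each key across any number of maps (nothing MarkLogic-specific here beyond the map API):

declare function local:merge-counts($maps as map:map*)
as map:map
{
  let $result := map:map()
  let $_ :=
    for $m in $maps
    for $k in map:keys($m)
    (: Add this map's count to any count already accumulated. :)
    return map:put(
      $result, $k,
      (map:get($result, $k), 0)[1] + map:get($m, $k))
  return $result
};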




Answer 2:


You would normally use cts:frequency for that purpose. Unfortunately, it can only be applied to values pulled from value lexicons, not to values from word lexicons. So I'm afraid you will have to do the counting manually, unless you can tokenize all words upfront into an element on which you can put a range index. The closest thing I could come up with is:

for $word in cts:words()
let $freq := count(cts:search(doc()//*,$word))
order by $freq descending
return concat($word, ' - ', $freq)

Note: doc() will search across all documents, so that scales badly. But if you are only interested in counts per document, the performance might be good enough for you.



Source: https://stackoverflow.com/questions/25403223/counting-all-unique-words-in-an-unstructured-document-using-index-data
