Question
I have a database of 50,000,000 documents, and I'd like to write the base-uri of each document to a file. Running over the entire 50,000,000 takes too long (the query times out). So I thought I'd use predicates to break the database into more manageable batches, and tried the following to get a handle on its performance:
for $i in ( 49999000 to 50000000 )
return fn:base-uri( /mainDoc[position()=$i] )
But performance was very slow for these 1,000 base URIs; in fact, the query timed out. I tried a similar query and got similar results (or lack of results):
for $i in ( /mainDoc ) [ 49999000 to 50000000 ]
return fn:base-uri( $i )
Is there a more performant method of looping through a large database, where documents at the end of the database are as quick to obtain as those at the beginning?
Answer 1:
If you just want the document URIs, that's easy: ensure you have the URI lexicon enabled and run a cts:uris() call.
To follow your approach of jumping ahead in a document list to do something with each document, you can do the work unfiltered to make it fast:
for $item in cts:search(/mainDoc, cts:and-query(()), "unfiltered")[49999000 to 50000000]
return base-uri($item)
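For the original goal of writing URIs out to a file, a minimal sketch along these lines could save each window of results; the output path /space/uris-batch.txt and the window bounds are hypothetical, and in practice each window would run as its own request so that no single query times out:

xquery version "1.0-ml";

(: hypothetical window and output path; run each window as its own request :)
let $batch := cts:search(/mainDoc, cts:and-query(()), "unfiltered")[49999000 to 50000000]
let $uris  := for $item in $batch return fn:base-uri($item)
return xdmp:save(
  "/space/uris-batch.txt",
  document { text { fn:string-join($uris, "&#10;") } }
)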
The cts:and-query(()) is a shortcut way to pass an always-true query.
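For instance, because cts:and-query(()) matches every fragment, an unfiltered count of all the /mainDoc documents can be taken without retrieving any of them:

(: xdmp:estimate runs against the indexes, so it is cheap even at 50,000,000 documents :)
xdmp:estimate(cts:search(/mainDoc, cts:and-query(())))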
Answer 2:
The most efficient way to use cts:uris would look something like this:
subsequence(cts:uris((), 'limit=50000000'), 49999000)
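fn:subsequence also accepts an optional length argument, so the window can be capped at exactly 1,000 URIs:

subsequence(cts:uris((), 'limit=50000000'), 49999000, 1000)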
It would be even more efficient if you could pass in a start value, but that requires you to know the 49999000th value up-front.
cts:uris($start-value, 'limit=1000')
See http://docs.marklogic.com/cts:uris for more about that function.
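A sketch of that start-value pattern, assuming the URI lexicon is enabled: the last URI of one batch seeds the next call. The start value appears to be inclusive, so later batches fetch one extra URI and drop the first; in practice each batch would run as its own request rather than one long-running loop.

xquery version "1.0-ml";

declare function local:next-batch($start as xs:string?) as xs:string*
{
  if (fn:exists($start))
  then fn:subsequence(cts:uris($start, 'limit=1001'), 2)  (: skip the inclusive start URI :)
  else cts:uris((), 'limit=1000')
};

let $first := local:next-batch(())
let $last  := $first[fn:last()]
return local:next-batch($last)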
Source: https://stackoverflow.com/questions/17685877/using-predicates-on-a-large-database