Question
I have a database of 50,000,000 documents, and I'd like to write the base-uri of each document to a file. Running over the entire 50,000,000 takes too long (the query times out). So I thought I'd use predicates to break the database into more manageable batches, and tried the following to get a handle on its performance:
for $i in ( 49999000 to 50000000 )
return fn:base-uri( /mainDoc[position()=$i] )
But performance was very slow for these 1,000 base URIs; in fact, the query timed out. I tried a similar query and got similar results (or lack of results):
for $i in ( /mainDoc ) [ 49999000 to 50000000 ]
return fn:base-uri( $i )
Is there a more performant method of looping through a large database, where documents at the end of the database are as quick to obtain as those at the beginning?
Answer 1:
If you just want the document URIs, that's easy: ensure you have the URI lexicon enabled and run a cts:uris() call.
To follow your approach of jumping ahead in a document list to do something with each document, you can do the work unfiltered to make it fast:
for $item in cts:search(/mainDoc, cts:and-query(()), "unfiltered")[49999000 to 50000000]
return base-uri($item)
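For the original goal of writing URIs out to a file, a minimal sketch along these lines could save each window of results; the output path /space/uris-batch.txt and the window bounds are hypothetical, and in practice each window would run as its own request so that no single query times out:

xquery version "1.0-ml";

(: hypothetical window and output path; run each window as its own request :)
let $batch := cts:search(/mainDoc, cts:and-query(()), "unfiltered")[49999000 to 50000000]
let $uris  := for $item in $batch return fn:base-uri($item)
return xdmp:save(
  "/space/uris-batch.txt",
  document { text { fn:string-join($uris, "&#10;") } }
)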
The cts:and-query(()) is a shortcut way to pass an always-true query.
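For instance, because cts:and-query(()) matches every fragment, an unfiltered count of all the /mainDoc documents can be taken without retrieving any of them:

(: xdmp:estimate runs against the indexes, so it is cheap even at 50,000,000 documents :)
xdmp:estimate(cts:search(/mainDoc, cts:and-query(())))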
Answer 2:
The most efficient way to use cts:uris would look something like this:
subsequence(cts:uris((), 'limit=50000000'), 49999000)
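fn:subsequence also accepts an optional length argument, so the window can be capped at exactly 1,000 URIs:

subsequence(cts:uris((), 'limit=50000000'), 49999000, 1000)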
It would be even more efficient if you could pass in a start value, but that requires you to know the 49999000th value up-front.
cts:uris($start-value, 'limit=1000')
See http://docs.marklogic.com/cts:uris for more about that function.
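A sketch of that start-value pattern, assuming the URI lexicon is enabled: the last URI of one batch seeds the next call. The start value appears to be inclusive, so later batches fetch one extra URI and drop the first; in practice each batch would run as its own request rather than one long-running loop.

xquery version "1.0-ml";

declare function local:next-batch($start as xs:string?) as xs:string*
{
  if (fn:exists($start))
  then fn:subsequence(cts:uris($start, 'limit=1001'), 2)  (: skip the inclusive start URI :)
  else cts:uris((), 'limit=1000')
};

let $first := local:next-batch(())
let $last  := $first[fn:last()]
return local:next-batch($last)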
Source: https://stackoverflow.com/questions/17685877/using-predicates-on-a-large-database