I have a large collection of about 1 million documents in MarkLogic (version 9), with the structure below:
Various external tools, such as Corb2 and MLCP, can be used for this, but you can also do ad hoc (or less ad hoc) work from inside MarkLogic. Essentially, all you need to do is run your processing in batches. Taskbot is very useful for that:
https://github.com/mblakele/taskbot
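To make the batching idea concrete, here is a minimal, language-neutral sketch in Python. It is not Taskbot's API or any MarkLogic call. The URI pattern, the batch size of 500, and the `handle_batch` callback are all hypothetical placeholders for whatever actually reads or updates your documents.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batches(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` items from `items`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def process_in_batches(
    uris: Iterable[str],
    size: int,
    handle_batch: Callable[[List[str]], None],
) -> int:
    """Call `handle_batch` once per batch and return the number of batches."""
    count = 0
    for batch in batches(uris, size):
        # Hypothetical hook: in practice this would be one transaction
        # or one spawned task that touches only the documents in `batch`.
        handle_batch(batch)
        count += 1
    return count


if __name__ == "__main__":
    # Hypothetical URIs standing in for the ~1 million documents.
    uris = (f"/docs/{i}.xml" for i in range(1_000_000))
    sizes: List[int] = []
    total = process_in_batches(uris, 500, lambda b: sizes.append(len(b)))
    print(total, "batches processed")
```

The point of the design is that each batch is small enough to finish quickly on its own, so no single request has to hold the whole collection at once.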
HTH!