MarkLogic - java heap space error while importing with mlcp

Question

MarkLogic version: 9.0-6.2; mlcp version: 9.0.6

I am trying to import an XML file into MarkLogic using mlcp with the script below.

#!/bin/bash
mlcp.sh import -ssl \
-host localhost \
-port 8010 \
-username uname \
-password pword \
-mode local \
-input_file_path /data/testsource/*.XML \
-input_file_type documents \
-aggregate_record_namespace "http://new.webservice.namespace" \
-output_collections testcol \
-output_uri_prefix /testuri/ \
-transform_module /ext/ingesttransform.sjs

The script runs successfully with a small file but fails with a 'Java heap space' error when run with a large file (450 MB).

ERROR contentpump.MultithreadedMapper: Error closing writer: Java heap space

How could we resolve this error?

Answer 1:

You can pass Java heap settings through to mlcp using the JVM_OPTS environment variable. Run java -X to see a list of the available options. I typically use these:

    -Xms<size>        set initial Java heap size
    -Xmx<size>        set maximum Java heap size
    -Xss<size>        set java thread stack size

You could invoke your script or MLCP like this:

JVM_OPTS="-Xmx1g" mlcp.sh ...
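
As a minimal sketch of the same idea, the heap settings can also be exported at the top of the import script instead of being prefixed to the command. The 512m and 2g values are illustrative assumptions, not figures from this answer. Because the job runs with -mode local, the heap that matters is that of the JVM mlcp starts on the machine running the script.

```shell
#!/bin/bash
# Illustrative heap sizes (assumptions): start at 512 MB, allow up to 2 GB.
export JVM_OPTS="-Xms512m -Xmx2g"
echo "JVM_OPTS=${JVM_OPTS}"

# Optional: show the maximum heap the local JVM picks when no -Xmx is given.
# The java launcher itself does not read JVM_OPTS, so this reports the default.
if command -v java >/dev/null 2>&1; then
  java -XX:+PrintFlagsFinal -version 2>/dev/null | grep -i 'MaxHeapSize' || true
fi

# The mlcp.sh call from the question would follow here and inherit JVM_OPTS.
```

Comparing the default MaxHeapSize with the size of the input file gives a rough idea of whether raising -Xmx alone is likely to be enough.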

HTH!

Answer 2:

The mlcp job as written sends the whole input file, about 500 MB, into the transform module as one single document (-input_file_type documents). The transform module then splits it, emitting a URI and value (content.uri and content.value) for each aggregate element. This results in the Java heap space error even though the heap space available on the server is around 3.4 GB.

I tried two alternative designs, and both work:

1. Add aggregation in mlcp (-input_file_type aggregates, -aggregate_record_element CustId) to split the file into multiple documents. This creates multiple documents in the staging DB.
2. Keep -input_file_type as documents and remove -transform_module, so the file is loaded as one single document into staging.

Both approaches work, but the second can create documents as large as 500 MB (I believe the size limit is 512 MB). So I chose the first approach (I also need a better URI than the default one created by mlcp).
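
The first approach might look roughly like the sketch below. It reuses the placeholder host, port, credentials, paths, and namespace from the question and the CustId record element named above. Dropping -transform_module is an assumption here, on the grounds that mlcp now does the splitting; add it back if the transform is still needed for other work. The script only prints the command when mlcp.sh is not on the PATH.

```shell
#!/bin/bash
# Build the mlcp arguments; the glob is quoted so mlcp expands it itself.
set -- import -ssl \
  -host localhost \
  -port 8010 \
  -username uname \
  -password pword \
  -mode local \
  -input_file_path '/data/testsource/*.XML' \
  -input_file_type aggregates \
  -aggregate_record_element CustId \
  -aggregate_record_namespace "http://new.webservice.namespace" \
  -output_collections testcol \
  -output_uri_prefix /testuri/

# Run the import if mlcp is installed; otherwise show what would be run.
if command -v mlcp.sh >/dev/null 2>&1; then
  mlcp.sh "$@"
else
  echo "mlcp.sh not found on PATH; would run: mlcp.sh $*"
fi
```

With aggregates, mlcp reads the file as a stream of records rather than holding it as one document, which is why this design avoids the heap pressure. For more meaningful URIs, the -uri_id option can name a child element whose value becomes the document URI.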

Answer 3:

To clarify loading a single large document versus many documents: it depends on your input. If your input file is one large document, it is loaded without splitting unless you specify an XML element or JSON property to split on. For instance, a phoneBook.xml with 100,000 entries, or a big phone: [ ] JSON array, should be split up.

However, if your document is already split up into many records (typically CSV or other text formats) then you don't need to specify how to split it, because the format uses newlines to separate records and mlcp knows this.
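
For the already-split case, a hedged sketch of a delimited-text import is shown below. The file name customers.csv and the CustId column used for URIs are hypothetical; the connection settings are the placeholders from the question. No record element is given because each line of the file becomes one document.

```shell
#!/bin/bash
# customers.csv and the CustId column are hypothetical examples.
set -- import -ssl \
  -host localhost \
  -port 8010 \
  -username uname \
  -password pword \
  -mode local \
  -input_file_path /data/testsource/customers.csv \
  -input_file_type delimited_text \
  -uri_id CustId \
  -output_collections testcol \
  -output_uri_prefix /testuri/

# Run the import if mlcp is installed; otherwise show what would be run.
if command -v mlcp.sh >/dev/null 2>&1; then
  mlcp.sh "$@"
else
  echo "mlcp.sh not found on PATH; would run: mlcp.sh $*"
fi
```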

Source: https://stackoverflow.com/questions/54679739/marklogic-java-heap-space-error-while-importing-with-mlcp

Tags

marklogic

marklogic-9

mlcp