Question
I've been working with H2O for the last year, and I am getting very tired of server crashes. I have given up on "nightly releases", as they are easily crashed by my data sets. Please tell me where I can download a release that is stable.
Charles
My environment is:
- Windows 10 enterprise, build 1607, with 64 GB memory.
- Java SE Development Kit 8 Update 77 (64-bit).
- Anaconda Python 3.6.2-0.
I started the server with:
localH2O = h2o.init(ip="localhost",
                    port=54321,
                    max_mem_size="12G",
                    nthreads=4)
The h2o init information is:
H2O cluster uptime: 12 hours 12 mins
H2O cluster version: 3.10.5.2
H2O cluster version age: 1 month and 6 days
H2O cluster name: H2O_from_python_Charles_ji1ndk
H2O cluster total nodes: 1
H2O cluster free memory: 6.994 Gb
H2O cluster total cores: 8
H2O cluster allowed cores: 4
H2O cluster status: locked, healthy
H2O connection url: http://localhost:54321
H2O connection proxy:
H2O internal security: False
Python version: 3.6.2 final
The crash information is:
OSError: Job with key $03017f00000132d4ffffffff$_a0ce9b2c855ea5cff1aa58d65c2a4e7c failed with an exception: java.lang.AssertionError: I am really confused about the heap usage; MEM_MAX=11453595648 heapUsedGC=11482667352
stacktrace:
java.lang.AssertionError: I am really confused about the heap usage; MEM_MAX=11453595648 heapUsedGC=11482667352
at water.MemoryManager.set_goals(MemoryManager.java:97)
at water.MemoryManager.malloc(MemoryManager.java:265)
at water.MemoryManager.malloc(MemoryManager.java:222)
at water.MemoryManager.arrayCopyOfRange(MemoryManager.java:291)
at water.AutoBuffer.expandByteBuffer(AutoBuffer.java:719)
at water.AutoBuffer.putA4f(AutoBuffer.java:1355)
at hex.deeplearning.Storage$DenseRowMatrix$Icer.write129(Storage$DenseRowMatrix$Icer.java)
at hex.deeplearning.Storage$DenseRowMatrix$Icer.write(Storage$DenseRowMatrix$Icer.java)
at water.Iced.write(Iced.java:61)
at water.AutoBuffer.put(AutoBuffer.java:771)
at water.AutoBuffer.putA(AutoBuffer.java:883)
at hex.deeplearning.DeepLearningModelInfo$Icer.write128(DeepLearningModelInfo$Icer.java)
at hex.deeplearning.DeepLearningModelInfo$Icer.write(DeepLearningModelInfo$Icer.java)
at water.Iced.write(Iced.java:61)
at water.AutoBuffer.put(AutoBuffer.java:771)
at hex.deeplearning.DeepLearningModel$Icer.write105(DeepLearningModel$Icer.java)
at hex.deeplearning.DeepLearningModel$Icer.write(DeepLearningModel$Icer.java)
at water.Iced.write(Iced.java:61)
at water.Iced.asBytes(Iced.java:42)
at water.Value.<init>(Value.java:348)
at water.TAtomic.atomic(TAtomic.java:22)
at water.Atomic.compute2(Atomic.java:56)
at water.Atomic.fork(Atomic.java:39)
at water.Atomic.invoke(Atomic.java:31)
at water.Lockable.unlock(Lockable.java:181)
at water.Lockable.unlock(Lockable.java:176)
at hex.deeplearning.DeepLearning$DeepLearningDriver.trainModel(DeepLearning.java:491)
at hex.deeplearning.DeepLearning$DeepLearningDriver.buildModel(DeepLearning.java:311)
at hex.deeplearning.DeepLearning$DeepLearningDriver.computeImpl(DeepLearning.java:216)
at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:173)
at hex.deeplearning.DeepLearning$DeepLearningDriver.compute2(DeepLearning.java:209)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1349)
at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
Answer 1:
You need a bigger boat.
The error message says heapUsedGC=11482667352, which is higher than MEM_MAX. Instead of giving max_mem_size="12G", why not give H2O more of the 64 GB you have? Or build a less ambitious model (fewer hidden nodes, less training data, something like that).
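The numbers in the assertion bear this out: the post-GC heap usage exceeded the configured cap. A quick sanity check in plain Python, using the two values copied from the stack trace:

```python
# Values taken from the AssertionError in the stack trace.
MEM_MAX = 11_453_595_648       # usable heap under max_mem_size="12G"
heap_used_gc = 11_482_667_352  # heap usage measured after garbage collection

# The assertion fires because even after GC the heap exceeded the cap.
assert heap_used_gc > MEM_MAX

overshoot_mib = (heap_used_gc - MEM_MAX) / 1024**2
print(f"Heap overshoot: {overshoot_mib:.0f} MiB")  # Heap overshoot: 28 MiB
```

The model only overshot the cap by about 28 MiB, but once H2O cannot allocate within MEM_MAX it gives up. With 64 GB of physical RAM, something like max_mem_size="48G" (leaving headroom for the OS and other processes) would be a reasonable starting point.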
(Obviously, ideally, h2o shouldn't be crashing, and should instead be gracefully aborting when it gets close to using all the available memory. If you are able to share your data/code with H2O, it might be worth opening a bug report on their JIRA.)
BTW, I've been running h2o 3.10.x.x as the back-end for a web server process for 9 months or so, automatically restarting it at weekends, and haven't had a single crash. Well, I did have one - after I left it running for 3 weeks and it filled up all the memory with more and more data and models. That is why I switched to restarting it weekly and keeping only the models I need in memory. (This is on an AWS instance with 4 GB of memory, by the way; the restarts are done by a cron job and bash commands.)
Answer 2:
You can always download the latest stable release from https://www.h2o.ai/download (there's a link labeled "latest stable release"). The latest stable Python package can be downloaded via PyPI and Anaconda; the latest stable R package is available on CRAN.
I agree with Darren that you probably need more memory -- if there is enough memory in your H2O cluster, H2O should not crash. We usually say that you should have a cluster that's at least 3-4x the size of your training set on disk in order to train a model. However, if you are building a grid of models, or many models, you will need to increase the memory so that you have enough RAM to store all those models as well.
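That 3-4x rule of thumb is easy to turn into a quick sizing estimate. A minimal sketch - the function name and defaults are mine, not an official H2O utility:

```python
def suggested_cluster_memory_gb(training_set_gb: float, factor: float = 4.0) -> float:
    """Rule-of-thumb H2O cluster sizing: 3-4x the on-disk training set size.

    Use factor=3.0 for a lower bound, 4.0 for the safer end; grids or many
    models kept in memory will need more than this estimate.
    """
    return training_set_gb * factor

# A 12 GB training set suggests roughly 36-48 GB of cluster memory.
print(suggested_cluster_memory_gb(12, factor=3.0))  # 36.0
print(suggested_cluster_memory_gb(12))              # 48.0
```

By that yardstick, max_mem_size="12G" only comfortably supports a training set of around 3 GB on disk, before accounting for any stored models.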
Source: https://stackoverflow.com/questions/45333883/h2o-server-crash