YARN job appears to have access to less resources than Ambari YARN manager reports


Question


I'm getting confused when trying to run a YARN process and hitting errors. Looking at the YARN section of the Ambari UI, I see... (note it says 60GB available). Yet when trying to run a YARN process, I get errors indicating that fewer resources are available than Ambari is reporting, see...

➜  h2o-3.26.0.2-hdp3.1 hadoop jar h2odriver.jar -nodes 4 -mapperXmx 5g -output /home/ml1/hdfsOutputDir
Determining driver host interface for mapper->driver callback...
    [Possible callback IP address: 192.168.122.1]
    [Possible callback IP address: 172.18.4.49]
    [Possible callback IP address: 127.0.0.1]
Using mapper->driver callback IP address and port: 172.18.4.49:46721
(You can override these with -driverif and -driverport/-driverportrange and/or specify external IP using -extdriverif.)
Memory Settings:
    mapreduce.map.java.opts:     -Xms5g -Xmx5g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Dlog4j.defaultInitOverride=true
    Extra memory percent:        10
    mapreduce.map.memory.mb:     5632
Hive driver not present, not generating token.
19/08/07 12:37:19 INFO client.RMProxy: Connecting to ResourceManager at hw01.ucera.local/172.18.4.46:8050
19/08/07 12:37:19 INFO client.AHSProxy: Connecting to Application History server at hw02.ucera.local/172.18.4.47:10200
19/08/07 12:37:19 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/ml1/.staging/job_1565057088651_0007
19/08/07 12:37:21 INFO mapreduce.JobSubmitter: number of splits:4
19/08/07 12:37:21 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1565057088651_0007
19/08/07 12:37:21 INFO mapreduce.JobSubmitter: Executing with tokens: []
19/08/07 12:37:21 INFO conf.Configuration: found resource resource-types.xml at file:/etc/hadoop/3.1.0.0-78/0/resource-types.xml
19/08/07 12:37:21 INFO impl.YarnClientImpl: Submitted application application_1565057088651_0007
19/08/07 12:37:21 INFO mapreduce.Job: The url to track the job: http://HW01.ucera.local:8088/proxy/application_1565057088651_0007/
Job name 'H2O_80092' submitted
JobTracker job ID is 'job_1565057088651_0007'
For YARN users, logs command is 'yarn logs -applicationId application_1565057088651_0007'
Waiting for H2O cluster to come up...
19/08/07 12:37:38 INFO client.RMProxy: Connecting to ResourceManager at hw01.ucera.local/172.18.4.46:8050
19/08/07 12:37:38 INFO client.AHSProxy: Connecting to Application History server at hw02.ucera.local/172.18.4.47:10200

----- YARN cluster metrics -----
Number of YARN worker nodes: 4

----- Nodes -----
Node: http://HW03.ucera.local:8042 Rack: /default-rack, RUNNING, 1 containers used, 5.0 / 15.0 GB used, 1 / 3 vcores used
Node: http://HW04.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used
Node: http://hw05.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used
Node: http://HW02.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used

----- Queues -----
Queue name:            default
    Queue state:       RUNNING
    Current capacity:  0.08
    Capacity:          1.00
    Maximum capacity:  1.00
    Application count: 1
    ----- Applications in this queue -----
    Application ID:                  application_1565057088651_0007 (H2O_80092)
        Started:                     ml1 (Wed Aug 07 12:37:21 HST 2019)
        Application state:           FINISHED
        Tracking URL:                http://HW01.ucera.local:8088/proxy/application_1565057088651_0007/
        Queue name:                  default
        Used/Reserved containers:    1 / 0
        Needed/Used/Reserved memory: 5.0 GB / 5.0 GB / 0.0 GB
        Needed/Used/Reserved vcores: 1 / 1 / 0

Queue 'default' approximate utilization: 5.0 / 60.0 GB used, 1 / 12 vcores used

----------------------------------------------------------------------

ERROR: Unable to start any H2O nodes; please contact your YARN administrator.

       A common cause for this is the requested container size (5.5 GB)
       exceeds the following YARN settings:

           yarn.nodemanager.resource.memory-mb
           yarn.scheduler.maximum-allocation-mb

----------------------------------------------------------------------

For YARN users, logs command is 'yarn logs -applicationId application_1565057088651_0007'

Note the error:

ERROR: Unable to start any H2O nodes; please contact your YARN administrator.

A common cause for this is the requested container size (5.5 GB) exceeds the following YARN settings:

  yarn.nodemanager.resource.memory-mb
  yarn.scheduler.maximum-allocation-mb

Yet, I have YARN configured with

yarn.scheduler.maximum-allocation-vcores=3
yarn.nodemanager.resource.cpu-vcores=3
yarn.nodemanager.resource.memory-mb=15GB
yarn.scheduler.maximum-allocation-mb=15GB

and we can see that both the per-container and per-node resource limits are higher than the requested container size (5.5 GB).
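As a sanity check, the values actually deployed to each node can be inspected directly; a minimal sketch, assuming the standard HDP client-config path /etc/hadoop/conf and the same clush setup used further down in this post (yarn-site.xml stores these values in MB, so 15 GB should appear as 15360):

# show the deployed memory limits on every node (values in MB)
clush -ab "grep -A1 -e 'yarn.nodemanager.resource.memory-mb' -e 'yarn.scheduler.maximum-allocation-mb' /etc/hadoop/conf/yarn-site.xml"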

Trying to do a heftier calculation with the default MapReduce pi example:

[myuser@HW03 ~]$ yarn jar /usr/hdp/3.1.0.0-78/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 1000 1000
Number of Maps  = 1000
Samples per Map = 1000
....

and checking the RM UI, I can see that it is at least possible in some cases to use all of the RM's 60GB of resources (notice the 61440 MB at the bottom of the image).

So there are some things about this problem that I don't understand:

  1. Queue 'default' approximate utilization: 5.0 / 60.0 GB used, 1 / 12 vcores used

    I would like to use the full 60GB that YARN can ostensibly provide (or at least have the option to, rather than have errors thrown). I would think there should be enough resources for each of the 4 nodes to contribute 15GB (more than the requested 4 x 5.5GB = 22GB) to the process. Am I missing something here? Note that I only have the default root queue set up for YARN.

  2. ----- Nodes -----

    Node: http://HW03.ucera.local:8042 Rack: /default-rack, RUNNING, 1 containers used, 5.0 / 15.0 GB used, 1 / 3 vcores used

    Node: http://HW04.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used

    ....

    Why is only a single node being used before erroring out?

From these two things, it seems that neither the 15GB per-node limit nor the 60GB cluster limit is being exceeded, so why are these errors being thrown? What about this situation am I misinterpreting? What can be done to fix it (again, I would like to be able to use all of the apparent 60GB of YARN resources for the job without errors)? Any debugging suggestions or fixes?
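One debugging step that may help (a sketch only, using the ResourceManager web address from the tracking URL above) is to pull the cluster metrics and scheduler/queue state straight from the ResourceManager REST API and compare them against what Ambari shows:

# total vs. used memory and vcores as the ResourceManager itself sees them
curl -s http://HW01.ucera.local:8088/ws/v1/cluster/metrics

# capacity-scheduler configuration and current usage of the 'default' queue
curl -s http://HW01.ucera.local:8088/ws/v1/cluster/scheduler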

UPDATE:

Problem appears to be related to How to properly change uid for HDP / ambari-created user? and the fact that having a user exist on a node and have a hdfs://user/<username> directory with correct permissions (as I was led to believe from a Hortonworks forum post) is not sufficient for that user to be acknowledged as "existing" on the cluster.

Running the hadoop jar command as a different user (in this case, the Ambari-created hdfs user) that exists on all cluster nodes (even though Ambari created this user with different uids across nodes (IDK if this is a problem)) and has a hdfs://user/hdfs dir, I found that the h2o jar ran as expected.

I was initially under the impression that users only needed to exist on whatever client machine was being used, plus have a hdfs://user/<username> dir (see https://community.cloudera.com/t5/Support-Questions/Adding-a-new-user-to-the-cluster/m-p/130319/highlight/true#M93005). One concerning / confusing thing that has come from this is the fact that Ambari apparently created the hdfs user on the various cluster nodes with differing uid and gid values, e.g....

[root@HW01 ~]# clush -ab id hdfs
---------------
HW[01-04] (4)
---------------
uid=1017(hdfs) gid=1005(hadoop) groups=1005(hadoop),1003(hdfs)
---------------
HW05
---------------
uid=1021(hdfs) gid=1006(hadoop) groups=1006(hadoop),1004(hdfs)
[root@HW01 ~]# 
[root@HW01 ~]#
# wondering what else is using a uid 1021 across the nodes 
[root@HW01 ~]# clush -ab id 1021
---------------
HW[01-04] (4)
---------------
uid=1021(hbase) gid=1005(hadoop) groups=1005(hadoop)
---------------
HW05
---------------
uid=1021(hdfs) gid=1006(hadoop) groups=1006(hadoop),1004(hdfs)

This does not seem like how it is supposed to be (just my suspicion from having worked with MapR (which requires the uids and gids to be the same across nodes) and looking here: https://www.ibm.com/support/knowledgecenter/en/STXKQY_BDA_SHR/bl1adv_userandgrpid.htm). Note that HW05 was a node that was added later. If this is actually fine in HDP, I plan to just add the user I actually intend to use h2o with across all the nodes with whatever arbitrary uid and gid values. Any thoughts on this? Any docs you could link me to that support why this is right or wrong?

Will look into this a bit more before posting an answer. I think I basically need a bit more clarification as to when HDP considers a user to "exist" on a cluster.


Answer 1:


Problem appears to be related to How to properly change uid for HDP / ambari-created user? and the fact that having a user exist on a node and have a hdfs://user/<username> directory with correct permissions (as I was led to believe from a Hortonworks forum post) is not sufficient for that user to be acknowledged as "existing" on the cluster. This jibes with discussions I've had with Hortonworks experts, where they said that the YARN-using user must exist on all of the cluster's datanodes.

Running the hadoop jar command as a different user (in this case, the Ambari-created hdfs user) that exists on all cluster nodes (even though Ambari created this user with different uids across nodes (IDK if this is a problem)) and has a hdfs://user/hdfs dir, I found that the h2o jar ran as expected.
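For context, that test amounted to rerunning the same driver command as the hdfs user; a rough sketch of what it looks like (the output path here is just a placeholder, any HDFS path the hdfs user can write to will do):

# rerun the H2O driver as the Ambari-created hdfs user
sudo -u hdfs hadoop jar h2odriver.jar -nodes 4 -mapperXmx 5g -output /user/hdfs/hdfsOutputDir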

I was initially under the impression that users only needed to exist on whatever client machine was being used, plus have a hdfs://user/<username> dir (see https://community.cloudera.com/t5/Support-Questions/Adding-a-new-user-to-the-cluster/m-p/130319/highlight/true#M93005).
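For reference, the hdfs://user/<username> directory itself can be created from any client node; a minimal sketch for the ml1 user from the question, using the hadoop group seen in the id output below (adjust names as needed):

# create the HDFS home directory for ml1 and hand over ownership
sudo -u hdfs hdfs dfs -mkdir -p /user/ml1
sudo -u hdfs hdfs dfs -chown ml1:hadoop /user/ml1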


Side note:

One concerning / confusing thing that has come from this is the fact that Ambari apparently created the hdfs user on the various cluster nodes with differing uid and gid values, e.g....

[root@HW01 ~]# clush -ab id hdfs
---------------
HW[01-04] (4)
---------------
uid=1017(hdfs) gid=1005(hadoop) groups=1005(hadoop),1003(hdfs)
---------------
HW05
---------------
uid=1021(hdfs) gid=1006(hadoop) groups=1006(hadoop),1004(hdfs)
[root@HW01 ~]# 
[root@HW01 ~]#
# wondering what else is using a uid 1021 across the nodes 
[root@HW01 ~]# clush -ab id 1021
---------------
HW[01-04] (4)
---------------
uid=1021(hbase) gid=1005(hadoop) groups=1005(hadoop)
---------------
HW05
---------------
uid=1021(hdfs) gid=1006(hadoop) groups=1006(hadoop),1004(hdfs)

This does not seem like how it is supposed to be (just my suspicion from having worked with MapR (which requires the uids and gids to be the same across nodes) and looking here: https://www.ibm.com/support/knowledgecenter/en/STXKQY_BDA_SHR/bl1adv_userandgrpid.htm). Note that HW05 was a node that was added later. If this is actually fine in HDP, I plan to just add the user I actually intend to use h2o with across all the nodes with whatever arbitrary uid and gid values. Any thoughts on this? Any docs you could link me to that support why this is right or wrong?
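If consistent uid/gid values do turn out to be needed (or are just preferred), the user can be created uniformly across the nodes with the same clush tooling used above; a rough sketch, where 1050 is an arbitrary placeholder id that should be unused on every node:

# on each node, create the group/user with identical ids if they don't already exist
clush -ab 'id ml1 >/dev/null 2>&1 || (groupadd -g 1050 ml1 && useradd -u 1050 -g 1050 -m ml1)'
# then verify, same as above
clush -ab id ml1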

Looking into this a bit more here: HDFS NFS locations using weird numerical username values for directory permissions



Source: https://stackoverflow.com/questions/57226758/yarn-job-appears-to-have-access-to-less-resources-than-ambari-yarn-manager-repor
