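Last month I set up a Kubernetes (k8s) cluster with NVIDIA GPUs for my team. I am writing the steps down here so that I don't forget them.
This build uses Ubuntu 18.04 Server, Docker 19.03.2, and Kubernetes 1.15.2.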
name | version |
ubuntu server | 18.04 |
docker | 19.03.2 |
k8s | 1.15.2 |
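The NVIDIA GPU driver must be installed before building the cluster; installing the driver is not covered here.
Every node needs a static IP address and explicit DNS settings, otherwise containers may be unable to reach the external network.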
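As a quick pre-flight check for the prerequisites above, the sketch below reports which commands are missing from PATH. The helper name `missing_commands` is made up for this illustration and is not part of the original setup; `nvidia-smi` is the tool that ships with the NVIDIA driver.

```shell
#!/bin/bash
# Hypothetical helper: print each given command that is not on PATH.
# Returns 1 if at least one command is missing, 0 otherwise.
missing_commands() {
    local rc=0 cmd
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd"
            rc=1
        fi
    done
    return "$rc"
}

# Example, run on each node before the install script:
#   missing_commands nvidia-smi curl || echo "install the commands listed above first"
#   nvidia-smi    # lists the GPUs when the driver is loaded
```

An empty result only shows that the tools exist; running `nvidia-smi` itself is what confirms that the driver can actually talk to the GPUs.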
Installation is automated with a shell script. The install.sh file is as follows:
#!/bin/bash
# Install the FTP client
sudo apt-get install -y lftp
# Set the time zone
sudo ln -snf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
sudo bash -c "echo 'Asia/Shanghai' > /etc/timezone"

# Replace the apt sources with the Aliyun mirror, backing up the original first
echo "Replacing apt sources with the Aliyun mirror"
sudo mv /etc/apt/sources.list /etc/apt/sources.list.bak
sudo rm -f /etc/apt/sources.list.save
sudo cp -f sources.list /etc/apt
sudo apt-get update

# Install Docker
sudo apt-get install -y apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://mirrors.ustc.edu.cn/docker-ce/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://mirrors.ustc.edu.cn/docker-ce/linux/ubuntu $(lsb_release -cs) stable"
sudo apt-get update
sudo apt-get install -y docker-ce=5:19.03.2~3-0~ubuntu-bionic docker-ce-cli=5:19.03.2~3-0~ubuntu-bionic

# Install the NVIDIA container runtime; make sure the NVIDIA driver is already installed
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo apt-get install -y nvidia-container-runtime

# Docker configuration file
sudo mkdir -p /etc/docker
sudo cp -f daemon.json /etc/docker
sudo systemctl daemon-reload
sudo systemctl restart docker

# Install the k8s components
curl -s https://mirrors.aliyun.com/kubernetes/apt/doc/apt-key.gpg | sudo apt-key add -
# Use tee so the write happens as root (a plain "sudo echo ... > file" would not)
echo "deb https://mirrors.aliyun.com/kubernetes/apt/ kubernetes-xenial main" | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt-get update
sudo apt-get install -y kubelet=1.15.2-00 kubeadm=1.15.2-00 kubectl=1.15.2-00
# apt-mark hold takes package names; the installed versions are the ones pinned
sudo apt-mark hold kubelet kubeadm kubectl
sudo mkdir -p /etc/systemd/system/kubelet.service.d
sudo cp -f 10-kubeadm.conf /etc/systemd/system/kubelet.service.d/
# Reload systemd so the kubelet drop-in is picked up
sudo systemctl daemon-reload

# DNS settings
sudo cp -f resolved.conf /etc/systemd/resolved.conf
# Restart the resolver so the new DNS servers take effect
sudo systemctl restart systemd-resolved
That is the whole install script. The Aliyun apt sources file it copies is as follows:
#sources.list
deb http://mirrors.aliyun.com/ubuntu/ bionic main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ bionic main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ bionic-security main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ bionic-security main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ bionic-updates main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ bionic-updates main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ bionic-backports main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ bionic-backports main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ bionic-proposed main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ bionic-proposed main restricted universe multiverse
The Docker daemon.json file is as follows:
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "registry-mirrors": [
    "http://hub-mirror.c.163.com",
    "https://registry.docker-cn.com",
    "https://docker.mirrors.ustc.edu.cn",
    "https://pee6w651.mirror.aliyuncs.com"
  ],
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
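Once Docker has been restarted with this daemon.json, it can be worth confirming that the daemon really picked up the `nvidia` default runtime and the `systemd` cgroup driver. The sketch below is one way to do that; the function name `check_docker_settings` is an assumption of this example, and it takes plain strings so that the comparison itself does not depend on a running daemon.

```shell
#!/bin/bash
# Hypothetical helper: compare the runtime and cgroup driver reported by
# Docker with the values configured in daemon.json.
# Prints one line per mismatch and returns 1 if anything differs.
check_docker_settings() {
    local runtime="$1" cgroup="$2" rc=0
    if [ "$runtime" != "nvidia" ]; then
        echo "default runtime is '$runtime', expected 'nvidia'"
        rc=1
    fi
    if [ "$cgroup" != "systemd" ]; then
        echo "cgroup driver is '$cgroup', expected 'systemd'"
        rc=1
    fi
    return "$rc"
}

# Example, after "systemctl restart docker":
#   check_docker_settings \
#       "$(sudo docker info --format '{{.DefaultRuntime}}')" \
#       "$(sudo docker info --format '{{.CgroupDriver}}')" && echo "docker OK"
```

The cgroup driver matters because the kubelet and Docker have to agree on it, and the default runtime is what lets ordinary pods see the GPU devices.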
The kubeadm drop-in configuration file for the kubelet, 10-kubeadm.conf, is as follows:
# Note: This dropin only works with kubeadm and kubelet v1.11+
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --runtime-cgroups=/systemd/system.slice --kubelet-cgroups=/systemd/system.slice"
Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"
# This is a file that "kubeadm init" and "kubeadm join" generates at runtime, populating the KUBELET_KUBEADM_ARGS variable dynamically
EnvironmentFile=-/var/lib/kubelet/kubeadm-flags.env
# This is a file that the user can use for overrides of the kubelet args as a last resort. Preferably, the user should use
# the .NodeRegistration.KubeletExtraArgs object in the configuration files instead. KUBELET_EXTRA_ARGS should be sourced from this file.
EnvironmentFile=-/etc/default/kubelet
ExecStart=
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS
On Ubuntu 18.04 the static IP is configured through netplan. The file is /etc/netplan/50-cloud-init.yaml, in the following format:
# This file is generated from information provided by
# the datasource. Changes to it will not persist across an instance.
# To disable cloud-init's network configuration capabilities, write a file
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with the following:
# network: {config: disabled}
network:
    ethernets:
        enp4s0:
            dhcp4: no
            addresses: [10.254.18.6/24]
            gateway4: 10.254.18.1
    version: 2
The DNS configuration file, resolved.conf, has the following format:
# This file is part of systemd.
#
# systemd is free software; you can redistribute it and/or modify it
# under the terms of the GNU Lesser General Public License as published by
# the Free Software Foundation; either version 2.1 of the License, or
# (at your option) any later version.
#
# Entries in this file show the compile time defaults.
# You can change settings by editing this file.
# Defaults can be restored by simply deleting this file.
#
# See resolved.conf(5) for details

[Resolve]
DNS=192.168.110.213 114.114.114.114
#FallbackDNS=
#Domains=
LLMNR=no
#MulticastDNS=no
#DNSSEC=no
#Cache=yes
#DNSStubListener=yes
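To double-check which DNS servers a resolved.conf actually sets, a small sketch such as the following can help. The function name `configured_dns` is made up for this example; it only reads the active `DNS=` line and skips commented-out entries.

```shell
#!/bin/bash
# Hypothetical helper: print the DNS servers from the active "DNS=" line of a
# resolved.conf-style file, one server per line. Lines starting with '#'
# (such as "#FallbackDNS=") do not match and are ignored.
configured_dns() {
    sed -n 's/^DNS=//p' "$1" | tr -s ' ' '\n'
}

# Example:
#   configured_dns /etc/systemd/resolved.conf
#   sudo systemctl restart systemd-resolved
#   systemd-resolve --status    # shows the DNS servers currently in use
```

Comparing the output of the first and last commands is a simple way to see whether the resolver has picked up the edited file.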
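Put the install.sh script, the Aliyun sources.list file, Docker's daemon.json, the kubelet drop-in 10-kubeadm.conf, the static IP file 50-cloud-init.yaml, and the DNS file resolved.conf in the same directory, then run bash install.sh to install everything automatically. Note that the script as listed does not copy 50-cloud-init.yaml, so the static IP configuration still has to be placed under /etc/netplan separately.
To install other software versions, simply edit the version numbers in the script.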
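Because install.sh copies its companion files with `cp -f` from the current directory, a missing file only surfaces halfway through the run. The sketch below checks for them up front; the helper name `missing_files` is an assumption of this example, and the file list is taken from the `cp` commands in the script.

```shell
#!/bin/bash
# Hypothetical helper: print every listed file that is absent from the given
# directory. Returns 1 if at least one file is missing, 0 otherwise.
missing_files() {
    local dir="$1" rc=0 f
    shift
    for f in "$@"; do
        if [ ! -f "$dir/$f" ]; then
            echo "$f"
            rc=1
        fi
    done
    return "$rc"
}

# Example: only start the installation when all copied files are present.
#   missing_files . sources.list daemon.json 10-kubeadm.conf resolved.conf \
#       && bash install.sh
```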
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
The steps above must be run on every machine. To initialize the k8s cluster and join nodes to it, follow the article at https://blog.csdn.net/shykevin/article/details/98811021 , but note one point in that article:
sudo kubeadm init --image-repository registry.aliyuncs.com/google_containers --kubernetes-version v1.15.2 --pod-network-cidr=192.169.0.0/16
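Here pod-network-cidr is set to 192.169.0.0/16, so when adding the Calico network plugin the pool CIDR in the Calico manifest (http://mirror.faasx.com/k8s/calico/v3.3.2/calico.yaml) has to be changed to match. Change
- name: CALICO_IPV4POOL_CIDR
  value: "192.168.0.0/16"
to:
- name: CALICO_IPV4POOL_CIDR
  value: "192.169.0.0/16"
Otherwise, containers will not be able to reach the external network.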
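The manual edit of the Calico manifest can also be scripted. The following is a minimal sketch under the assumption that the manifest has been downloaded to a local file; the function name `set_calico_cidr` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: replace the quoted pool CIDR in a Calico manifest.
# Usage: set_calico_cidr <manifest> <old-cidr> <new-cidr>
set_calico_cidr() {
    local file="$1" old="$2" new="$3"
    # Escape the dots so they match literally in the sed pattern.
    old=$(printf '%s' "$old" | sed 's/\./\\./g')
    # '#' is used as the sed delimiter because CIDRs contain '/'.
    sed "s#\"$old\"#\"$new\"#" "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

# Example:
#   curl -fsSLO http://mirror.faasx.com/k8s/calico/v3.3.2/calico.yaml
#   set_calico_cidr calico.yaml 192.168.0.0/16 192.169.0.0/16
#   kubectl apply -f calico.yaml
```

Matching the CIDR together with its surrounding quotes keeps the substitution from touching other places where a similar address might appear.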
For GPU support the cluster uses nvidia-device-plugin, installed as follows:
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/master/nvidia-device-plugin.yml
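After the device plugin is running, GPUs are exposed to pods as the `nvidia.com/gpu` resource. The sketch below generates a small test pod that requests GPUs and runs `nvidia-smi`. The function name `gpu_test_pod`, the pod name, and the `nvidia/cuda:10.1-base` image tag are all assumptions made for this illustration; substitute an image that is available in your environment.

```shell
#!/bin/bash
# Hypothetical helper: print a minimal Pod manifest requesting <count> GPUs.
# Usage: gpu_test_pod <pod-name> <gpu-count>
gpu_test_pod() {
    local name="$1" count="$2"
    cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: $name
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:10.1-base   # assumed image tag, adjust as needed
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: $count
EOF
}

# Example:
#   kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
#   gpu_test_pod gpu-smoke-test 1 | kubectl apply -f -
#   kubectl logs gpu-smoke-test
```

If the pod stays Pending, the node most likely does not advertise any `nvidia.com/gpu` capacity yet, which the first example command makes visible.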
Reference: https://feisky.gitbooks.io/kubernetes/content/plugins/device.html
Source: https://www.cnblogs.com/tiny1987/p/12015866.html