6D Pose Estimation from Scratch — Beginner Paper-Reading Notes — Learning Analysis-by-Synthesis for 6D Pose Estimation in RGB-D Images

Submitted by 人盡茶涼 on 2020-05-01 01:41:06

We have finally arrived at the first paper in this series that uses a CNN for pose estimation. This paper builds on 2014_Learning 6D Object Pose Estimation using 3D Object Coordinates (which we read earlier). In the 2014 paper, the authors used a random forest to store, for each pixel, its likely coordinates on the object model and the object it likely belongs to; these pixel-level predictions were then combined into a synthesized estimate, and an energy function measuring the error between the estimate rendered from a pose and the actual observation was used to optimize the pose. The main contribution of the present paper is to keep the earlier random forest but replace the energy-function part with a CNN: the CNN compares the template-rendered result against the actual observation and produces an energy value, which is then used to refine the pose. Analysis-by-Synthesis: compare the observation with the output of a forward process, such as a rendered image of the object of interest in a particular pose. The authors propose an approach that "learns to compare", while taking these difficulties (occlusion, complicated sensor noise) into account. This is done by describing the posterior density of a particular object pose with a CNN that compares an observed and a rendered image.

  1. The Pose Estimation Task: The goal is to estimate the pose $H$ of a rigid object from a set of observations denoted by $x$. Each pose $H=(R,T)$ combines two components: the rotational component $R$ is a $3\times3$ matrix describing the rotation around the center of the object, and the translational component $T$ is a 3D vector giving the position of the object center in the camera coordinate system (a tiny numeric illustration follows after this list).
  2. Probabilistic Model: The posterior distribution of the pose $H$ given the observations $x$ is written as a Gibbs distribution: $p(H|x;\theta)=\frac{\exp(-E(H,x;\theta))}{\int\exp(-E(\hat{H},x;\theta))\,d\hat{H}}$, where $E(H,x;\theta)$ is the so-called energy function. It maps a pose $H$ and the observed images $x$ to a real number and is parametrized by the vector $\theta$. We implement it with a CNN that directly outputs the energy value; $\theta$ holds the weights of the CNN. (The intractable normalizer cancels whenever two poses are compared; see the derivation after this list.)
  3. Convolutional Neural Network: We first render the object in pose $H$ to obtain rendered images $r(H)$. The CNN compares $x$ with $r(H)$ and outputs a value $f(x,r(H);\theta)$, and the energy function is defined as $E(H,x;\theta)=f(x,r(H);\theta)$. The network is trained to assign low energy values when there is large agreement between observed images and renderings. All rendered and observed images are fed into the CNN as separate input channels, and only a square window around the center of the object with pose $H$ is considered. For performance reasons, windows larger than 100x100 pixels are downsampled to this size. The paper then specifies the CNN architecture (a minimal sketch of such a comparison network follows after this list).
  4. Maximum Likelihood Training: In training we want to find an optimal set of parameters $\theta^*$ based on labeled training data $L=\{(x_1,H_1),\dots,(x_n,H_n)\}$, where $x_i$ denotes the observations of the $i$-th training image and $H_i$ the corresponding ground-truth pose. Applying the maximum likelihood paradigm, we define $\theta^*=\arg\max_\theta\sum_{i=1}^n\ln p(H_i|x_i;\theta)$. Training uses stochastic gradient descent, with the derivative with respect to each parameter $\theta_j$ given by $\frac{\partial}{\partial\theta_j}\ln p(H_i|x_i;\theta)=-\frac{\partial}{\partial\theta_j}E(H_i,x_i;\theta)+\mathbb{E}\!\left[\frac{\partial}{\partial\theta_j}E(H,x_i;\theta)\,\middle|\,x_i;\theta\right]$, where $\mathbb{E}[\cdot|x_i;\theta]$ stands for the conditional expectation with respect to the posterior distribution $p(H|x_i;\theta)$.
     Sampling: the expected value is approximated by a set of pose samples, $\mathbb{E}\!\left[\frac{\partial}{\partial\theta_j}E(H,x_i;\theta)\,\middle|\,x_i;\theta\right]\approx\frac{1}{N}\sum_{k=1}^N\frac{\partial}{\partial\theta_j}E(H_k,x_i;\theta)$, where $H_1,\dots,H_N$ are pose samples drawn independently from the posterior $p(H|x_i;\theta)$ with the current parameters $\theta$. The Metropolis algorithm generates a sequence of samples $H_t$ by repeating two steps: 1. draw a new proposed sample $H'$ according to a proposal distribution $Q(H'|H_t)$, which has to be symmetric; 2. accept or reject the proposed sample according to an acceptance probability $A(H'|H_t)=\min\!\left(1,\frac{p(H'|x;\theta)}{p(H_t|x;\theta)}\right)$: if the proposed sample is accepted, set $H_{t+1}=H'$; otherwise $H_{t+1}=H_t$.
     Proposal Distribution: $Q(H'|H_t)$ is defined implicitly by describing a sampling procedure and ensuring that it is symmetric. The translational component $T'$ of the proposed sample is drawn directly from a 3D isotropic normal distribution $\mathcal{N}(T_t,\Sigma_T)$ centered at the translational component $T_t$ of the current sample $H_t$. The rotational component $R'$ of the proposed sample $H'$ is generated by applying a random rotation $\hat{R}$ to the rotational component $R_t$ of the current sample: $R'=\hat{R}R_t$, where $\hat{R}$ is the rotation matrix corresponding to an Euler vector $e\sim\mathcal{N}(0,\Sigma_R)$ drawn from a 3D zero-centered isotropic normal distribution.
     Initialization and Burn-in Phase: to find a good initialization, the inference procedure is run with the current parameter set. The Metropolis algorithm is then performed for a total of 130 iterations, disregarding the samples from the first 30 iterations, which are considered the burn-in phase. (Sketches of the sampler and of one training step follow after this list.)
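
A tiny numeric illustration of the pose parametrization in item 1 (the function name is mine, not the paper's): a pose $H=(R,T)$ takes a point given in object-centered coordinates to camera coordinates via $x_{cam}=R\,x_{obj}+T$.

```python
import numpy as np

def apply_pose(R, T, p_obj):
    """Map a point from object-centered coordinates to camera coordinates:
    x_cam = R @ p_obj + T, with R a 3x3 rotation matrix and T a 3-vector."""
    return R @ p_obj + T

# Identity rotation, object center 1 m in front of the camera:
R = np.eye(3)
T = np.array([0.0, 0.0, 1.0])
print(apply_pose(R, T, np.array([0.05, 0.0, 0.0])))  # -> [0.05 0.   1.  ]
```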
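
A step worth spelling out about the Gibbs posterior in item 2: the normalizing integral in the denominator does not depend on $H$, so it cancels whenever two poses are compared under the same observation. This is exactly what the Metropolis acceptance probability in item 4 exploits:

$$\frac{p(H'|x;\theta)}{p(H_t|x;\theta)}=\frac{\exp(-E(H',x;\theta))}{\exp(-E(H_t,x;\theta))}=\exp\big(E(H_t,x;\theta)-E(H',x;\theta)\big),$$

so $A(H'|H_t)=\min\big(1,\exp(E(H_t,x;\theta)-E(H',x;\theta))\big)$ can be evaluated from two CNN forward passes alone, without ever computing the intractable integral.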
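
For item 3, a minimal sketch of a comparison network, assuming PyTorch. The paper fixes only the input convention (observed and rendered windows stacked as separate channels, windows larger than 100x100 downsampled); the layer sizes below are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class EnergyCNN(nn.Module):
    """Sketch of f(x, r(H); theta): observed and rendered images enter as
    separate channels; the output is a single scalar energy value."""
    def __init__(self, in_channels):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling -> fixed-size vector
        )
        self.head = nn.Linear(64, 1)  # one real number: the energy

    def forward(self, observed, rendered):
        z = torch.cat([observed, rendered], dim=1)  # separate input channels
        return self.head(self.features(z).flatten(1)).squeeze(1)

# E.g. an RGB-D observation (4 channels) plus an RGB-D rendering (4 channels):
cnn = EnergyCNN(in_channels=8)
obs, ren = torch.randn(1, 4, 100, 100), torch.randn(1, 4, 100, 100)
energy = cnn(obs, ren)  # trained so that good agreement gives low energy
```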
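
The sampler from item 4 in code form, as a sketch assuming NumPy and SciPy; `energy` stands in for the trained CNN evaluated at a fixed observation. Because the posterior is a Gibbs distribution, the acceptance ratio reduces to a difference of energies (see the derivation above).

```python
import numpy as np
from scipy.spatial.transform import Rotation

def metropolis_poses(energy, R0, T0, sigma_t, sigma_r, iters=130, burn_in=30):
    """Draw pose samples from p(H|x) proportional to exp(-E(H,x)).
    Proposal: T' ~ N(T_t, sigma_t^2 I); R' = Rhat @ R_t, with the Euler
    vector of Rhat drawn from N(0, sigma_r^2 I) -- symmetric by construction."""
    rng = np.random.default_rng()
    R_t, T_t = R0, T0
    e_t = energy(R_t, T_t)
    samples = []
    for t in range(iters):
        T_p = T_t + sigma_t * rng.standard_normal(3)         # translation proposal
        e_vec = sigma_r * rng.standard_normal(3)             # random Euler vector
        R_p = Rotation.from_rotvec(e_vec).as_matrix() @ R_t  # rotation proposal
        e_p = energy(R_p, T_p)
        # A(H'|H_t) = min(1, exp(E_t - E')): the Gibbs normalizer cancels.
        if rng.random() < np.exp(min(0.0, e_t - e_p)):
            R_t, T_t, e_t = R_p, T_p, e_p
        if t >= burn_in:  # the first 30 iterations are the burn-in phase
            samples.append((R_t, T_t))
    return samples
```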
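
Finally, one maximum-likelihood update from item 4, as a sketch assuming PyTorch and the `EnergyCNN` sketch above; `render(H)` is a hypothetical helper that rasterizes the object in pose $H$, and `pose_samples` come from the Metropolis sampler under the current parameters. The surrogate loss $E(H_i,x_i;\theta)-\frac{1}{N}\sum_k E(H_k,x_i;\theta)$ has exactly the gradient $-\frac{\partial}{\partial\theta_j}\ln p(H_i|x_i;\theta)$ given above.

```python
import torch

def ml_training_step(cnn, optimizer, observed, render, H_gt, pose_samples):
    """One SGD step on -ln p(H_i | x_i; theta).
    render(H) is a hypothetical rasterizer (not shown); pose_samples are
    drawn from the posterior with the current parameters theta."""
    e_gt = cnn(observed, render(H_gt))
    e_samples = torch.stack([cnn(observed, render(H)) for H in pose_samples])
    # grad = dE(H_gt)/dtheta - (1/N) sum_k dE(H_k)/dtheta
    loss = e_gt.mean() - e_samples.mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```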