MyDLNote-Event : ECCV 2020 Event Enhanced High-Quality Image Recovery (High-Quality Image Recovery with Event Cameras)

守給你的承諾、 Submitted on 2020-10-14 20:17:31

Event Enhanced High-Quality Image Recovery

Dataset, code, and more results are available at: https://github.com/ShinyWang33/eSL-Net (seemingly not yet available at the time of writing)

[paper] : https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123580154.pdf

 

My Note:

The major contributions of this paper are:

1. Jointly handling noise, motion blur and super-resolution in intensity image reconstruction for event cameras;

2. The whole network is interpretable: its architecture follows the structure of the sparse-coding solution;

3. A new dataset is constructed, containing HR sharp intensity images, LR sharp intensity images, LR noisy and motion-blurred images, and event camera sequences.

This is a supervised learning model: a series of degraded inputs is synthesized from given HR sharp images. This may look questionable: can artificially synthesized images really transfer to real scenes? The paper's conclusion explicitly states that they can.

Two possible improvements come to mind: (1) ideally, avoid synthetic data altogether; (2) without synthesis, paired images are hard to obtain, so unsupervised methods would have to be considered.

 


Abstract

With extremely high temporal resolution, event cameras have a large potential for robotics and computer vision. However, their asynchronous imaging mechanism often aggravates the measurement sensitivity to noises and brings a physical burden to increase the image spatial resolution.

To recover high-quality intensity images, one should address both denoising and super-resolution problems for event cameras. Since events depict brightness changes, with the enhanced degeneration model by the events, the clear and sharp high-resolution latent images can be recovered from the noisy, blurry and low-resolution intensity observations. Exploiting the framework of sparse learning, the events and the low-resolution intensity observations can be jointly considered.

Based on this, we propose an explainable network, an event-enhanced sparse learning network (eSL-Net), to recover the high-quality images from event cameras. After training with a synthetic dataset, the proposed eSL-Net can largely improve the performance of the state-of-the-art by 7-12 dB. Furthermore, without additional training process, the proposed eSL-Net can be easily extended to generate continuous frames with frame-rate as high as the events.


 

 

Introduction

Background:

Unlike standard frame-based cameras, event cameras are bio-inspired sensors that produce asynchronous events with extremely low latency (1 µs), and thus extremely high temporal resolution. They suffer no motion blur and are highly attractive for both low- and high-level vision tasks. However, the generated event stream only describes scene changes, not absolute intensity measurements. Meanwhile, the asynchronous data-driven mechanism prevents algorithms designed for standard cameras from being applied directly to event cameras. Reconstructing high-quality intensity images from the event stream is therefore a basic requirement for visualization and offers great potential.

Problem statement:

To achieve low latency, event cameras capture the brightness change of each pixel independently. This mechanism aggravates the measurement sensitivity to noise and makes it very hard to increase the spatial resolution. Recovering high-quality images from event cameras is thus a very challenging problem that requires solving the following issues simultaneously:

– Low frame-rate and blurry intensity images: The APS (Active Pixel Sensor) frames have a relatively low frame-rate (≥ 5 ms latency), and motion blur is inevitable when recording highly dynamic scenes.


– High level and mixed noises: The thermal effects or unstable light environment can produce a huge amount of noisy events. Together with the noises from APS frames, the reconstruction of intensity image would fall into a mixed noises problem.


– Low spatial-resolution: The leading commercial event cameras are typically with very low spatial-resolution. And there is a balance between the spatial-resolution and the latency.


Traditional methods:

Various methods have been proposed to address the noise problem in event-based image recovery. Barua et al. [3] first proposed a learning-based method with sparse regularization to smooth image gradients, and then recovered gray-scale images from the denoised gradients via Poisson integration. Munda et al. [19] introduced manifold regularization on the event time surface and proposed a real-time intensity reconstruction algorithm. With these hand-crafted regularizers, noise can be largely alleviated, but some artifacts (e.g., blurred edges) are also produced. Recent work has turned to convolutional neural networks (CNNs) for event-based intensity reconstruction, trained end-to-end on events and intensity images. CNN-based methods are generally able to reduce noise. However, manually designed networks often lack physical meaning, making it hard to process events and APS frames jointly.

Besides the noise problem, a super-resolution algorithm is urgently needed to further improve intensity reconstruction for high-level vision tasks such as face recognition, yet little progress has been made in this direction. A unified approach would be even more desirable.

Few studies can solve all three tasks above at the same time, leaving an open question: is it possible to find a unified framework that jointly considers denoising, deblurring and super-resolution? To answer it, this paper adopts sparse learning to address the three tasks. General degeneration models for low-resolution, noisy, blurry images usually assume the same blur kernel over the whole image. Events, however, record intensity changes at a very high temporal resolution, which can effectively enhance the degeneration model by representing the motion-blur effect. The enhanced degeneration model provides a way to recover HR sharp latent images from APS frames and their event sequences, and it can be solved by casting it into a sparse-learning framework, which also brings robustness to noise. The paper proposes eSL-Net to recover high-quality images for event cameras. In particular, eSL-Net trained on a synthetic dataset generalizes to real scenes, and without any additional training it can be easily extended to generate high frame-rate video by shifting the event sequence.

 

 

Related Works

  • Event-based Intensity Reconstruction:

Early attempts of reconstructing intensity from pure events are commonly based on the assumption of brightness constancy, i.e. static scenes [15]. The intensity reconstruction is then addressed by simultaneously estimating the camera movement, optical flow and intensity gradient [16]. In [6], Cook et al. propose a bio-inspired and interconnected network to simultaneously reconstruct intensity frames, optical flow and angular velocity for small rotation movements. Later on, Bardow et al. [2] formulate the intensity change and optical flow in a unified variational energy minimization framework. By optimization, one can simultaneously reconstruct the video frames together with the optical flow. On the other hand, another research line on intensity reconstruction is the direct event integration method [30,19,25], which does not rely on any assumption about the scene structure or motion dynamics.

In short, early attempts to reconstruct intensity from pure events rely on the brightness-constancy assumption (static scenes) and recover intensity by jointly estimating camera motion, optical flow and intensity gradients.

Another research line performs intensity reconstruction by direct event integration, without any assumption about scene structure or motion dynamics.

 

While the APS frames contain relatively abundant textures, events and APS frames can be used as complementary sources for event-based intensity reconstruction. In [30], events are approximated as the time differential of intensity frames. Based on this, a complementary filter is proposed as a fusion engine and nearly continuous-time intensity frames can be generated. Pan et al. [25] have proposed an event-based deblurring approach by relating blurry APS frames and events with an event-based double integration (EDI) model. Afterwards, a multiple-frame EDI model is proposed for high-rate video reconstruction by further considering frame-to-frame relations [24].


 

  • Event-based Super-resolution:

Even though event cameras have extremely high temporal frequency, the spatial (pixel) resolution is relatively low and not easy to improve physically [12]. Little progress has been made on event-based super-resolution. To the best of our knowledge, only one very recent work [5], called SRNet, was released while we were preparing this manuscript. Compared to SRNet, our proposed approach differs in the following aspects: (1) we propose a unified framework to simultaneously resolve the tasks of denoising, deblurring and super-resolution, while SRNet [5] cannot directly deal with blurry or noisy inputs; (2) the proposed network is completely interpretable with meaningful intermediate processes; (3) our framework reconstructs the intensity frame by fusing events and APS frames, while SRNet is proposed for reconstruction from pure events.


 

Problem Statement

Events and Intensity Images

  • Mathematical formulation of events

An event camera triggers an event whenever the change of the logarithm of the intensity exceeds a preset threshold \small c:

\small \log I_{xy}(t) - \log I_{xy}(t-\Delta t) = p\cdot c \qquad (1)

where \small I_{xy}(t) and \small I_{xy}(t-\Delta t) denote the instantaneous intensities at time \small t and \small t -\Delta t for a specific pixel location \small (x, y), \small \Delta t is the time since the last event at this pixel location, \small p \in \{+1,-1\} is the polarity representing the direction (increase or decrease) of the intensity change. Consequently, an event is made up of \small (x, y, t, p).

In order to facilitate the expression of events, for every location \small (x, y) in the image, we define \small e_{xy}(t) as a function of continuous time \small t such that:

\small e_{xy}(t) = p\,\delta (t-t_0) \qquad (2)

whenever there is an event \small (x, y, t_0, p). Here, \small \delta (\cdot) is the Dirac function. As a result, a sequence of discrete events is turned into a continuous-time signal.

(This part needs no further explanation.)
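As a rough illustration of (2), the sketch below discretizes time into bins so that each Dirac impulse becomes a finite spike whose integral over its bin equals the event polarity. The function name and the toy events are made up for illustration:

```python
import numpy as np

def events_to_signal(events, t_start, t_end, n_bins):
    """Discretize e(t) = sum_k p_k * delta(t - t_k) for one pixel: each Dirac
    impulse becomes a spike of area p in its time bin."""
    signal = np.zeros(n_bins)
    bin_width = (t_end - t_start) / n_bins
    for t0, p in events:                  # (timestamp, polarity) pairs
        if t_start <= t0 < t_end:
            signal[int((t0 - t_start) / bin_width)] += p / bin_width
    return signal

# three toy events at one pixel: two positive, one negative
sig = events_to_signal([(0.1, +1), (0.35, -1), (0.7, +1)], 0.0, 1.0, 10)
# integrating the discretized signal recovers the net polarity (+1 - 1 + 1)
net = np.sum(sig) * (1.0 / 10)
```

Integrating the discretized signal over the window recovers the net polarity count, mirroring how the continuous-time signal behaves under the integrals used later in the paper.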

  • Mathematical formulation of intensity images

In addition to the event sequence, many event cameras, e.g., DAVIS [4], can simultaneously provide grey-scale intensity images at a lower frame-rate. Mathematically, the f-th frame of the observed intensity image \small Y [f] during the exposure interval \small [t_f , t_f +T] can be modeled as an average of the sharp clear latent intensity images \small I(t):

\small Y[f] = \frac{1}{T}\int_{t_f}^{t_f+T} I(t)\,dt \qquad (3)

  • Relation between intensity images and events

Suppose that \small I_{xy}(t_r) is the sharp clear latent intensity image at any time \small t_r\in \small [t_f , t_f +T]. According to (1) and (2), we then have:

\small I_{xy}(t) = I_{xy}(t_r)\,\exp\left(c\int_{t_r}^{t} e_{xy}(s)\,ds\right) \qquad (4)

Since each pixel can be treated separately, subscripts x, y are often omitted henceforth.

Finally, considering all pixels, we can get a simple model connecting events, the observed intensity image and the latent intensity image:

\small Y[f] = E(t_r) \circ I(t_r) \qquad (5)

with \small E(t_r) = \frac{1}{T}\int_{t_f}^{t_f+T} \exp\left(c\int_{t_r}^{t} e(s)\,ds\right) dt being the double integral of events at time \small t_r and \small \circ denoting the Hadamard product.

Here the Hadamard product is simply element-wise multiplication.
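A tiny numpy sketch of model (5) with made-up values; in numpy the Hadamard product is just the `*` operator on same-shaped arrays:

```python
import numpy as np

E = np.array([[1.0, 0.5],
              [2.0, 1.0]])   # per-pixel double integral of events, E(t_r)
I = np.array([[100.0, 80.0],
              [60.0, 40.0]])  # latent sharp intensity image I(t_r)
Y = E * I                     # model (5): Y = E ∘ I, element-wise
# Y == [[100., 40.], [120., 40.]]
```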

 

Event Enhanced Degeneration Model

Practically, the non-ideality of sensors and the relative motion between cameras and target scenes may largely degrade the quality of the observed intensity image \small Y [f], making it noisy and blurry. Moreover, even though event cameras have extremely high temporal resolution, the spatial pixel resolution is relatively low due to physical limitations. With these considerations, (5) becomes:

\small Y[f] = E(t_r) \circ \left(P\,X(t_r)\right) + \varepsilon \qquad (6)

with ε the measuring noise which can be assumed to be white Gaussian, \small P the down-sampling operator and \small X(t_r) the latent clear image with high-resolution (HR) at time \small t_r. Consequently, (6) is the degeneration model where events are exploited to introduce the motion information.

Given the observed image \small Y [f], the corresponding triggered events and the specified time \small t_r\in \small [t_f , t_f +T], our goal is to reconstruct a high quality intensity image \small X at time \small t_r. Obviously, it is a multi-task and ill-posed problem where denoising, deblurring and super-resolution should be addressed simultaneously.
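The degeneration model (6) can be sketched as follows. The image sizes, the 0.9 event term, and the choice of average pooling for the downsampling operator \small P are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
s = 2                                   # downsampling scale factor
X = rng.uniform(50.0, 200.0, (8, 8))    # HR latent sharp image X(t_r)
E = np.full((4, 4), 0.9)                # event term on the LR grid

def downsample(img, s):
    """Average pooling as a simple stand-in for the linear operator P."""
    h, w = img.shape
    return img.reshape(h // s, s, w // s, s).mean(axis=(1, 3))

eps = rng.normal(0.0, 2.0, (4, 4))      # white Gaussian measurement noise
Y = E * downsample(X, s) + eps          # model (6): Y = E ∘ (P X) + ε
```

Recovering \small X from \small Y alone is ill-posed; the point of the paper is that the event term \small E injects the motion information that makes the inversion tractable.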


 

Event Enhanced High-Quality Image Recovery

Event-Enhanced Sparse Learning

Many methods were proposed for image denoising, deblurring and SR [37,11,23,9]. However, most of them cannot be applied to event cameras directly due to the asynchronous data-driven mechanism. Thanks to sparse learning, we can integrate the events into the sparsity framework and reconstruct satisfactory images to solve the aforementioned problems.


 

In this section, the time \small t_r and frame index \small f are temporarily omitted for simplicity. We arrange the image matrices as column vectors, i.e., \small Y \in R^{N\times 1} , I \in R^{N\times 1} , \varepsilon \in R^{N\times 1}~ and~ X \in R^{sN\times 1} ; the blurring operator can then be represented as \small E = diag(e_1, e_2, . . . , e_N ) \in R^{N\times N}, where \small e_1, e_2, . . . , e_N are the elements of the original blurring operator, and likewise \small P \in R^{N\times sN}, where \small s denotes the downsampling scale factor and \small N denotes the product of the height \small H and width \small W of the observed image \small Y. Then, according to (6), we have:

\small Y = EPX + \varepsilon \qquad (7)

The reconstruction from the observed image \small Y to the HR sharp clear image \small X is highly ill-posed due to the inevitable loss of information in the image degeneration process. Inspired by the success of Compressed Sensing [10], we assume that the LR sharp clear image \small I and the HR sharp clear image \small X can be sparsely represented on an LR dictionary \small D_I and an HR dictionary \small D_X, i.e., \small I = D_I \alpha_I and \small X = D_X \alpha _X, where \small \alpha _I and \small \alpha _X are known as sparse codes. Since the downsampling operator \small P is linear, the LR sharp clear image \small I and the HR sharp clear image \small X can share the same sparse code, i.e. \small \alpha = \alpha_I = \alpha_X, if the dictionaries \small D_I and \small D_X are defined properly. Therefore, given an observed image \small Y, we first need to find its sparse code on \small D_I by solving the LASSO [32] problem below:

\small \min_{\alpha}\ \frac{1}{2}\left\| Y - EPD_I\,\alpha \right\|_2^2 + \lambda\left\|\alpha\right\|_1 \qquad (8)

Notation:

\small I : the LR sharp (clean) intensity image;

\small Y : the LR observed (noisy, blurry) intensity image;

\small X : the HR sharp image to be recovered.

 

To solve (8), a common approach is the iterative shrinkage-thresholding algorithm (ISTA) [7]. At the n-th iteration, the sparse code is updated as:

\small \alpha^{n+1} = \Gamma_{\lambda/L}\left(\alpha^{n} - \frac{1}{L}(EPD_I)^{T}\left(EPD_I\,\alpha^{n} - Y\right)\right) \qquad (9)

where \small L is the Lipschitz constant and \small \Gamma _{\theta }(\beta ) = sign(\beta )\max(|\beta | - \theta , 0) denotes the element-wise soft thresholding function. After obtaining the optimal sparse code \small \alpha ^{\ast }, we can finally recover the HR sharp clear image \small X by:

\small X = D_X\,\alpha^{\ast} \qquad (10)

where \small D_X is the HR dictionary.

(9) is the iterative solution of the sparse code; the image is finally recovered via (10).
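A minimal numpy sketch of ISTA for the LASSO problem (8)-(9). Here a small random matrix `A` stands in for the compound operator \small EPD_I, and the problem size, λ, and iteration count are made up for illustration:

```python
import numpy as np

def soft_threshold(beta, theta):
    # Gamma_theta(beta) = sign(beta) * max(|beta| - theta, 0), element-wise
    return np.sign(beta) * np.maximum(np.abs(beta) - theta, 0.0)

def ista(A, y, lam, n_iter=500):
    """ISTA for min_a 0.5 * ||y - A a||_2^2 + lam * ||a||_1, as in (9)."""
    L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of the gradient
    a = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ a - y)         # gradient of the quadratic term
        a = soft_threshold(a - grad / L, lam / L)
    return a

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))        # stand-in for E P D_I
a_true = np.zeros(10)
a_true[[2, 7]] = [1.5, -2.0]             # sparse ground-truth code
y = A @ a_true
a_hat = ista(A, y, lam=0.01)             # converges close to a_true
```

Each loop iteration is exactly one application of (9): a gradient step on the data term followed by the soft-thresholding operator. eSL-Net unrolls a fixed number of these iterations into network layers.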

 

Network

Inspired by [14], we can solve the sparse coding problem efficiently by integrating it into the CNN architecture. Therefore we propose an Event-enhanced Sparse Learning Net (eSL-Net) to solve problems of noise, motion blur and low spatial resolution in a unified framework.

The basic idea of eSL-Net is to map the update steps of event-based intensity reconstruction method to a deep network architecture that consists of a fixed number of phases, each of which corresponds to one iteration of (9). Therefore eSL-Net is an interpretable deep network.


The whole eSL-Net architecture is shown in Fig. 2. Obviously, the most attractive part of the network is the iterative module corresponding to (9), in the green box. According to [26], ISTA is not affected when the coefficients in (9) are restricted to be nonnegative. It is then easy to see the equality between the soft nonnegative thresholding operator \small \Gamma _{\theta } and the ReLU activation function, so we use a ReLU layer to implement \small \Gamma _{\theta }. Convolution is a special kind of matrix multiplication, so we use convolution layers to implement the matrix multiplications. The plus node with three inputs in the green box then represents the sum inside \small \Gamma _{\theta } in (9).
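The equivalence between nonnegative soft thresholding and a bias-shifted ReLU can be checked directly; this tiny snippet is only an illustration:

```python
import numpy as np

def soft_threshold_nonneg(beta, theta):
    # nonnegative soft thresholding: max(beta - theta, 0)
    return np.maximum(beta - theta, 0.0)

def relu(x):
    return np.maximum(x, 0.0)

beta = np.array([-1.0, 0.2, 0.5, 3.0])
theta = 0.5
# Gamma_theta restricted to nonnegative codes equals ReLU with a bias shift
assert np.allclose(soft_threshold_nonneg(beta, theta), relu(beta - theta))
```

In the network, the threshold \small \theta is simply absorbed as a learnable bias before the ReLU.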

According to (5), \small E is the double integral of events. In the discrete case, the continuous integral turns into a discrete summation. More generally, we use a weighted summation, i.e., convolution, to replace the integral. As a result, through two convolution layers with suitable parameters, the event sequence input can be transformed into an approximation of \small E. Moreover, convolution also has some denoising effect on event sequences.

Finally, the output of the iterative module, the optimal sparse code \small \alpha ^{\ast }, is passed through an HR dictionary according to (10). In eSL-Net, we use convolution layers followed by a shuffle layer to implement the HR dictionary \small D_X, since the shuffle operator, which rearranges the pixels of different channels, can be regarded as a linear operator.

Fig. 2. eSL-Net Framework
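As a sketch of why the shuffle (sub-pixel) layer is a linear operator: it is a pure permutation of pixels, so it distributes over addition. The numpy implementation below mirrors the usual pixel-shuffle layout but is an illustrative stand-in, not the paper's code:

```python
import numpy as np

def pixel_shuffle(x, s):
    """Rearrange (C*s*s, H, W) feature maps into (C, H*s, W*s).
    This is a pure permutation of pixels, hence a linear operator."""
    c2, h, w = x.shape
    c = c2 // (s * s)
    x = x.reshape(c, s, s, h, w)
    x = x.transpose(0, 3, 1, 4, 2)   # -> (C, H, s, W, s)
    return x.reshape(c, h * s, w * s)

x = np.arange(16, dtype=float).reshape(4, 2, 2)  # 4 channels of 2x2, s = 2
y = pixel_shuffle(x, 2)                          # one 4x4 channel
# linearity: shuffle(a + b) == shuffle(a) + shuffle(b)
a = np.ones((4, 2, 2)); b = 2.0 * np.ones((4, 2, 2))
assert np.allclose(pixel_shuffle(a + b, 2),
                   pixel_shuffle(a, 2) + pixel_shuffle(b, 2))
```

Because the shuffle is linear, the convolution layers before it composed with the shuffle together act as one linear dictionary \small D_X, which is what makes this implementation of (10) consistent with the sparse-coding view.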

The whole eSL-Net architecture is shown in Fig. 2. Looking closely, it is essentially equation (9), with some of its operations implemented by convolutions and ReLU; that is what this whole passage describes.


 

 

Dataset Preparation

To train the proposed eSL-Net, a large number of LR blurry noisy images with corresponding HR ground-truth images and event sequences are required. However, no such large-scale dataset exists. Therefore, a new dataset is synthesized, containing LR blurry noisy images together with the corresponding HR sharp images and events. Section 7 shows that although the model is trained on synthetic data, eSL-Net is able to generalize to real scenes.

The dataset consists of four kinds of data:

HR clear images: We choose the continuous sharp clear images with resolution of 1280 × 720 from GoPro dataset [20] as our ground truth.

LR clear images: LR sharp clear images with resolution of 320 × 180 are obtained by downsampling the HR clear images with bicubic interpolation; they are used as the ground truth for eSL-Net without SR.

LR blurry images: The GoPro dataset [20] also provides LR blurry images, but we have to regenerate them because the exposure time was not taken into account. Mathematically, during the exposure, a motion blurry image can be simulated by averaging a series of sharp images at a high frame rate [21]. However, when the frame rate is insufficient, e.g. 120 fps in GoPro [20], simple time averaging would lead to unnatural spikes or steps in the blur trajectory [36]. To avoid this issue, we first increase the frame-rate of LR sharp clear images to 960 fps by the method in [22], and then generate LR blurry images by averaging 17 continuous LR sharp clear images. Besides, to better simulate the real situation, we add additional white Gaussian noise with standard deviation σ = 4 (σ = 4 is the approximate mean of the standard deviations of many smooth patches in APS frames in the real dataset) to the LR blurry images.

Event sequence: To simulate events, we resort to the open ESIM [28] which can generate events from a sequence of input images. For a given LR blurry image, we input the corresponding LR sharp clear images (960 fps) and obtain the corresponding event sequence. We add 30% (30% is artificially calculated approximate ratio of noise events to effective events in simple real scenes) noisy events with uniform random distribution to the sequence.

HR clear images: taken from the GoPro dataset [20];

LR clear images: downsampled from the GoPro images;

LR blurry images: obtained by averaging 17 consecutive LR sharp images to simulate motion blur, then adding noise to mimic real scenes;

Event sequence: generated with ESIM from the corresponding LR sharp image sequence (960 fps), with 30% noise events added.
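The LR blurry image synthesis described above (averaging 17 consecutive sharp frames, then adding white Gaussian noise with σ = 4) can be sketched as follows; the frame contents and the helper name are hypothetical:

```python
import numpy as np

def synthesize_blurry(sharp_frames, sigma=4.0, seed=0):
    """Average consecutive sharp frames to simulate motion blur over the
    exposure, then add white Gaussian noise of standard deviation sigma."""
    blurry = sharp_frames.mean(axis=0)
    noise = np.random.default_rng(seed).normal(0.0, sigma, blurry.shape)
    return np.clip(blurry + noise, 0.0, 255.0)

# 17 hypothetical consecutive LR sharp frames (tiny 4x4 stand-ins for 320x180)
frames = np.stack([np.full((4, 4), 100.0 + i) for i in range(17)])
lr_blurry = synthesize_blurry(frames)
```

With σ set to 0, the output is exactly the temporal average of the 17 frames, which is the idealized blur model of [21]; the noise term then mimics the APS sensor noise measured in the real dataset.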

 

 

 
