No More Strided Convolutions or Pooling: A New CNN Building Block for Low-Resolution Images and Small Objects
Abstract
Convolutional neural networks (CNNs) have achieved resounding success in many computer vision tasks such as image classification and object detection. However, their performance degrades rapidly on tougher tasks where images are of low resolution or objects are small. In this paper, we point out that this is rooted in a defective yet common design in existing CNN architectures, namely the use of strided convolution and/or pooling layers, which results in a loss of fine-grained information and the learning of less effective feature representations. To this end, we propose a new CNN building block called SPD-Conv in place of each strided convolution layer and each pooling layer (thus eliminating them altogether). SPD-Conv consists of a space-to-depth (SPD) layer followed by a non-strided convolution (Conv) layer, and can be applied in most if not all CNN architectures. We explain this new design under two of the most representative computer vision tasks: object detection and image classification. We then create new CNN architectures by applying SPD-Conv to YOLOv5 and ResNet, and empirically show that our approach significantly outperforms state-of-the-art deep learning models, especially on tougher tasks with low-resolution images and small objects.
We have open-sourced our code at https://github.com/LabSAINT/SPD-Conv.
1 Introduction
Since AlexNet [18], convolutional neural networks (CNNs) have excelled at many computer vision tasks. For example, in image classification, well-known CNN models include AlexNet, VGGNet [30], ResNet [13], etc.; in object detection, they include the R-CNN series [9,28], the YOLO series [26,4], SSD [24], EfficientDet [34], and so on. However, all such CNN models need "good quality" inputs (fine images, medium to large objects) in both training and inference. For example, AlexNet was originally trained and evaluated on 227×227 clear images, but after reducing the image resolution to 1/4 and 1/8, its classification accuracy drops by 14% and 30%, respectively [16]. Similar observations were made on VGGNet and ResNet [16]. In the case of object detection, SSD suffers from a remarkable mAP loss of 34.1 on 1/4-resolution images, or equivalently on 1/4 smaller-size objects, as demonstrated in [11]. In fact, small object detection is a very challenging task because smaller objects inherently have lower resolution and offer limited context information for a model to learn from. Moreover, they often (unfortunately) co-exist with large objects in the same image, and the large objects tend to dominate the feature learning process, leaving the small objects undetected.
In this paper, we contend that such performance degradation is rooted in a defective yet common design in existing CNNs: the use of strided convolution and/or pooling, especially in the earlier layers of a CNN architecture. The adverse effect of this design usually does not manifest itself because most scenarios being studied are "amiable", with images of good resolution and objects of fair sizes; there is thus plenty of redundant pixel information that strided convolution and pooling can conveniently skip, and the model can still learn features quite well. However, in tougher tasks where images are blurry or objects are small, the lavish assumption of redundant information no longer holds, and the current design starts to suffer from loss of fine-grained information and poorly learned features.
To address this problem, we propose a new building block for CNNs, called SPD-Conv, in substitution of (and thus eliminating) strided convolution and pooling layers altogether. SPD-Conv is a space-to-depth (SPD) layer followed by a non-strided (i.e., vanilla) convolution layer; a minimal sketch is given right after the contribution list below. The SPD layer downsamples a feature map X but retains all the information in the channel dimension, so there is no information loss. We were inspired by an image transformation technique [29] that rescales a raw image before feeding it into a neural net, but we substantially generalize it to downsampling feature maps inside and throughout the entire network; furthermore, we add a non-strided convolution operation after each SPD to reduce the (increased) number of channels using learnable parameters in the added convolution layer. Our proposed approach is both general and unified, in that SPD-Conv (i) can be applied to most if not all CNN architectures and (ii) replaces strided convolution and pooling in the same way. In summary, this paper makes the following contributions:
- We identify a defective yet common design in existing CNN architectures and propose a new building block called SPD-Conv in lieu of the old design. SPD-Conv downsamples feature maps without losing learnable information, completely jettisoning strided convolution and pooling operations, which are widely used nowadays.
- SPD-Conv represents a general and unified approach, which can be easily applied to most if not all deep learning based computer vision tasks.
- Using two of the most representative computer vision tasks, object detection and image classification, we evaluate the performance of SPD-Conv. Specifically, we construct YOLOv5-SPD, ResNet18-SPD, and ResNet50-SPD, and evaluate them on the COCO-2017, Tiny ImageNet, and CIFAR-10 datasets in comparison with several state-of-the-art deep learning models. The results demonstrate significant performance improvements in AP and top-1 accuracy, especially on small objects and low-resolution images. See Fig. 1 for a preview.
Fig. 1: Comparing AP for small objects (AP_S). "SPD" indicates our approach.
- SPD-Conv can be easily integrated into popular deep learning libraries such as PyTorch and TensorFlow, potentially producing greater impact; a drop-in replacement sketch is given after the source-code link below.
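To make the building block concrete, the following is a minimal PyTorch sketch of SPD-Conv with a downsampling scale of 2. The module name SPDConv, the 3×3 kernel, and the channel sizes in the example are illustrative assumptions for this sketch, not necessarily the exact choices in our released implementation (see the repository linked below for that).

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Space-to-depth followed by a non-strided (stride-1) convolution."""

    def __init__(self, in_channels: int, out_channels: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        # SPD multiplies the channel count by scale**2; the stride-1
        # convolution then reduces it to out_channels via learnable weights.
        self.conv = nn.Conv2d(in_channels * scale ** 2, out_channels,
                              kernel_size=3, stride=1, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = self.scale
        b, c, h, w = x.shape
        # Space-to-depth: fold each non-overlapping s x s spatial block into
        # the channel dimension -- a lossless downsampling by a factor of s.
        # (torch.nn.functional.pixel_unshuffle performs an equivalent
        # rearrangement, up to channel ordering.)
        x = x.view(b, c, h // s, s, w // s, s)
        x = x.permute(0, 3, 5, 1, 2, 4).contiguous()
        x = x.view(b, c * s * s, h // s, w // s)
        return self.conv(x)

# Example: a 64-channel 32x32 feature map is downsampled to 16x16 without
# discarding any pixels, then projected to 128 channels.
x = torch.randn(1, 64, 32, 32)
y = SPDConv(64, 128)(x)
print(y.shape)  # torch.Size([1, 128, 16, 16])
```

Note that, unlike a stride-2 convolution or a pooling layer, every input pixel contributes to the downsampled output: the reduction in spatial resolution is exactly compensated by the increase in channels before the learnable convolution is applied.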
Our source code is available at https://github.com/LabSAINT/SPD-Conv.
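As an illustration of the integration path mentioned above, here is a simplified sketch of one way to retrofit an existing PyTorch model: it recursively walks the module tree and swaps every stride-2 convolution for the SPDConv block sketched earlier. The helper replace_strided_convs is hypothetical and assumes a uniform stride of 2; a faithful port would also replace pooling layers and retrain the network.

```python
import torch.nn as nn

def replace_strided_convs(model: nn.Module) -> nn.Module:
    """Recursively swap every stride-2 Conv2d for an SPD-Conv block.

    Illustrative only: assumes the SPDConv module sketched above; pooling
    layers would need a similar pass, and the model must be retrained.
    """
    for name, child in model.named_children():
        if isinstance(child, nn.Conv2d) and child.stride == (2, 2):
            setattr(model, name, SPDConv(child.in_channels,
                                         child.out_channels, scale=2))
        else:
            replace_strided_convs(child)
    return model
```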
The rest of this paper is organized as follows. Section 2 presents background and reviews related work. Section 3 describes our proposed approach, and Section 4 presents two case studies on object detection and image classification. Section 5 provides the performance evaluation, and Section 6 concludes the paper.