Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Abstract
This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures. The code and models are publicly available at https://github.com/microsoft/Swin-Transformer.
1. Introduction
Modeling in computer vision has long been dominated by convolutional neural networks (CNNs). Beginning with AlexNet [35] and its revolutionary performance on the ImageNet image classification challenge, CNN architectures have evolved to become increasingly powerful through greater scale [27, 69], more extensive connections [31], and more sophisticated forms of convolution [64, 17, 75]. With CNNs serving as backbone networks for a variety of vision tasks, these architectural advances have led to performance improvements that have broadly lifted the entire field.
On the other hand, the evolution of network architectures in natural language processing (NLP) has taken a different path, where the prevalent architecture today is instead the Transformer [58]. Designed for sequence modeling and transduction tasks, the Transformer is notable for its use of attention to model long-range dependencies in the data. Its tremendous success in the language domain has led researchers to investigate its adaptation to computer vision, where it has recently demonstrated promising results on certain tasks, specifically image classification [19] and joint vision-language modeling [43].
Figure 1. (a) The proposed Swin Transformer builds hierarchical feature maps by merging image patches (shown in gray) in deeper layers and has linear computation complexity to input image size due to computation of self-attention only within each local window (shown in red). It can thus serve as a general-purpose backbone for both image classification and dense recognition tasks. (b) In contrast, previous vision Transformers [19] produce feature maps of a single low resolution and have quadratic computation complexity to input image size due to computation of self-attention globally.
In this paper, we seek to expand the applicability of Transformer such that it can serve as a general-purpose backbone for computer vision, as it does for NLP and as CNNs do in vision. We observe that significant challenges in transferring its high performance in the language domain to the visual domain can be explained by differences between the two modalities. One of these differences involves scale. Unlike the word tokens that serve as the basic elements of processing in language Transformers, visual elements can vary substantially in scale, a problem that receives attention in tasks such as object detection [38, 49, 50]. In existing Transformer-based models [58, 19], tokens are all of a fixed scale, a property unsuitable for these vision applications. Another difference is the much higher resolution of pixels in images compared to words in passages of text. There exist many vision tasks such as semantic segmentation that require dense prediction at the pixel level, and this would be intractable for Transformer on high-resolution images, as the computational complexity of its self-attention is quadratic to image size. To overcome these issues, we propose a general-purpose Transformer backbone, called Swin Transformer, which constructs hierarchical feature maps and has linear computational complexity to image size. As illustrated in Figure 1(a), Swin Transformer constructs a hierarchical representation by starting from small-sized patches (outlined in gray) and gradually merging neighboring patches in deeper Transformer layers. With these hierarchical feature maps, the Swin Transformer model can conveniently leverage advanced techniques for dense prediction such as feature pyramid networks (FPN) [38] or U-Net [47]. The linear computational complexity is achieved by computing self-attention locally within non-overlapping windows that partition an image (outlined in red). The number of patches in each window is fixed, and thus the complexity becomes linear to image size. These merits make Swin Transformer suitable as a general-purpose backbone for various vision tasks, in contrast to previous Transformer based architectures [19] which produce feature maps of a single resolution and have quadratic complexity.
Figure 2. An illustration of the shifted window approach for computing self-attention in the proposed Swin Transformer architecture. In layer l (left), a regular window partitioning scheme is adopted, and self-attention is computed within each window. In the next layer l + 1 (right), the window partitioning is shifted, resulting in new windows. The self-attention computation in the new windows crosses the boundaries of the previous windows in layer l, providing connections among them.
A key design element of Swin Transformer is its shift of the window partition between consecutive self-attention layers, as illustrated in Figure 2. The shifted windows bridge the windows of the preceding layer, providing connections among them that significantly enhance modeling power (see Table 4). This strategy is also efficient with regard to real-world latency: all query patches within a window share the same key set, which facilitates memory access in hardware. In contrast, earlier sliding window based self-attention approaches [30, 46] suffer from high latency on general hardware due to different key sets for different query pixels. Our experiments show that the proposed shifted window approach has much lower latency than the sliding window method, yet is similar in modeling power (see Tables 5 and 6). The shifted window approach also proves beneficial for all-MLP architectures [56].
The proposed Swin Transformer achieves strong performance on the recognition tasks of image classification, object detection and semantic segmentation. It outperforms the ViT / DeiT [19, 57] and ResNe(X)t models [27, 64] significantly with similar latency on the three tasks. Its 58.7 box AP and 51.1 mask AP on the COCO test-dev set surpass the previous state-of-the-art results by +2.7 box AP (Copy-paste [23] without external data) and +2.6 mask AP (DetectoRS [42]). On ADE20K semantic segmentation, it obtains 53.5 mIoU on the val set, an improvement of +3.2 mIoU over the previous state-of-the-art (SETR [73]). It also achieves a top-1 accuracy of 87.3% on ImageNet-1K image classification.
It is our belief that a unified architecture across computer vision and natural language processing could benefit both fields, since it would facilitate joint modeling of visual and textual signals and the modeling knowledge from both domains can be more deeply shared. We hope that Swin Transformer’s strong performance on various vision problems can drive this belief deeper in the community and encourage unified modeling of vision and language signals.
2. Related Work
CNN and variants
CNNs serve as the standard network model throughout computer vision. While the CNN has existed for several decades [36], it was not until the introduction of AlexNet [35] that the CNN took off and became mainstream. Since then, deeper and more effective convolutional neural architectures have been proposed to further propel the deep learning wave in computer vision, e.g., VGG [48], GoogleNet [53], ResNet [27], DenseNet [31], HRNet [59], and EfficientNet [54]. In addition to these architectural advances, there has also been much work on improving individual convolution layers, such as depth-wise convolution [64] and deformable convolution [17, 75]. While the CNN and its variants are still the primary backbone architectures for computer vision applications, we highlight the strong potential of Transformer-like architectures for unified modeling between vision and language. Our work achieves strong performance on several basic visual recognition tasks, and we hope it will contribute to a modeling shift.
Self-attention based backbone architectures
Also inspired by the success of self-attention layers and Transformer architectures in the NLP field, some works employ self-attention layers to replace some or all of the spatial convolution layers in the popular ResNet [30, 46, 72]. In these works, the self-attention is computed within a local window of each pixel to expedite optimization [30], and they achieve slightly better accuracy/FLOPs trade-offs than the counterpart ResNet architecture. However, their costly memory access causes their actual latency to be significantly larger than that of the convolutional networks [30]. Instead of using sliding windows, we propose to shift windows between consecutive layers, which allows for a more efficient implementation in general hardware.
Self-attention/Transformers to complement CNNs
Another line of work is to augment a standard CNN architecture with self-attention layers or Transformers. The self-attention layers can complement backbones [61, 7, 3, 65, 21, 68, 51] or head networks [29, 24] by providing the capability to encode distant dependencies or heterogeneous interactions. More recently, the encoder-decoder design in Transformer has been applied for the object detection and instance segmentation tasks [8, 13, 76, 52]. Our work explores the adaptation of Transformers for basic visual feature extraction and is complementary to these works.
Transformer based vision backbones
Most related to our work is the Vision Transformer (ViT) [19] and its follow-ups [57, 66, 15, 25, 60]. The pioneering work of ViT directly applies a Transformer architecture on non-overlapping medium-sized image patches for image classification. It achieves an impressive speed-accuracy trade-off on image classification compared to convolutional networks. While ViT requires large-scale training datasets (i.e., JFT-300M) to perform well, DeiT [57] introduces several training strategies that allow ViT to also be effective using the smaller ImageNet-1K dataset. The results of ViT on image classification are encouraging, but its architecture is unsuitable for use as a general-purpose backbone network on dense vision tasks or when the input image resolution is high, due to its low-resolution feature maps and the quadratic increase in complexity with image size. There are a few works applying ViT models to the dense vision tasks of object detection and semantic segmentation by direct upsampling or deconvolution but with relatively lower performance [2, 73]. Concurrent to our work are some that modify the ViT architecture [66, 15, 25] for better image classification. Empirically, we find our Swin Transformer architecture to achieve the best speed-accuracy trade-off among these methods on image classification, even though our work focuses on general-purpose performance rather than specifically on classification. Another concurrent work [60] explores a similar line of thinking to build multi-resolution feature maps on Transformers. Its complexity is still quadratic to image size, while ours is linear and also operates locally which has proven beneficial in modeling the high correlation in visual signals [32, 22, 37]. Our approach is both efficient and effective, achieving state-of-the-art accuracy on both COCO object detection and ADE20K semantic segmentation.
3. Method
3.1. Overall Architecture
Figure 3. (a) The architecture of a Swin Transformer (Swin-T); (b) two successive Swin Transformer Blocks (notation presented with Eq. (3)). W-MSA and SW-MSA are multi-head self attention modules with regular and shifted windowing configurations, respectively.
An overview of the Swin Transformer architecture is presented in Figure 3, which illustrates the tiny version (Swin-T). It first splits an input RGB image into non-overlapping patches by a patch splitting module, like ViT. Each patch is treated as a “token” and its feature is set as a concatenation of the raw pixel RGB values. In our implementation, we use a patch size of 4×4 and thus the feature dimension of each patch is 4 × 4 × 3 = 48. A linear embedding layer is applied on this raw-valued feature to project it to an arbitrary dimension (denoted as C).
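To make the tokenization concrete, below is a minimal PyTorch sketch of a 4×4 patch partition followed by the linear embedding to dimension C. It is an illustration under stated assumptions, not the released implementation; the class and variable names are our own.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping 4x4 patches and project to dimension C."""
    def __init__(self, patch_size=4, in_chans=3, embed_dim=96):
        super().__init__()
        # A strided convolution is equivalent to flattening each 4x4x3 patch
        # (48 values) and applying a shared linear layer.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.proj(x)                     # (B, C, H/4, W/4)
        return x.flatten(2).transpose(1, 2)  # (B, H/4 * W/4, C) patch tokens

tokens = PatchEmbed(embed_dim=96)(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 3136, 96]), since 56 * 56 = 3136
```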
Several Transformer blocks with modified self-attention computation (Swin Transformer blocks) are applied on these patch tokens. The Transformer blocks maintain the number of tokens (H/4 × W/4), and together with the linear embedding are referred to as “Stage 1”.
To produce a hierarchical representation, the number of tokens is reduced by patch merging layers as the network gets deeper. The first patch merging layer concatenates the features of each group of 2 × 2 neighboring patches, and applies a linear layer on the 4C-dimensional concatenated features. This reduces the number of tokens by a multiple of 2×2 = 4 (2× downsampling of resolution), and the output dimension is set to 2C. Swin Transformer blocks are applied afterwards for feature transformation, with the resolution kept at H/8 × W/8. This first block of patch merging and feature transformation is denoted as “Stage 2”. The procedure is repeated twice, as “Stage 3” and “Stage 4”, with output resolutions of H/16 × W/16 and H/32 × W/32, respectively. These stages jointly produce a hierarchical representation, with the same feature map resolutions as those of typical convolutional networks, e.g., VGG [48] and ResNet [27]. As a result, the proposed architecture can conveniently replace the backbone networks in existing methods for various vision tasks.
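A rough PyTorch sketch of the patch merging step described above (class and argument names are illustrative, not the official code): each group of 2 × 2 neighboring tokens is concatenated to a 4C-dimensional vector and reduced to 2C by a linear layer.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Concatenate each 2x2 group of neighboring patch tokens (4C) and reduce to 2C."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x, H, W):                        # x: (B, H*W, C)
        B, L, C = x.shape
        x = x.view(B, H, W, C)
        # gather the four tokens of every 2x2 neighborhood
        x0, x1 = x[:, 0::2, 0::2, :], x[:, 1::2, 0::2, :]
        x2, x3 = x[:, 0::2, 1::2, :], x[:, 1::2, 1::2, :]
        x = torch.cat([x0, x1, x2, x3], dim=-1)        # (B, H/2, W/2, 4C)
        x = x.view(B, (H // 2) * (W // 2), 4 * C)
        return self.reduction(self.norm(x))            # (B, H/2 * W/2, 2C)
```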
Swin Transformer block
Swin Transformer is built by replacing the standard multi-head self attention (MSA) module in a Transformer block by a module based on shifted windows (described in Section 3.2), with other layers kept the same. As illustrated in Figure 3(b), a Swin Transformer block consists of a shifted window based MSA module, followed by a 2-layer MLP with GELU nonlinearity in between. A LayerNorm (LN) layer is applied before each MSA module and each MLP, and a residual connection is applied after each module.
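The block layout can be summarized with a minimal pseudo-module; the attention internals are omitted, and the `attn_fn` argument standing in for W-MSA or SW-MSA is an assumption for illustration.

```python
import torch.nn as nn

class SwinBlockSkeleton(nn.Module):
    """LN -> (S)W-MSA -> residual, then LN -> 2-layer MLP (GELU) -> residual."""
    def __init__(self, dim, attn_fn, mlp_ratio=4):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = attn_fn                      # W-MSA or SW-MSA module
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim))

    def forward(self, x):
        x = x + self.attn(self.norm1(x))         # pre-norm attention with residual
        x = x + self.mlp(self.norm2(x))          # pre-norm MLP with residual
        return x
```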
3.2. Shifted Window based Self-Attention
The standard Transformer architecture [58] and its adaptation for image classification [19] both conduct global self-attention, where the relationships between a token and all other tokens are computed. The global computation leads to quadratic complexity with respect to the number of tokens, making it unsuitable for many vision problems requiring an immense set of tokens for dense prediction or to represent a high-resolution image.
Self-attention in non-overlapped windows
For efficient modeling, we propose to compute self-attention within local windows. The windows are arranged to evenly partition the image in a non-overlapping manner. Supposing each window contains M × M patches, the computational complexity of a global MSA module and a window based one on an image of h × w patches are:
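(with $C$ the channel dimension of the tokens; these are Eqs. (1) and (2) of the paper)

$$\Omega(\text{MSA}) = 4hwC^2 + 2(hw)^2C, \qquad \Omega(\text{W-MSA}) = 4hwC^2 + 2M^2hwC,$$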
where the former is quadratic to patch number hw, and the latter is linear when M is fixed (set to 7 by default). Global self-attention computation is generally unaffordable for a large hw, while the window based self-attention is scalable.
Shifted window partitioning in successive blocks
The window-based self-attention module lacks connections across windows, which limits its modeling power. To introduce cross-window connections while maintaining the efficient computation of non-overlapping windows, we propose a shifted window partitioning approach which alternates between two partitioning configurations in consecutive Swin Transformer blocks.
As illustrated in Figure 2, the first module uses a regular window partitioning strategy which starts from the top-left pixel, and the 8 × 8 feature map is evenly partitioned into 2 × 2 windows of size 4 × 4 (M = 4). Then, the next module adopts a windowing configuration that is shifted from that of the preceding layer, by displacing the windows by $(\lfloor\frac{M}{2}\rfloor, \lfloor\frac{M}{2}\rfloor)$ pixels from the regularly partitioned windows.
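A minimal sketch of the regular window partitioning and its inverse (function names are illustrative; padding for resolutions not divisible by M is omitted):

```python
import torch

def window_partition(x, window_size):
    """(B, H, W, C) -> (num_windows*B, window_size, window_size, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)

def window_reverse(windows, window_size, H, W):
    """Inverse of window_partition."""
    B = windows.shape[0] // ((H // window_size) * (W // window_size))
    x = windows.view(B, H // window_size, W // window_size, window_size, window_size, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

# e.g. an 8x8 feature map with M = 4 yields 2x2 = 4 windows of 4x4 patches
wins = window_partition(torch.randn(1, 8, 8, 96), window_size=4)
print(wins.shape)  # torch.Size([4, 4, 4, 96])
```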
With the shifted window partitioning approach, consecutive Swin Transformer blocks are computed as
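(consecutive blocks alternate W-MSA and SW-MSA; this is Eq. (3) of the paper)

$$
\begin{aligned}
\hat{z}^{l} &= \text{W-MSA}\big(\text{LN}(z^{l-1})\big) + z^{l-1}, \qquad &
z^{l} &= \text{MLP}\big(\text{LN}(\hat{z}^{l})\big) + \hat{z}^{l},\\
\hat{z}^{l+1} &= \text{SW-MSA}\big(\text{LN}(z^{l})\big) + z^{l}, \qquad &
z^{l+1} &= \text{MLP}\big(\text{LN}(\hat{z}^{l+1})\big) + \hat{z}^{l+1},
\end{aligned}
$$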
where $\hat{z}^l$ and $z^l$ denote the output features of the (S)W-MSA module and the MLP module for block $l$, respectively; W-MSA and SW-MSA denote window based multi-head self-attention using regular and shifted window partitioning configurations, respectively.
The shifted window partitioning approach introduces connections between neighboring non-overlapping windows in the previous layer and is found to be effective in image classification, object detection, and semantic segmentation, as shown in Table 4.
Efficient batch computation for shifted configuration
An issue with shifted window partitioning is that it will result in more windows, from $\lceil\frac{h}{M}\rceil \times \lceil\frac{w}{M}\rceil$ to $(\lceil\frac{h}{M}\rceil+1) \times (\lceil\frac{w}{M}\rceil+1)$ in the shifted configuration, and some of the windows will be smaller than M × M. A naive solution is to pad the smaller windows to a size of M × M and mask out the padded values when computing attention. When the number of windows in regular partitioning is small, e.g. 2 × 2, the increased computation with this naive solution is considerable (2 × 2 → 3 × 3, which is 2.25 times greater). Here, we propose a more efficient batch computation approach by cyclic-shifting toward the top-left direction, as illustrated in Figure 4. After this shift, a batched window may be composed of several sub-windows that are not adjacent in the feature map, so a masking mechanism is employed to limit self-attention computation to within each sub-window. With the cyclic-shift, the number of batched windows remains the same as that of regular window partitioning, and thus is also efficient. The low latency of this approach is shown in Table 5.
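A minimal sketch of the cyclic shift and mask construction described above (function and variable names are illustrative; the real implementation also reverses the shift after attention and handles arbitrary resolutions):

```python
import torch

def cyclic_shift_and_mask(x, window_size, shift_size):
    """Roll the feature map toward the top-left and build an additive attention mask
    that confines self-attention to each original sub-window after batching."""
    B, H, W, C = x.shape
    shifted = torch.roll(x, shifts=(-shift_size, -shift_size), dims=(1, 2))

    # label every position by the sub-window it originated from
    img_mask = torch.zeros(1, H, W, 1)
    cnt = 0
    spans = (slice(0, -window_size), slice(-window_size, -shift_size), slice(-shift_size, None))
    for h in spans:
        for w in spans:
            img_mask[:, h, w, :] = cnt
            cnt += 1

    # partition the label map into windows (same layout as the feature windows)
    mask_windows = img_mask.view(1, H // window_size, window_size, W // window_size, window_size, 1)
    mask_windows = mask_windows.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size)

    # pairs coming from different sub-windows get a large negative bias
    attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2)
    attn_mask = attn_mask.masked_fill(attn_mask != 0, float(-100.0))
    return shifted, attn_mask  # attn_mask: (num_windows, M*M, M*M), added to attention logits

shifted, mask = cyclic_shift_and_mask(torch.randn(1, 8, 8, 96), window_size=4, shift_size=2)
print(shifted.shape, mask.shape)  # torch.Size([1, 8, 8, 96]) torch.Size([4, 16, 16])
```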
Relative position bias
In computing self-attention, we follow [45, 1, 29, 30] by including a relative position bias $B \in \mathbb{R}^{M^2 \times M^2}$ to each head in computing similarity:
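(this is the scaled dot-product attention of Eq. (4), with the bias added before the softmax)

$$\text{Attention}(Q, K, V) = \text{SoftMax}\!\left(QK^{T}/\sqrt{d} + B\right)V,$$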
where $Q, K, V \in \mathbb{R}^{M^2 \times d}$ are the query, key and value matrices; $d$ is the query/key dimension, and $M^2$ is the number of patches in a window. Since the relative position along each axis lies in the range $[-M+1, M-1]$, we parameterize a smaller-sized bias matrix $\hat{B} \in \mathbb{R}^{(2M-1) \times (2M-1)}$, and values in $B$ are taken from $\hat{B}$.
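One way to realize the lookup from $\hat{B}$ into $B$, shown as a hedged sketch (variable names are our own, not the released code): build a $(2M-1)^2$-entry learnable table per head and an $M^2 \times M^2$ index of relative offsets.

```python
import torch
import torch.nn as nn

M, num_heads = 7, 3
# learnable bias table: one (2M-1) x (2M-1) grid of parameters per head
bias_table = nn.Parameter(torch.zeros((2 * M - 1) ** 2, num_heads))

# relative coordinates between every pair of positions inside an MxM window
coords = torch.stack(torch.meshgrid(torch.arange(M), torch.arange(M), indexing="ij"))
coords = coords.flatten(1)                                   # (2, M*M)
rel = coords[:, :, None] - coords[:, None, :]                # (2, M*M, M*M), values in [-M+1, M-1]
rel = rel.permute(1, 2, 0) + (M - 1)                         # shift to [0, 2M-2]
index = rel[:, :, 0] * (2 * M - 1) + rel[:, :, 1]            # (M*M, M*M) flat index into the table

# B in the attention formula: gather and reshape to (num_heads, M*M, M*M)
B = bias_table[index.view(-1)].view(M * M, M * M, num_heads).permute(2, 0, 1)
print(B.shape)  # torch.Size([3, 49, 49])
```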
We observe significant improvements over counterparts without this bias term or that use absolute position embedding, as shown in Table 4. Further adding absolute position embedding to the input as in [19] drops performance slightly, thus it is not adopted in our implementation.
The learnt relative position bias in pre-training can be also used to initialize a model for fine-tuning with a different window size through bi-cubic interpolation [19, 57].
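A sketch of how a pre-trained bias table could be resized for a different window size with bi-cubic interpolation (the helper name and exact reshaping are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def resize_bias_table(table, old_M, new_M):
    """Interpolate a ((2*old_M - 1)^2, num_heads) bias table to a new window size."""
    num_heads = table.shape[1]
    side_old, side_new = 2 * old_M - 1, 2 * new_M - 1
    grid = table.t().reshape(1, num_heads, side_old, side_old)      # (1, heads, S, S)
    grid = F.interpolate(grid, size=(side_new, side_new), mode="bicubic", align_corners=False)
    return grid.reshape(num_heads, side_new * side_new).t()         # ((2*new_M - 1)^2, heads)

new_table = resize_bias_table(torch.randn(13 * 13, 4), old_M=7, new_M=12)
print(new_table.shape)  # torch.Size([529, 4]), since 2*12 - 1 = 23 and 23*23 = 529
```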
3.3. Architecture Variants
We build our base model, called Swin-B, to have model size and computation complexity similar to ViT-B/DeiT-B. We also introduce Swin-T, Swin-S and Swin-L, which are versions of about 0.25×, 0.5× and 2× the model size and computational complexity, respectively. Note that the complexities of Swin-T and Swin-S are similar to those of ResNet-50 (DeiT-S) and ResNet-101, respectively. The window size is set to M = 7 by default. The query dimension of each head is d = 32, and the expansion layer of each MLP is α = 4, for all experiments. The architecture hyper-parameters of these model variants are:
Swin-T: C = 96, layer numbers = {2, 2, 6, 2}
Swin-S: C = 96, layer numbers = {2, 2, 18, 2}
Swin-B: C = 128, layer numbers = {2, 2, 18, 2}
Swin-L: C = 192, layer numbers = {2, 2, 18, 2}
where C is the channel number of the hidden layers in the first stage. The model size, theoretical computational complexity (FLOPs), and throughput of the model variants for ImageNet image classification are listed in Table 1.
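Restated as a small configuration table for readability (field names are illustrative):

```python
# C = first-stage channel dim; depths = number of Swin Transformer blocks per stage.
SWIN_VARIANTS = {
    "Swin-T": {"C": 96,  "depths": (2, 2, 6, 2)},
    "Swin-S": {"C": 96,  "depths": (2, 2, 18, 2)},
    "Swin-B": {"C": 128, "depths": (2, 2, 18, 2)},
    "Swin-L": {"C": 192, "depths": (2, 2, 18, 2)},
}
# Shared settings across variants: window size M = 7, per-head dim d = 32, MLP ratio 4.
```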
4. Experiments
We conduct experiments on ImageNet-1K image classification [18], COCO object detection [39], and ADE20K semantic segmentation [74]. In the following, we first compare the proposed Swin Transformer architecture with the previous state-of-the-arts on the three tasks. Then, we ablate the important design elements of Swin Transformer.
4.1. Image Classification on ImageNet-1K
Settings
For image classification, we benchmark the proposed Swin Transformer on ImageNet-1K [18], which contains 1.28M training images and 50K validation images from 1,000 classes. The top-1 accuracy on a single crop is reported. We consider two training settings:
- Regular ImageNet-1K training. This setting mostly follows [57]. We employ an AdamW [33] optimizer for 300 epochs using a cosine decay learning rate scheduler and 20 epochs of linear warm-up. A batch size of 1024, an initial learning rate of 0.001, and a weight decay of 0.05 are used. We include most of the augmentation and regularization strategies of [57] in training, except for repeated augmentation [28] and EMA [41], which do not enhance performance. Note that this is contrary to [57] where repeated augmentation is crucial to stabilize the training of ViT.
- Pre-training on ImageNet-22K and fine-tuning on ImageNet-1K. We also pre-train on the ImageNet-22K dataset, which contains 14.2 million images and 22K classes. We employ an AdamW optimizer for 90 epochs using a cosine learning rate scheduler with a 5-epoch linear warm-up. A batch size of 4096, an initial learning rate of 0.001, and a weight decay of 0.01 are used. In ImageNet-1K fine-tuning, we train for 30 epochs with a batch size of 1024, a constant learning rate of 10^−5, and a weight decay of 10^−8.
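As a hedged sketch of the regular ImageNet-1K recipe above, covering only the optimizer and learning-rate schedule (the actual training additionally involves the augmentation and regularization strategies cited; all names here are illustrative):

```python
import math
import torch
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer_and_scheduler(model, epochs=300, warmup_epochs=20,
                                  base_lr=1e-3, weight_decay=0.05):
    """AdamW with linear warm-up followed by cosine decay, stepped once per epoch."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=weight_decay)

    def lr_lambda(epoch):
        if epoch < warmup_epochs:                                 # linear warm-up
            return (epoch + 1) / warmup_epochs
        progress = (epoch - warmup_epochs) / max(1, epochs - warmup_epochs)
        return 0.5 * (1.0 + math.cos(math.pi * progress))         # cosine decay toward 0

    return optimizer, LambdaLR(optimizer, lr_lambda)

opt, sched = build_optimizer_and_scheduler(torch.nn.Linear(8, 8))
```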
Results with regular ImageNet-1K training
Table 1(a) presents comparisons to other backbones, including both Transformer-based and ConvNet-based, using regular ImageNet-1K training.
Compared to the previous state-of-the-art Transformer-based architecture, i.e. DeiT [57], Swin Transformers noticeably surpass the counterpart DeiT architectures with similar complexities: +1.5% for Swin-T (81.3%) over DeiT-S (79.8%) using 224^2 input, and +1.5%/1.4% for Swin-B (83.3%/84.5%) over DeiT-B (81.8%/83.1%) using 224^2/384^2 input, respectively.
Compared with the state-of-the-art ConvNets, i.e. RegNet [44], the Swin Transformer achieves a slightly better speed-accuracy trade-off. Note that while RegNet [44] is obtained via a thorough architecture search, the Swin Transformer is manually adapted from a standard Transformer and has potential for further improvement.
Results with ImageNet-22K pre-training
We also pre-train the larger-capacity Swin-B and Swin-L on ImageNet-22K. Results fine-tuned on ImageNet-1K image classification are shown in Table 1(b). For Swin-B, the ImageNet-22K pre-training brings 1.8%∼1.9% gains over training on ImageNet-1K from scratch. Compared with the previous best results for ImageNet-22K pre-training, our models achieve significantly better speed-accuracy trade-offs: Swin-B obtains 86.4% top-1 accuracy, which is 2.4% higher than that of ViT with similar inference throughput (84.7 vs. 85.9 images/sec) and slightly lower FLOPs (47.0G vs. 55.4G). The larger Swin-L model achieves 87.3% top-1 accuracy, +0.9% better than that of the Swin-B model.
4.2. Object Detection on COCO
Settings
Object detection and instance segmentation experiments are conducted on COCO 2017, which contains 118K training, 5K validation and 20K test-dev images. An ablation study is performed using the validation set, and a system-level comparison is reported on test-dev. For the ablation study, we consider four typical object detection frameworks: Cascade Mask R-CNN [26, 6], ATSS [71], RepPoints v2 [12], and Sparse R-CNN [52] in mmdetection [10]. For these four frameworks, we utilize the same settings: multi-scale training [8, 52] (resizing the input such that the shorter side is between 480 and 800 while the longer side is at most 1333), AdamW [40] optimizer (initial learning rate of 0.0001, weight decay of 0.05, and batch size of 16), and 3x schedule (36 epochs). For system-level comparison, we adopt an improved HTC [9] (denoted as HTC++) with instaboost [20], stronger multi-scale training [7], 6x schedule (72 epochs), soft-NMS [5], and ImageNet-22K pre-trained model as initialization.
We compare our Swin Transformer to standard ConvNets, i.e. ResNe(X)t, and previous Transformer networks, e.g. DeiT. The comparisons are conducted by changing only the backbones with other settings unchanged. Note that while Swin Transformer and ResNe(X)t are directly applicable to all the above frameworks because of their hierarchical feature maps, DeiT only produces a single resolution of feature maps and cannot be directly applied. For fair comparison, we follow [73] to construct hierarchical feature maps for DeiT using deconvolution layers.
Comparison to ResNe(X)t
Table 2(a) lists the results of Swin-T and ResNet-50 on the four object detection frameworks. Our Swin-T architecture brings consistent +3.4∼4.2 box AP gains over ResNet-50, with slightly larger model size, FLOPs and latency.
Table 2(b) compares Swin Transformer and ResNe(X)t under different model capacity using Cascade Mask R-CNN. Swin Transformer achieves a high detection accuracy of 51.9 box AP and 45.0 mask AP, which are significant gains of +3.6 box AP and +3.3 mask AP over ResNeXt101-64x4d, which has similar model size, FLOPs and latency. On a higher baseline of 52.3 box AP and 46.0 mask AP using an improved HTC framework, the gains by Swin Transformer are also high, at +4.1 box AP and +3.1 mask AP (see Table 2(c)). Regarding inference speed, while ResNe(X)t is built by highly optimized Cudnn functions, our architecture is implemented with built-in PyTorch functions that are not all well-optimized. A thorough kernel optimization is beyond the scope of this paper.
Comparison to DeiT
The performance of DeiT-S using the Cascade Mask R-CNN framework is shown in Table 2(b). The results of Swin-T are +2.5 box AP and +2.3 mask AP higher than DeiT-S with similar model size (86M vs. 80M) and significantly higher inference speed (15.3 FPS vs. 10.4 FPS). The lower inference speed of DeiT is mainly due to its quadratic complexity to input image size.
Comparison to previous state-of-the-art
Table 2(c) compares our best results with those of previous state-of-the-art models. Our best model achieves 58.7 box AP and 51.1 mask AP on COCO test-dev, surpassing the previous best results by +2.7 box AP (Copy-paste [23] without external data) and +2.6 mask AP (DetectoRS [42]).
4.3. Semantic Segmentation on ADE20K
Settings
ADE20K [74] is a widely-used semantic segmentation dataset, covering a broad range of 150 semantic categories. It has 25K images in total, with 20K for training, 2K for validation, and another 3K for testing. We utilize UperNet [63] in mmseg [16] as our base framework for its high efficiency. More details are presented in the Appendix.
Results
Table 3 lists the mIoU, model size (#param), FLOPs and FPS for different method/backbone pairs. From these results, it can be seen that Swin-S is +5.3 mIoU higher (49.3 vs. 44.0) than DeiT-S with similar computation cost. It is also +4.4 mIoU higher than ResNet-101, and +2.4 mIoU higher than ResNeSt-101 [70]. Our Swin-L model with ImageNet-22K pre-training achieves 53.5 mIoU on the val set, surpassing the previous best model by +3.2 mIoU (50.3 mIoU by SETR [73] which has a larger model size).
4.4. Ablation Study
In this section, we ablate important design elements in the proposed Swin Transformer, using ImageNet-1K image classification, Cascade Mask R-CNN on COCO object detection, and UperNet on ADE20K semantic segmentation.
Shifted windows
Ablations of the shifted window approach on the three tasks are reported in Table 4. Swin-T with the shifted window partitioning outperforms the counterpart built on a single window partitioning at each stage by +1.1% top-1 accuracy on ImageNet-1K, +2.8 box AP/+2.2 mask AP on COCO, and +2.8 mIoU on ADE20K. The results indicate the effectiveness of using shifted windows to build connections among windows in the preceding layers. The latency overhead by shifted window is also small, as shown in Table 5.
Relative position bias
Table 4 shows comparisons of different position embedding approaches. Swin-T with relative position bias yields +1.2%/+0.8% top-1 accuracy on ImageNet-1K, +1.3/+1.5 box AP and +1.1/+1.3 mask AP on COCO, and +2.3/+2.9 mIoU on ADE20K in relation to those without position encoding and with absolute position embedding, respectively, indicating the effectiveness of the relative position bias. Also note that while the inclusion of absolute position embedding improves image classification accuracy (+0.4%), it harms object detection and semantic segmentation (-0.2 box/mask AP on COCO and -0.6 mIoU on ADE20K).
Different self-attention methods
The real speeds of different self-attention computation methods and implementations are compared in Table 5. Our cyclic implementation is more hardware efficient than naive padding, particularly for deeper stages. Overall, it brings a 13%, 18% and 18% speed-up on Swin-T, Swin-S and Swin-B, respectively.
The self-attention modules built on the proposed shifted window approach are 40.8×/2.5×, 20.2×/2.5×, 9.3×/2.1×, and 7.6×/1.8× more efficient than those of sliding windows in naive/kernel implementations on four network stages, respectively. Overall, the Swin Transformer architectures built on shifted windows are 4.1/1.5, 4.0/1.5, 3.6/1.5 times faster than variants built on sliding windows for Swin-T, Swin-S, and Swin-B, respectively. Table 6 compares their accuracy on the three tasks, showing that they are similarly accurate in visual modeling.
Compared to Performer [14], which is one of the fastest Transformer architectures (see [55]), the proposed shifted window based self-attention computation and the overall Swin Transformer architectures are slightly faster (see Table 5), while achieving +2.3% top-1 accuracy compared to Performer on ImageNet-1K using Swin-T (see Table 6).
5. Conclusion
This paper presents Swin Transformer, a new vision Transformer which produces a hierarchical feature representation and has linear computational complexity with respect to input image size. Swin Transformer achieves the state-of-the-art performance on COCO object detection and ADE20K semantic segmentation, significantly surpassing previous best methods. We hope that Swin Transformer’s strong performance on various vision problems will encourage unified modeling of vision and language signals.