This project aims to reproduce the results presented in the paper "WB-DETR: Transformer-Based Detector without Backbone" by Fanfan Liu, Haoran Wei, Wenzhe Zhao, Guozhen Li, Jingquan Peng, and Zihao Li [1].
WB-DETR (DETR-Based Detector without Backbone) is the first pure-transformer detector: it is composed only of an encoder and a decoder, without any CNN-based backbone. Instead of using a CNN to extract features, WB-DETR serializes the image directly and encodes the local features of the input into each individual token. In addition, to help WB-DETR compensate for the transformer's deficiency in modeling local information, a LIE-T2T (Local Information Enhancement Tokens-to-Token) module is designed to modulate the internal (local) information of each token after unfolding. Unlike traditional detectors, the backbone-free WB-DETR is more unified and neat. Experimental results demonstrate that WB-DETR, the first pure-transformer detector without a CNN, yields accuracy on par with the DETR[4] baseline and faster inference speed, with only half the number of parameters.
Following the success of transformers, they have been applied to object detection. In this area, unlike previous CNN-based works, DETR[4] was introduced as a transformer-based detector with a CNN backbone. Vision transformers[3], which perform sequence modeling over image patches, still underperform CNNs, because the simple tokenization of input images fails to model important local structures such as edges and lines. Tokens-to-Token Vision Transformer (T2T-ViT)[2] addresses this problem by recursively aggregating neighboring tokens into one token. However, in T2T the local information within each token and the information between adjacent tokens are still not modeled well. The paper therefore proposes the Local Information Enhancement T2T module, which not only reorganizes adjacent tokens but also applies attention over the channel dimension of each token to enhance local information.
The process of image to tokens: take an input image of size 512×512×3 as an example. First, the image is cut into 1024 patches of size 32×32×3. Then, each patch is reshaped into a one-dimensional vector. Finally, a trainable linear projection is applied to yield the required tokens.
They follow ViT to handle 2D images: the image is cut into fixed-size patches, each patch is flattened into a vector, and a trainable linear projection produces the tokens.
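The image-to-tokens step can be sketched in PyTorch. Note that the exact patching hyper-parameters are not given in the paper: a non-overlapping 32×32 cut of a 512×512 image would yield only 256 patches, so producing the 1024 patches mentioned above requires an overlapping split. The kernel/stride/padding below (32/16/8) and the embedding dimension are our assumptions.

```python
import torch
import torch.nn as nn

class ImageToTokens(nn.Module):
    """Cut an image into patches, flatten each patch, and linearly project it.
    Kernel/stride/padding and embed_dim are assumptions (overlapping
    32x32 patches at stride 16 reproduce the 1024-patch count above)."""
    def __init__(self, patch=32, stride=16, padding=8, in_ch=3, embed_dim=256):
        super().__init__()
        self.unfold = nn.Unfold(kernel_size=patch, stride=stride, padding=padding)
        self.proj = nn.Linear(patch * patch * in_ch, embed_dim)  # trainable projection

    def forward(self, x):                     # x: (B, 3, H, W)
        patches = self.unfold(x)              # (B, 3*32*32, num_patches)
        patches = patches.transpose(1, 2)     # (B, num_patches, 3072)
        return self.proj(patches)             # (B, num_patches, embed_dim)

tokens = ImageToTokens()(torch.randn(1, 3, 512, 512))
print(tokens.shape)  # torch.Size([1, 1024, 256])
```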
After the image-to-tokens process, they add positional encodings to the tokens so that they carry location information. The resulting sequence of embedding vectors then serves as input to the encoder, as shown above. Each encoder layer keeps a standard architecture consisting of a multi-head self-attention module and a feed-forward network (FFN). An LIE-T2T module is attached after each encoder layer to constitute the LIE-T2T encoder. The LIE-T2T module progressively reduces the length of the tokens and transforms the spatial structure of the image. Since they do not use any CNN-based backbone to extract image features, the image is serialized directly and its local information is encoded in each independent token.
Concretely, the LIE-T2T module calculates attention over the channel dimension of each token, separately for each token. The detailed iterative process of the LIE-T2T module is shown in Figure 5, and can be formulated as follows:
- $T = \mathrm{Unfold}(\mathrm{Reshape}(T_i))$
- $S = \mathrm{Sigmoid}(W_2 \cdot \mathrm{ReLU}(W_1 \cdot T))$
- $T_{i+1} = W_3 \cdot (T \cdot S)$
where Reshape reorganizes the token sequence back into a 2D image-like feature map, Unfold gathers overlapping patches of that map and flattens them into new tokens, and $W_1$, $W_2$, $W_3$ are the weights of linear layers.
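A minimal sketch of one LIE-T2T step following the three equations above; the class name, layer sizes, and reduction ratio are our assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class LIET2T(nn.Module):
    """One LIE-T2T step (sketch): Reshape -> Unfold -> channel attention -> project.
    Hyper-parameters here (reduction ratio, kernel/stride/padding) are assumptions."""
    def __init__(self, dim, out_dim, h, w, kernel=3, stride=2, padding=1, reduction=4):
        super().__init__()
        self.h, self.w = h, w
        self.unfold = nn.Unfold(kernel, padding=padding, stride=stride)
        c = dim * kernel * kernel                 # channel size of an unfolded token
        self.w1 = nn.Linear(c, c // reduction)    # W1 in S = Sigmoid(W2 ReLU(W1 T))
        self.w2 = nn.Linear(c // reduction, c)    # W2
        self.w3 = nn.Linear(c, out_dim)           # W3 in T_{i+1} = W3 (T * S)

    def forward(self, tokens):                    # tokens: (B, L, dim), L = h*w
        b, l, c = tokens.shape
        img = tokens.transpose(1, 2).reshape(b, c, self.h, self.w)  # Reshape
        t = self.unfold(img).transpose(1, 2)      # Unfold: (B, L', dim*k*k)
        s = torch.sigmoid(self.w2(torch.relu(self.w1(t))))  # per-token channel attention
        return self.w3(t * s)                     # modulate local info, then project

tokens = torch.randn(1, 1024, 64)                 # a 32x32 grid of 64-dim tokens
out = LIET2T(dim=64, out_dim=128, h=32, w=32)(tokens)
print(out.shape)  # torch.Size([1, 256, 128])
```

With stride 2, one step reduces the token length fourfold (1024 to 256), which is how the LIE-T2T encoder progressively shortens the sequence.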
T2T[2] aggregates the information of adjacent tokens through Reshape and Unfold operations. Building on T2T, LIE-T2T[1] additionally realizes local spatial attention over the reshaped tokens, enhancing the local information inside each token.
Additionally, as the length of tokens in the T2T module[2] is larger than in the normal case (16×16) in ViT[3], the MACs and memory usage are huge. To address this, the T2T module[2] sets the channel dimension of the T2T layer small (32 or 64) to reduce MACs, and optionally adopts an efficient transformer such as a Performer[5] layer to reduce memory usage under limited GPU memory.
Performers[5] are a new class of models that approximate transformers without running into the classic transformer bottleneck: the attention matrix has space and compute requirements quadratic in the input length, which limits how much input (text or images) the model can take. Performers get around this with a technique called Fast Attention Via positive Orthogonal Random features (FAVOR+). They are linear architectures fully compatible with regular transformers and come with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence, and low estimation variance. Performers provide provably accurate and practical estimation of regular (softmax) full-rank attention at only linear space and time complexity, without relying on priors such as sparsity or low-rankness, and they are the first linear architectures fully compatible (via small amounts of fine-tuning) with regular transformers.
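The idea behind FAVOR+ can be sketched with positive random features for the softmax kernel: attention is computed as φ(Q)(φ(K)ᵀV), never materializing the L×L attention matrix. This is a simplified sketch using plain i.i.d. Gaussian features rather than the orthogonal features of the actual FAVOR+ mechanism:

```python
import torch

def softmax_kernel_features(x, w):
    """Positive random features approximating the softmax kernel (FAVOR+-style
    sketch; i.i.d. Gaussian w, not orthogonal). x: (L, d), w: (m, d) with rows
    ~ N(0, I). Returns a non-negative (L, m) feature map."""
    x = x / x.shape[-1] ** 0.25                  # fold in the 1/sqrt(d) softmax scaling
    proj = x @ w.T                               # (L, m) random projections
    sq = (x ** 2).sum(-1, keepdim=True) / 2.0    # ||x||^2 / 2 stabilizing term
    return torch.exp(proj - sq) / w.shape[0] ** 0.5

def linear_attention(q, k, v, w):
    """Attention in O(L): phi(Q) @ (phi(K)^T V), never forming the L x L matrix."""
    qp, kp = softmax_kernel_features(q, w), softmax_kernel_features(k, w)
    num = qp @ (kp.T @ v)                        # (L, d_v)
    den = qp @ kp.sum(dim=0, keepdim=True).T     # (L, 1) row normalizer
    return num / den
```

Because the rows are normalized exactly, feeding v = 1 returns 1 regardless of the random features; the quality of the softmax approximation itself improves as the number of random features m grows.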
In our interpretation, the original DETR code is mainly used for transformer decoder model initialization and forward pass, model parallelization, training, evaluation, data reading, data augmentation, etc. All of our changes are applied to the backbone part of the code. The token-transformer and token-performer parts are taken from the T2T-ViT code, and our image-to-tokens and LIE-T2T implementations are inspired by the same code. In Section 3.2 of the paper, the positional encoding is introduced as a 1D process, in contrast to the 2D positional embedding in DETR; we therefore adapted the positional encoding code of DETR[4] to 1D.
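Our 1D adaptation of DETR's sine positional encoding can be sketched as follows (the function name and temperature default are illustrative):

```python
import torch

def positional_encoding_1d(num_tokens, dim, temperature=10000.0):
    """1D sine/cosine positional encoding (sketch of our DETR adaptation).
    Returns a (num_tokens, dim) tensor added to the token embeddings."""
    pos = torch.arange(num_tokens, dtype=torch.float32).unsqueeze(1)  # (L, 1)
    i = torch.arange(dim // 2, dtype=torch.float32)                   # (dim/2,)
    freq = temperature ** (2.0 * i / dim)                             # geometric frequencies
    angles = pos / freq                                               # (L, dim/2)
    pe = torch.zeros(num_tokens, dim)
    pe[:, 0::2] = torch.sin(angles)   # even channels: sine
    pe[:, 1::2] = torch.cos(angles)   # odd channels: cosine
    return pe

pe = positional_encoding_1d(1024, 256)   # one encoding per token
```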
The paper does not specify the kernel size, padding, or stride of any Unfold layer. We assume a kernel size of 3 and padding of 1, since these are the most common options in a backbone; however, a larger kernel size may act like a deeper network and could create a performance difference with better or worse results. The stride is set to 2 for the first M layers and 1 for the rest, with M = 5 chosen to match a step size of 32. A lower step size causes exponential GPU memory usage in our case, leading to CUDA out-of-memory errors, while a higher step size causes exponentially worse detection performance.
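The choice M = 5 follows from simple stride arithmetic: with the assumed kernel 3, padding 1, stride 2, each Unfold halves the spatial size, so five stride-2 layers give an overall step of 2^5 = 32:

```python
# Assumed Unfold hyper-parameters (not stated in the paper):
# kernel 3, padding 1, stride 2 for the first M = 5 layers.
def spatial_after_unfold(size, kernel=3, padding=1, stride=2):
    # number of sliding-window positions along one spatial dimension
    return (size + 2 * padding - kernel) // stride + 1

size = 512
for _ in range(5):          # M = 5 stride-2 unfolds
    size = spatial_after_unfold(size)
print(size, 512 // size)    # 16 32 -> effective step size of 32
```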
The paper introduces the experimental setup as follows: the main settings and training strategy of WB-DETR mainly follow DETR[4] for better comparison. All transformer weights are initialized with Xavier init, and the model has no pre-training on any external dataset. By default, models are trained for 500 epochs with a 10× learning-rate drop at the 400th epoch. WB-DETR is optimized with Adam, using a base learning rate of 1e−4 and a weight decay of 0.001. The paper states a batch size of 32, training on 16 V100 GPUs with 4 images per GPU. Standard data augmentations such as random resizing, color jittering, and random flipping are used to overcome overfitting. The transformer is trained with a default dropout of 0.1. The number of decoding layers is fixed at 6, and performance is reported for different encoder layer numbers N and K: when N and K are n and k, the corresponding model is named WB-DETR(n-k).
We have tried to follow the same experimental setup as the paper, except for the batch size. Since we used a TUBITAK TRUBA instance for intensive training, we used 8 NVIDIA A100 GPUs with 12 images per GPU. The batch size in the paper's setup section is not clear: it first states a batch size of 32, then 4 images on each of 16 GPUs (i.e., 64). In any case, we were able to use a higher batch size (96) for faster convergence in the limited time available.
There are no extra compiled components in WB-DETR and package dependencies are minimal, so the code is very simple to use. We provide instructions on how to install dependencies via conda. First, clone the repository locally, then install PyTorch 1.5+ and torchvision 0.6+:
conda install -c pytorch pytorch torchvision
Install pycocotools (for evaluation on COCO) and scipy (for training):
conda install cython scipy
pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'
That's it; you should be good to train and evaluate detection models.
Download and extract COCO 2017 train and val images with annotations from http://cocodataset.org. We expect the directory structure to be the following:
path/to/coco/
  annotations/  # annotation json files
  train2017/    # train images
  val2017/      # val images
To train the baseline WB-DETR on a single node with 8 GPUs for 500 epochs, run:
python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --cfg config.yaml
To evaluate WB-DETR(2-8) on COCO val5k with a single GPU, run:
python main.py --batch_size 2 --eval --resume checkpoint.pth --cfg config.yaml
Our code can run any (N-K) pair in the WB-DETR(N-K) experiments. However, due to limited computational resources, only two experiments were done: WB-DETR(0-4) for ~350 epochs and WB-DETR(2-8) for ~500 epochs.
We provide baseline WB-DETR models. AP is computed on COCO 2017 val5k.
| # | name | patch size | step size | epochs | AP 0.5:0.95 (%) | AP 0.5 (%) | url |
|---|---|---|---|---|---|---|---|
| 0 | WB-DETR(2-8) paper | 32 | 32 | 500 | 33.9 | 61.0 | |
| 1 | WB-DETR(2-8) ours | 32 | 32 | 500 | 22.3 | 38.0 | model, logs |
| 2 | WB-DETR(0-4) ours | 32 | 32 | 341 | 14.4 | 27.0 | model, logs |
Here is the comparison of the WB-DETR(2-8) (LIE-T2T) and WB-DETR(0-4) (T2T) models, plotted as four training curves: the AP metric, the AP_50 metric, the classification error, and the loss.
According to the results, even though comparing models with different numbers of layers and epochs is not entirely fair, LIE-T2T outperforms the original T2T by enhancing local information. For LIE-T2T, the learning rate is dropped at the 400th epoch, which increases performance even further.
Our results are worse than those reported in the paper. There may be several differences between the original method and our interpretation, such as the kernel sizes, padding, or stride of the Unfold layers. It is also not clear whether the original method uses transformer or Performer layers; we used a Performer in the LIE-T2T encoder because of hardware limitations. We ran our experiments with a step size of 32, because a lower step size causes exponential GPU memory usage in our case (leading to CUDA out-of-memory errors), while a higher step size causes exponentially worse detection performance.
- [1] F. Liu, H. Wei, W. Zhao, G. Li, J. Peng and Z. Li, "WB-DETR: Transformer-Based Detector without Backbone," 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 2959-2967, doi: 10.1109/ICCV48922.2021.00297.
- [2] Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zihang Jiang, Francis E. H. Tay, Jiashi Feng, Shuicheng Yan, "Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet," arXiv:2101.11986, 2021, doi: 10.48550/ARXIV.2101.11986.
- [3] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby, "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale," arXiv:2010.11929, 2020, doi: 10.48550/ARXIV.2010.11929.
- [4] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko, "End-to-End Object Detection with Transformers," arXiv:2005.12872, 2020, doi: 10.48550/arXiv.2005.12872.
- [5] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, Adrian Weller, "Rethinking Attention with Performers," arXiv:2009.14794, 2020, doi: 10.48550/ARXIV.2009.14794.