---TRANSFORMER Paper Collection 03-Conv Transformer
1-LeViT a Vision Transformer in ConvNet’s Clothing for Faster Inference
2-Incorporating Convolution Designs into Visual Transformers
3-Conformer Local Features Coupling Global Representations for Visual Recognition
4-Co-Scale Conv-Attentional Image Transformers
5-Introducing Convolutions to Vision Transformers
6-MOBILEVIT LIGHT-WEIGHT, GENERAL-PURPOSE, AND MOBILE-FRIENDLY VISION TRANSFORMER
7-Mobile-Former Bridging MobileNet and Transformer
8-TinyViT Fast Pretraining Distillation for Small Vision Transformers
9-ParC-Net Position Aware Circular Convolution with Merits from ConvNets and Transformer
10-How to Train Vision Transformer on Small-scale Datasets
11-Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets
12-Inception Transformer
13-Accumulated Trivial Attention Matters in Vision Transformers on Small Datasets
14-DESIGNING BERT FOR CONVOLUTIONAL NETWORKS SPARSE AND HIERARCHICAL MASKED MODELING
14-MOAT ALTERNATING MOBILE CONVOLUTION AND ATTENTION BRINGS STRONG VISION MODELS
15-InternImage Exploring Large-Scale Vision Foundation Models with Deformable Convolutions
16-PSLT A Light-weight Vision Transformer with Ladder Self-Attention and Progressive Shift
---TRANSFORMER Paper Collection 04-Training Transformer
01-Generative Pretraining from Pixels
02-Learning Transferable Visual Models From Natural Language Supervision
03-An Empirical Study of Training Self-Supervised Vision Transformers
04-Emerging Properties in Self-Supervised Vision Transformers
05-Efficient Training of Visual Transformers with Small Datasets
06-Revitalizing CNN Attentions via Transformers in Self-Supervised Visual Representation Learning
07-Masked Self-Supervised Transformer for Visual Representation
08-BERT Pre-Training of Image Transformers
09-IMAGE BERT PRE-TRAINING WITH ONLINE TOKENIZER
10-Automated Progressive Learning for Efficient Training of Vision Transformers
11-Masked Autoencoders Are Scalable Vision Learners
12-A Simple Framework for Masked Image Modeling
13-Patch-level Representation Learning for Self-supervised Vision Transformers
14-Towards Liberating Vision Transformers from Pre-training
15-Attend to Mix for Vision Transformers
16-A General Framework for Self-supervised Learning in Speech, Vision and Language
17-Self-supervised Models are Good Teaching Assistants for Vision Transformers
18-Position Prediction as an Effective Pretraining Strategy
19-Visual Transformer Meets CutMix for Improved Accuracy, Communication Efficiency, and Data Privacy in Split Learning
20-Bootstrapped Masked Autoencoders for Vision BERT Pretraining
21-Rethinking Image Mixing for Data Augmentation in Vision Transformers
22-Locality Guidance for Improving Vision Transformers on Tiny Datasets
23-Improving Vision Transformers by Revisiting High-frequency Components
24-What to Hide from Your Students Attention-Guided Masked Image Modeling
25-Self-supervision meets Language-Image Pre-training
26-Multi-choice Discretization for Image BERT Pre-training
27-Scalable Learning to Optimize A Learned Optimizer Can Train Big Models
28-TokenMixup Efficient Attention-guided Token-level Data Augmentation for Transformers
29-Green Hierarchical Vision Transformer for Masked Image Modeling
30-MIXPRO DATA AUGMENTATION WITH MASKMIX AND PROGRESSIVE ATTENTION LABELING FOR VISION TRANSFORMER
31-MASKED IMAGE MODELING WITH DENOISING CONTRAST
32-MASKED FREQUENCY MODELING FOR SELF-SUPERVISED VISUAL PRE-TRAINING
33-Pre-training Vision Transformers with Sinusoidal Waves
34-Learning Visual Representations via Language-Guided Sampling
35-DisCo-CLIP A Distributed Contrastive Loss for Memory Efficient CLIP Training
36-Masked Self-Distillation Advances Contrastive Language-Image Pretraining
37-MAsked Generative Encoder to Unify Representation Learning and Image Synthesis
38-Mixed and Masked Autoencoder for Efficient Pretraining of Hierarchical Vision Transformers
39-Integrally Pre-Trained Transformer Pyramid Networks
40-DropKey
41-One Model for All Patch Sizes
42-Image-and-Language Understanding from Pixels Only
43-Masked Autoencoders Enable Efficient Knowledge Distillers
44-Hard Patches Mining for Masked Image Modeling
45-Stare at What You See Masked Image Modeling without Reconstruction
46-RILS Masked Visual Reconstruction in Language Semantic Space
47-Revisiting Multimodal Representation in Contrastive Learning From Patch and Token Embeddings to Finite Discrete Tokens
48-Reproducible scaling laws for contrastive language-image learning
49-Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training
50-Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information
51-Stitchable Neural Networks
52-A Closer Look at Self-Supervised Lightweight Vision Transformers
53-Can We Scale Transformers to Predict Parameters of Diverse ImageNet Models
54-Architecture-Agnostic Masked Image Modeling – From ViT back to CNN
55-Patch-level Contrastive Learning via Positional Query for Visual Pretraining
56-DreamTeacher Pretraining Image Backbones with Deep Generative Models
---TRANSFORMER Paper Collection 05-Robustness Transformer
01-Adversarial Robustness Comparison of Vision Transformer and MLP-Mixer to CNNs
02-Are Transformers More Robust Than CNNs
03-Vision Transformers are Robust Learners
04-Towards Transferable Adversarial Attacks on Vision Transformers
05-MIA-Former Efficient and Robust Vision Transformers via Multi-grained Input Adaptation
06-PATCH-FOOL ARE VISION TRANSFORMERS ALWAYS ROBUST AGAINST ADVERSARIAL PERTURBATIONS
07-Certified Patch Robustness via Smoothed Vision Transformers
08-Towards Robust Vision Transformer
09-Visual Attention Emerges from Recurrent Sparse Reconstruction
10-Understanding The Robustness in Vision Transformers
11-Robustifying Vision Transformer without Retraining from Scratch by Test-Time Class-Conditional Feature Alignment
12-Understanding Adversarial Robustness of Vision Transformers via Cauchy Problem
13-ViP Unified Certified Detection and Recovery for Patch Attack with Vision Transformers
14-When Adversarial Training Meets Vision Transformers Recipes from Training to Architecture
15-Optimizing Relevance Maps of Vision Transformers Improves Robustness
16-CAN CNNS BE MORE ROBUST THAN TRANSFORMERS
17-DENOISING MASKED AUTOENCODERS HELP ROBUST CLASSIFICATION
18-Transferable Adversarial Attacks on Vision Transformers with Token Gradient Regularization
---TRANSFORMER Paper Collection 06-Model Compression Transformer
01-UNIFIED VISUAL TRANSFORMER COMPRESSION
2-MiniViT Compressing Vision Transformers with Weight Multiplexing
3-SPViT Enabling Faster Vision Transformers via Latency-aware Soft Token Pruning
4-Patch Similarity Aware Data-Free Quantization for Vision Transformers
5-Q-ViT Accurate and Fully Quantized Low-bit Vision Transformer
6-VTC-LFC Vision Transformer Compression with Low-Frequency Components
7-PSAQ-ViT V2 Towards Accurate and General Data-Free Quantization for Vision Transformers
8-Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers
9-Pushing Binary Vision Transformers Towards Convolutional Models
11-UPop Unified and Progressive Pruning for Compressing Vision-Language Transformers
---TRANSFORMER Paper Collection 07-Miscellaneous
01-On Layer Normalization in the Transformer Architecture
02-UPop Unified and Progressive Pruning for Compressing Vision-Language Transformers
03-Linear Transformers Are Secretly Fast Weight Programmers
04-Attention Is All You Need
05-Transformer with Dual Residual Connections
06-Universal Language Model Fine-tuning for Text Classification
07-Pre-training of Deep Bidirectional Transformers for Language Understanding
08-Improving Language Understanding by Generative Pre-Training
09-A Survey on Efficient Training of Transformers
10-FlashAttention Fast and Memory-Efficient Exact Attention with IO-Awareness
11-Asynchronous Methods for Deep Reinforcement Learning
12-Harnessing the Power of LLMs in Practice A Survey on ChatGPT and Beyond
13-Efficient Transformers A Survey
14-CRAMMING TRAINING A LANGUAGE MODEL ON A SINGLE GPU IN ONE DAY
15-LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS
16-Scaling Down to Scale Up A Guide to Parameter-Efficient Fine-Tuning
17-Pythia A Suite for Analyzing Large Language Models Across Training and Scaling
18-Training Compute-Optimal Large Language Models
19-Scaling Language Models Methods, Analysis & Insights from Training Gopher
20-Constitutional AI Harmlessness from AI Feedback
21-SELF-INSTRUCT Aligning Language Models with Self-Generated Instructions
22-Fine-Tuning Language Models from Human Preferences
23-Learning to summarize from human feedback
24-BART Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
25-Training language models to follow instructions with human feedback
26-TOKEN MERGING YOUR VIT BUT FASTER
27-A Fast Post-Training Pruning Framework for Transformers
28-Swin Transformer Hierarchical Vision Transformer using Shifted Windows
29-AN IMAGE IS WORTH 16X16 WORDS TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE
---TRANSFORMER Paper Collection 1-General Vision Transformer
01-AN IMAGE IS WORTH 16X16 WORDS TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE
02-General Perception with Iterative Attention
03-Rethinking Spatial Dimensions of Vision Transformers
05-Pyramid Vision Transformer A Versatile Backbone for Dense Prediction without Convolutions
06-Rethinking and Improving Relative Position Encoding for Vision Transformer
07-Going deeper with Image Transformers
08-Swin Transformer Hierarchical Vision Transformer using Shifted Windows
09-Tokens-to-Token ViT Training Vision Transformers from Scratch on ImageNet
10-DPT Deformable Patch-based Transformer for Visual Recognition
11-Focal Self-attention for Local-Global Interactions in Vision Transformers
12-Twins Revisiting the Design of Spatial Attention in Vision Transformers
13-Blending Anti-Aliasing into Vision Transformer
14-Not All Images are Worth 16x16 Words Dynamic Transformers for Efficient Image Recognition
15-Transformer in Transformer
16-Vision Transformer Advanced by Exploring Intrinsic Inductive Bias
17-DeepViT Towards Deeper Vision Transformer
18-All Tokens Matter Token Labeling for Training Better Vision Transformers
19-Less is More Pay Less Attention in Vision Transformers
20-DYNAMIC TOKEN NORMALIZATION IMPROVES VISION TRANSFORMER
21-REGIONVIT REGIONAL-TO-LOCAL ATTENTION FOR VISION TRANSFORMERS
22-CROSSFORMER A VERSATILE VISION TRANSFORMER HINGING ON CROSS-SCALE ATTENTION
23-CSWin Transformer A General Vision Transformer Backbone with Cross-Shaped Windows
24-MPViT Multi-Path Vision Transformer for Dense Prediction
25-The Principle of Diversity Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy
26-Beyond Fixation Dynamic Window Visual Transformer
27-MixFormer Mixing Features across Windows and Dimensions
28-Vision Transformer with Deformable Attention
29-Swin Transformer V2 Scaling Up Capacity and Resolution
30-MSG-Transformer Exchanging Local Spatial Information by Manipulating Messenger Tokens
31-Nominate Synergistic Context in Vision Transformer
32-Shunted Self-Attention via Multi-Scale Token Aggregation
33-Improved Transformer-in-Transformer Baselines with Pyramid Architecture
34-Object-aware Mixing Layer for Vision Transformers
35-Unified Normalization for Accelerating and Stabilizing Transformers
36-Wave-ViT Unifying Wavelet and Transformers for Visual Representation Learning
37-Dual Attention Vision Transformers
38-Multi-Axis Vision Transformer
39-Learning Varied-Size Window Attention in Vision Transformers
40-Fast Vision Transformers with HiLo Attention
41-GPVIT A HIGH RESOLUTION NON-HIERARCHICAL VISION TRANSFORMER WITH GROUP PROPAGATION
42-CONDITIONAL POSITIONAL ENCODINGS FOR VISION TRANSFORMERS
43-LIPSFORMER INTRODUCING LIPSCHITZ CONTINUITY TO VISION TRANSFORMERS
44-BiFormer Vision Transformer with Bi-Level Routing Attention
45-Top-Down Visual Attention from Analysis by Synthesis
46-Visual Dependency Transformers Dependency Tree Emerges from Reversed Attention
47-ResFormer Scaling ViTs with Multi-Resolution Training
48-Vision Transformer with Super Token Sampling
49-PaCa-ViT Learning Patch-to-Cluster Attention in Vision Transformers
50-Global Context Vision Transformers
51-Foundation Transformers
52-Scale-Aware Modulation Meet Transformer
53-CrossFormer A Versatile Vision Transformer Hinging on Cross-scale Attention
54-Vision Transformer with Quadrangle Attention
---TRANSFORMER Paper Collection 2-Efficient Vision Transformer
1-Training data-efficient image transformers & distillation through attention
2-ConViT Improving Vision Transformers with Soft Convolutional Inductive Biases
3-Scalable Vision Transformers with Hierarchical Pooling
4-CrossViT Cross-Attention Multi-Scale Vision Transformer for Image Classification
5-Multi-Scale Vision Longformer A New Vision Transformer for High-Resolution Image Encoding
6-Visformer The Vision-friendly Transformer
7-Multi-Exit Vision Transformer for Dynamic Inference
8-Chasing Sparsity in Vision Transformers An End-to-End Exploration
9-Dynamic Grained Encoder for Vision Transformers
10-Glance-and-Gaze Vision Transformer
11-DynamicViT Efficient Vision Transformers with Dynamic Token Sparsification
12-ResT An Efficient Transformer for Visual Recognition
13-SOFT Softmax-free Transformer with Linear Complexity
14-Evo-ViT Slow-Fast Token Evolution for Dynamic Vision Transformer
15-Pale Transformer A General Vision Transformer Backbone with Pale-Shaped Attention
16-When Shift Operation Meets Vision Transformer An Extremely Simple Alternative to Attention Mechanism
17-NOT ALL PATCHES ARE WHAT YOU NEED EXPEDITING VISION TRANSFORMERS VIA TOKEN REORGANIZATIONS
18-QUADTREE ATTENTION FOR VISION TRANSFORMERS
19-ANTI-OVERSMOOTHING IN DEEP VISION TRANSFORMERS VIA THE FOURIER DOMAIN ANALYSIS - FROM THEORY TO PRACTICE
20-Learned Queries for Efficient Local Attention
21-Lite Vision Transformer with Enhanced Self-Attention
22-A-ViT Adaptive Tokens for Efficient Vision Transformer
23-Reversible Vision Transformers
24-Adaptive Token Sampling For Efficient Vision Transformers
25-EdgeViTs Competing Light-weight CNNs on Mobile Devices with Vision Transformers
26-Sliced Recursive Transformer
27-Self-slimmed Vision Transformer
28-M3ViT Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design
29-ResT V2 Simpler, Faster and Stronger
30-EfficientFormer Vision Transformers at MobileNet Speed
31-GhostNetV2 Enhance Cheap Operation with Long-Range Attention
32-Peeling the Onion Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training
33-TOKEN MERGING YOUR VIT BUT FASTER
34-HiViT Hierarchical Vision Transformer Meets Masked Image Modeling
35-Making Vision Transformers Efficient from A Token Sparsification View
36-SparseViT Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer
37-Slide-Transformer Hierarchical Vision Transformer with Local Self-Attention
38-RIFormer Keep Your Vision Backbone Effective But Removing Token Mixer
39-EfficientViT Memory Efficient Vision Transformer with Cascaded Group Attention
40-Castling-ViT Compressing Self-Attention via Switching Towards Linear-Angular Attention at Vision Transformer Inference
41-RGB no more Minimally-decoded JPEG Vision Transformers
42-Learned Thresholds Token Merging and Pruning for Vision Transformers