# LLM Quantization

## Vision Transformers

| Submitted | Quantization | Title | Links | Keys | Model |
| --- | --- | --- | --- | --- | --- |
| 2023.07.01 | VVTQ | Variation-aware Vision Transformer Quantization | arXiv, GitHub | QAT | DeiT, SReT, Swin |
| 2023.05.21 | Bi-ViT | Bi-ViT: Pushing the Limit of Vision Transformer Quantization | arXiv | | |
| 2023.05.18 | GPUSQ-ViT | Boost Vision Transformer with GPU-Friendly Sparsity and Quantization | arXiv | | |
| 2023.05.11 | PMQ | Patch-wise Mixed-Precision Quantization of Vision Transformer | arXiv | | |
| 2023.04.01 | Q-DETR | Q-DETR: An Efficient Low-Bit Quantized Detection Transformer | arXiv | | |
| 2023.03.23 | SQ-ViT | Scaled Quantization for the Vision Transformer | arXiv | PTQ | |
| 2023.03.22 | Q-HyViT | Q-HyViT: Post-Training Quantization for Hybrid Vision Transformer with Bridge Block Reconstruction | arXiv | PTQ | |
| 2023.02.04 | OFQ | Oscillation-free Quantization for Low-bit Vision Transformers | arXiv, GitHub | | DeiT, Swin |
| 2022.12.16 | RepQ-ViT | RepQ-ViT: Scale Reparameterization for Post-Training Quantization of Vision Transformers | arXiv | PTQ | |
| 2022.11.29 | NoisyQuant | NoisyQuant: Noisy Bias-Enhanced Post-Training Activation Quantization for Vision Transformers | arXiv | PTQ | |
| 2022.11.17 | CPT-V | CPT-V: A Contrastive Approach to Post-Training Quantization of Vision Transformers | arXiv | | |
| 2022.10.13 | Q-ViT | Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer | arXiv, GitHub | | DeiT |
| 2022.10.10 | APQ-ViT | Towards Accurate Post-Training Quantization for Vision Transformer | arXiv | | |
| 2022.09.13 | PSAQ-ViT V2 | PSAQ-ViT V2: Towards Accurate and General Data-Free Quantization for Vision Transformers | arXiv, GitHub | | DeiT, Swin |
| 2022.07.04 | I-ViT | I-ViT: Integer-only Quantization for Efficient Vision Transformer Inference | arXiv | | |
| 2022.03.04 | PSAQ-ViT | Patch Similarity Aware Data-Free Quantization for Vision Transformers | arXiv, GitHub | PTQ | DeiT, Swin |
| 2022.01.19 | Q-ViT | Q-ViT: Fully Differentiable Quantization for Vision Transformer | arXiv | | |
| 2021.11.27 | FQ-ViT | FQ-ViT: Post-Training Quantization for Fully Quantized Vision Transformer | arXiv, GitHub | PTQ | DeiT, ViT, Swin |
| 2021.11.24 | PTQ4ViT | PTQ4ViT: Post-Training Quantization for Vision Transformers with Twin Uniform Quantization | arXiv, GitHub | PTQ | ViT, DeiT, Swin |
| 2021.05.27 | ViT-quant | Post-Training Quantization for Vision Transformer | arXiv | PTQ | |
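
The Keys column distinguishes post-training quantization (PTQ), which calibrates quantization parameters on an already-trained model, from quantization-aware training (QAT), which simulates rounding during training so the weights adapt to it. For orientation, below is a minimal sketch of the uniform affine quantizer that most of these methods build on; it is illustrative only, the function names are our own, and it is not code from any listed paper.

```python
import numpy as np

def uniform_affine_quantize(x: np.ndarray, num_bits: int = 8):
    """Uniform affine (asymmetric) quantization of a float tensor.

    Generic baseline only: the PTQ papers above differ mainly in how they
    calibrate the scale/zero-point and handle ViT-specific activation
    distributions (e.g., post-Softmax and post-GELU), not in this mapping.
    """
    qmin, qmax = 0, 2 ** num_bits - 1
    x_min, x_max = float(x.min()), float(x.max())
    scale = max(x_max - x_min, 1e-8) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    zero_point = min(max(zero_point, qmin), qmax)
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover approximate float values: x_hat = (q - z) * s."""
    return (q.astype(np.float32) - zero_point) * scale

# PTQ picks (scale, zero_point) from a small calibration set after training;
# QAT instead applies this fake-quantization inside the training loop.
x = np.random.randn(4, 4).astype(np.float32)
q, s, z = uniform_affine_quantize(x)
print(np.abs(dequantize(q, s, z) - x).max())  # worst-case rounding error
```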

## Language Transformers

| Submitted | Last Revised | Quantization | Title | Links | Keys |
| --- | --- | --- | --- | --- | --- |
| 2023.06.13 | | SqueezeLLM | Dense-and-Sparse Quantization | arXiv, GitHub | PTQ |
| 2023.06.05 | | SpQR | A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression | arXiv, GitHub | PTQ |
| 2023.06.04 | 2023.06.13 | OWQ | Lessons learned from activation outliers for weight quantization in large language models | arXiv, GitHub | PTQ |
| 2023.06.01 | | AWQ | Activation-aware Weight Quantization for LLM Compression and Acceleration | arXiv, GitHub | PTQ |
| 2023.05.30 | | PreQuant | A Task-agnostic Quantization Approach for Pre-trained Language Models | arXiv | |
| 2023.05.29 | | LLM-QAT | Data-Free Quantization Aware Training for Large Language Models | arXiv | |
| 2023.05.23 | | QLoRA | Efficient Finetuning of Quantized LLMs | arXiv, GitHub | QAT |
| 2023.05.23 | | PEQA | Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization | arXiv | |
| 2023.05.15 | 2023.05.26 | ZeroQuant-V2 | Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation | arXiv, GitHub | PTQ |
| 2023.04.18 | | Outlier Suppression+ | Accurate quantization of large language models by equivalent and optimal shifting and scaling | arXiv | PTQ |
| 2023.04.03 | 2023.05.17 | RPTQ | Reorder-based Post-training Quantization for Large Language Models | arXiv, GitHub | PTQ |
| 2022.11.18 | 2023.06.05 | SmoothQuant | Accurate and Efficient Post-Training Quantization for Large Language Models | arXiv, GitHub | PTQ |
| 2022.10.31 | 2023.03.22 | GPTQ | Accurate Post-Training Quantization for Generative Pre-trained Transformers | arXiv, GitHub | PTQ |
| 2022.09.27 | 2023.02.21 | Outlier Suppression | Pushing the Limit of Low-bit Transformer Language Models | arXiv, GitHub | PTQ |
| 2022.08.15 | 2022.11.10 | LLM.int8() | 8-bit Matrix Multiplication for Transformers at Scale | arXiv, GitHub | |
| 2022.06.20 | 2023.04.15 | LUT-GEMM | Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models | arXiv | |
| 2022.06.04 | | ZeroQuant | Efficient and Affordable Post-Training Quantization for Large-Scale Transformers | arXiv, GitHub | PTQ |
| 2022.05.25 | 2022.10.02 | BiT | Robustly Binarized Multi-distilled Transformer | arXiv, GitHub | Extreme |
| 2022.05.21 | 2022.07.16 | | Compression of Generative Pre-trained Language Models via Quantization | arXiv | |
| 2022.03.12 | | BiBERT | Accurate Fully Binarized BERT | arXiv, GitHub | Extreme |
| 2021.09.30 | | MREM | Towards Efficient Post-training Quantization of Pre-trained Language Models | arXiv | PTQ |
| 2021.09.27 | | PEG-PTQ | Understanding and Overcoming the Challenges of Efficient Transformer Quantization | arXiv, GitHub | |
| 2021.06.02 | | SPIQA | On the Distribution, Sparsity, and Inference-time Quantization of Attention Values in Transformers | arXiv, GitHub | |
| 2021.01.15 | | KDLSQ-BERT | A Quantized Bert Combining Knowledge Distillation with Learned Step Size Quantization | arXiv | |
| 2021.01.05 | 2021.05.08 | I-BERT | Integer-only BERT Quantization | arXiv, GitHub | |
| 2020.12.31 | 2021.07.22 | BinaryBERT | Pushing the Limit of BERT Quantization | arXiv, GitHub | Extreme |
| 2020.09.27 | 2020.10.10 | TernaryBERT | Distillation-aware Ultra-low Bit BERT | arXiv, GitHub | Extreme |
| 2020.09.17 | 2020.09.18 | | Towards Fully 8-bit Integer Inference for the Transformer Model | arXiv | |
| 2020.09.16 | 2020.10.13 | | Extremely Low Bit Transformer Quantization for On-Device Neural Machine Translation | arXiv | |
| 2020.05.08 | 2020.09.27 | GOBO | Quantizing Attention-Based NLP Models for Low Latency and Energy Efficient Inference | arXiv | |
| 2019.10.14 | 2019.10.17 | Q8BERT | Quantized 8Bit BERT | arXiv, GitHub | |
| 2019.09.12 | 2019.09.25 | Q-BERT | Hessian Based Ultra Low Precision Quantization of BERT | arXiv | |
| 2019.06.03 | 2019.06.07 | | Efficient 8-Bit Quantization of Transformer Neural Machine Language Translation Model | arXiv | |
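
Several of the weight-only PTQ entries above (GPTQ, AWQ, OWQ, SpQR, SqueezeLLM) start from the same group-wise round-to-nearest (RTN) baseline and improve it in different ways: GPTQ compensates rounding error with second-order information, AWQ rescales salient channels using activation statistics, and SpQR/SqueezeLLM keep outlier weights in higher precision. A minimal sketch of that shared baseline is below; the helper names are hypothetical and this is not code from any of the listed repositories.

```python
import numpy as np

def rtn_groupwise_quantize(w: np.ndarray, num_bits: int = 4, group_size: int = 128):
    """Symmetric round-to-nearest weight quantization with per-group scales.

    Each group of `group_size` consecutive weights shares one scale, which
    limits how far a single outlier weight can stretch the quantization grid.
    """
    assert w.size % group_size == 0, "pad weights so groups divide evenly"
    qmax = 2 ** (num_bits - 1) - 1              # e.g., 7 for 4-bit symmetric
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    scales = np.maximum(scales, 1e-8)           # avoid division by zero
    q = np.clip(np.round(groups / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def rtn_dequantize(q: np.ndarray, scales: np.ndarray, shape) -> np.ndarray:
    """Reconstruct approximate float weights for use in an ordinary matmul."""
    return (q.astype(np.float32) * scales).reshape(shape)

w = np.random.randn(256, 256).astype(np.float32)
q, s = rtn_groupwise_quantize(w)
w_hat = rtn_dequantize(q, s, w.shape)
print(np.abs(w_hat - w).max())  # reconstruction error of the RTN baseline
```

Smaller groups cost a little extra memory for scales but shrink the rounding error, which is why group sizes around 128 are a common trade-off in this line of work.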
