2023.06.13 | | SqueezeLLM | Dense-and-Sparse Quantization |  | PTQ |
2023.06.05 | | SpQR | A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression |  | PTQ |
2023.06.04 | 2023.06.13 | OWQ | Lessons learned from activation outliers for weight quantization in large language models |  | PTQ |
2023.06.01 | | AWQ | Activation-aware Weight Quantization for LLM Compression and Acceleration |  | PTQ |
2023.05.30 | | PreQuant | A Task-agnostic Quantization Approach for Pre-trained Language Models |  | |
2023.05.29 | | LLM-QAT | Data-Free Quantization Aware Training for Large Language Models |  | QAT |
2023.05.23 | | QLoRA | Efficient Finetuning of Quantized LLMs |  | QAT |
2023.05.23 | | PEQA | Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization |  | |
2023.03.15 | 2023.05.26 | ZeroQuant-V2 | Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation |  | PTQ |
2023.04.18 | | Outlier Suppression+ | Accurate quantization of large language models by equivalent and optimal shifting and scaling |  | PTQ |
2023.04.03 | 2023.05.17 | RPTQ | Reorder-based Post-training Quantization for Large Language Models |  | PTQ |
2022.11.18 | 2023.06.05 | SmoothQuant | Accurate and Efficient Post-Training Quantization for Large Language Models |  | PTQ |
2022.10.31 | 2023.03.22 | GPTQ | Accurate Post-Training Quantization for Generative Pre-trained Transformers |  | PTQ |
2022.09.27 | 2023.02.21 | Outlier Suppression | Pushing the Limit of Low-bit Transformer Language Models |  | PTQ |
2022.08.15 | 2022.11.10 | LLM.int8() | 8-bit Matrix Multiplication for Transformers at Scale |  | |
2022.06.20 | 2023.04.15 | LUT-GEMM | Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models |  | |
2022.06.04 | | ZeroQuant | Efficient and Affordable Post-Training Quantization for Large-Scale Transformers |  | PTQ |
2022.05.25 | 2022.10.02 | BiT | Robustly Binarized Multi-distilled Transformer |  | Extreme |
2022.03.21 | 2022.07.16 | | Compression of Generative Pre-trained Language Models via Quantization |  | |
2022.03.12 | | BiBERT | Accurate Fully Binarized BERT |  | Extreme |
2021.09.30 | | MREM | Towards Efficient Post-training Quantization of Pre-trained Language Models |  | PTQ |
2021.09.27 | | PEG-PTQ | Understanding and Overcoming the Challenges of Efficient Transformer Quantization |  | |
2021.06.02 | | SPIQA | On the Distribution, Sparsity, and Inference-time Quantization of Attention Values in Transformers |  | |
2021.01.15 | | KDLSQ-BERT | A Quantized Bert Combining Knowledge Distillation with Learned Step Size Quantization |  | |
2021.01.05 | 2021.05.08 | I-BERT | Integer-only BERT Quantization |  | |
2020.12.31 | 2021.07.22 | BinaryBERT | Pushing the Limit of BERT Quantization |  | Extreme |
2020.09.27 | 2020.10.10 | TernaryBERT | Distillation-aware Ultra-low Bit BERT |  | Extreme |
2020.09.17 | 2020.09.18 | | Towards Fully 8-bit Integer Inference for the Transformer Model |  | |
2020.09.16 | 2020.10.13 | | Extremely Low Bit Transformer Quantization for On-Device Neural Machine Translation |  | |
2020.05.08 | 2020.09.27 | GOBO | Quantizing Attention-Based NLP Models for Low Latency and Energy Efficient Inference |  | |
2019.10.14 | 2019.10.17 | Q8BERT | Quantized 8Bit BERT |  | |
2019.09.12 | 2019.09.25 | Q-BERT | Hessian Based Ultra Low Precision Quantization of BERT |  | |
2019.06.03 | 2019.06.07 | | Efficient 8-Bit Quantization of Transformer Neural Machine Language Translation Model |  | |
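The PTQ methods listed above refine the same basic operation: mapping float weights onto a low-bit integer grid with a per-channel (or per-group) scale, then dequantizing at matmul time. The sketch below shows only that round-to-nearest (RTN) baseline, not any specific method from the table; the function names and the NumPy dependency are illustrative assumptions.

```python
# Minimal round-to-nearest (RTN) weight quantization sketch -- the common
# baseline that methods such as GPTQ, AWQ, and SmoothQuant improve upon.
# Illustrative only; does not reproduce any paper's exact algorithm.
import numpy as np

def quantize_per_channel(w: np.ndarray, n_bits: int = 8):
    """Symmetric per-output-channel quantization of a weight matrix w [out, in]."""
    qmax = 2 ** (n_bits - 1) - 1                          # e.g. 127 for INT8
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax   # one scale per output row
    scale = np.maximum(scale, 1e-8)                       # guard against all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover an approximate float weight matrix from integers and scales."""
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    w = np.random.randn(4, 16).astype(np.float32)
    q, s = quantize_per_channel(w, n_bits=4)
    w_hat = dequantize(q, s)
    print("max abs reconstruction error:", np.abs(w - w_hat).max())
```

The papers in the table differ mainly in how they reduce the reconstruction error this baseline leaves behind, e.g. by handling activation outliers (LLM.int8(), SmoothQuant, AWQ), solving a calibrated weight-update problem (GPTQ), or keeping a small sparse/outlier component in higher precision (SpQR, SqueezeLLM).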