2023.06.13 | | SqueezeLLM | Dense-and-Sparse Quantization | | PTQ |
2023.06.05 | | SpQR | A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression | | PTQ |
2023.06.04 | 2023.06.13 | OWQ | Lessons learned from activation outliers for weight quantization in large language models | | PTQ |
2023.06.01 | | AWQ | Activation-aware Weight Quantization for LLM Compression and Acceleration | | PTQ |
2023.05.30 | | PreQuant | A Task-agnostic Quantization Approach for Pre-trained Language Models | | |
2023.05.29 | | LLM-QAT | Data-Free Quantization Aware Training for Large Language Models | | QAT |
2023.05.23 | | QLoRA | Efficient Finetuning of Quantized LLMs | | QAT |
2023.05.23 | | PEQA | Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization | | |
2023.04.18 | | Outlier Suppression+ | Accurate quantization of large language models by equivalent and optimal shifting and scaling | | PTQ |
2023.04.03 | 2023.05.17 | RPTQ | Reorder-based Post-training Quantization for Large Language Models | | PTQ |
2023.03.15 | 2023.05.26 | ZeroQuant-V2 | Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation | | PTQ |
2022.11.18 | 2023.06.05 | SmoothQuant | Accurate and Efficient Post-Training Quantization for Large Language Models | | PTQ |
2022.10.31 | 2023.03.22 | GPTQ | Accurate Post-Training Quantization for Generative Pre-trained Transformers | | PTQ |
2022.09.27 | 2023.02.21 | Outlier Suppression | Pushing the Limit of Low-bit Transformer Language Models | | PTQ |
2022.08.15 | 2022.11.10 | LLM.int8() | 8-bit Matrix Multiplication for Transformers at Scale | | |
2022.06.20 | 2023.04.15 | LUT-GEMM | Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models | | |
2022.06.04 | | ZeroQuant | Efficient and Affordable Post-Training Quantization for Large-Scale Transformers | | PTQ |
2022.05.25 | 2022.10.02 | BiT | Robustly Binarized Multi-distilled Transformer | | Extreme |
2022.03.21 | 2022.07.16 | | Compression of Generative Pre-trained Language Models via Quantization | | |
2022.03.12 | | BiBERT | Accurate Fully Binarized BERT | | Extreme |
2021.09.30 | | MREM | Towards Efficient Post-training Quantization of Pre-trained Language Models | | PTQ |
2021.09.27 | | PEG-PTQ | Understanding and Overcoming the Challenges of Efficient Transformer Quantization | | |
2021.06.02 | | SPIQA | On the Distribution, Sparsity, and Inference-time Quantization of Attention Values in Transformers | | |
2021.01.15 | | KDLSQ-BERT | A Quantized Bert Combining Knowledge Distillation with Learned Step Size Quantization | | |
2021.01.05 | 2021.05.08 | I-BERT | Integer-only BERT Quantization | | |
2020.12.31 | 2021.07.22 | BinaryBERT | Pushing the Limit of BERT Quantization | | Extreme |
2020.09.27 | 2020.10.10 | TernaryBERT | Distillation-aware Ultra-low Bit BERT | | Extreme |
2020.09.17 | 2020.09.18 | | Towards Fully 8-bit Integer Inference for the Transformer Model | | |
2020.09.16 | 2020.10.13 | | Extremely Low Bit Transformer Quantization for On-Device Neural Machine Translation | | |
2020.05.08 | 2020.09.27 | GOBO | Quantizing Attention-Based NLP Models for Low Latency and Energy Efficient Inference | | |
2019.10.14 | 2019.10.17 | Q8BERT | Quantized 8Bit BERT | | |
2019.09.12 | 2019.09.25 | Q-BERT | Hessian Based Ultra Low Precision Quantization of BERT | | |
2019.06.03 | 2019.06.07 | | Efficient 8-Bit Quantization of Transformer Neural Machine Language Translation Model | | |
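
Most of the entries tagged PTQ above can be read as refinements of the simplest weight-only baseline: symmetric per-channel round-to-nearest (RTN) quantization. For orientation only, here is a minimal PyTorch sketch of that baseline; it is not the algorithm of any specific paper in the table, and the function name, bit-width, and layer size are assumptions made for the example.

```python
import torch

def quantize_weight_rtn(w: torch.Tensor, n_bits: int = 4):
    """Symmetric per-output-channel round-to-nearest (RTN) weight quantization.

    w: [out_features, in_features] weight matrix.
    Returns the dequantized ("fake-quantized") weights, the integer codes, and the scales.
    """
    qmax = 2 ** (n_bits - 1) - 1                      # e.g. 7 for signed 4-bit
    scale = w.abs().amax(dim=1, keepdim=True) / qmax  # one scale per output channel
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale, q.to(torch.int8), scale

# Usage: fake-quantize a Linear layer's weights and inspect the rounding error.
layer = torch.nn.Linear(4096, 4096, bias=False)
w_dq, q, scale = quantize_weight_rtn(layer.weight.data, n_bits=4)
print("mean abs error:", (layer.weight.data - w_dq).abs().mean().item())
```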