Date: 2023-08-20

Large language models (LLMs) have seen advancements in several areas, such as text creation, few-shot learning, reasoning, and protein sequence modeling. These models can have hundreds of billions of parameters, which require complex deployment strategies and efficient inference techniques.

Researchers at Cornell University have proposed a technique to make LLMs more practical to deploy in real-world scenarios. Called quantization with incoherence processing (QuIP), it involves two phases.

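The first phase makes the weight and proxy Hessian matrices incoherent by multiplying them by a Kronecker product of random orthogonal matrices. Roughly, a matrix is incoherent when its entries are even in magnitude and the directions that matter most for rounding are not aligned with the coordinate axes. The multiplication is applied as a pre-processing step before quantization and reversed as a post-processing step afterward.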
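The incoherence step can be illustrated with a small NumPy sketch. It is not the researchers' implementation: the helper names, the 64 x 64 size, the 8 x 8 Kronecker factors, the synthetic Hessian, and the uniform rounding grid are all assumptions chosen for illustration. It shows two things: a single large outlier weight gets spread across many entries, and the quadratic proxy error is unchanged by the transform, so nothing is lost by quantizing in the transformed basis and mapping back.

```python
import numpy as np

def random_orthogonal(n, rng):
    """Draw an n x n orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))  # sign fix per column; columns stay orthonormal

def kron_orthogonal(n1, n2, rng):
    """Orthogonal (n1*n2) x (n1*n2) matrix formed as a Kronecker product of two small factors."""
    return np.kron(random_orthogonal(n1, rng), random_orthogonal(n2, rng))

def proxy_loss(W_hat, W, H):
    """Quadratic proxy error tr((W_hat - W) H (W_hat - W)^T)."""
    D = W_hat - W
    return float(np.trace(D @ H @ D.T))

rng = np.random.default_rng(0)
n = 64
W = 0.01 * rng.standard_normal((n, n))
W[3, 5] = 1.0                       # one outlier weight (hypothetical example data)
X = rng.standard_normal((256, n))   # stand-in for layer inputs
H = X.T @ X / 256                   # synthetic proxy Hessian

U = kron_orthogonal(8, 8, rng)
V = kron_orthogonal(8, 8, rng)

# Pre-processing: move weights and Hessian into the random orthogonal basis.
W_inc = U @ W @ V.T
H_inc = V @ H @ V.T

# Quantize in the transformed basis (plain nearest rounding on an assumed uniform grid).
step = 0.05
W_hat_inc = np.round(W_inc / step) * step

# Post-processing: undo the transform to return to the original basis.
W_hat = U.T @ W_hat_inc @ V

print("largest |weight| before / after:", np.abs(W).max(), np.abs(W_inc).max())
print("proxy loss in original / transformed basis:",
      proxy_loss(W_hat, W, H), proxy_loss(W_hat_inc, W_inc, H_inc))
```

The Kronecker structure is what keeps this cheap: only the two small factors need to be stored and applied, although the sketch materializes the full matrices for clarity.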

The second phase is an adaptive rounding procedure that minimizes a quadratic proxy objective for the error between the original and quantized weights. It uses an estimate of the Hessian to decide how each weight should be rounded.
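To make the idea of adaptive rounding against a quadratic proxy concrete, here is a minimal sketch of one such scheme: columns are rounded one at a time, and the error already committed on earlier columns is fed back into the next column before it is rounded. The feedback coefficients come from factoring the Hessian estimate as H = U diag(d) U^T with U unit upper triangular. This is an illustration under assumptions, not the researchers' code; the function names, the unbounded uniform grid, and the synthetic correlated Hessian are all hypothetical.

```python
import numpy as np

def unit_upper_factor(H):
    """Factor H = U @ diag(d) @ U.T with U unit upper triangular (H positive definite)."""
    L = np.linalg.cholesky(H[::-1, ::-1])  # Cholesky of the index-reversed matrix
    R = L[::-1, ::-1]                      # reversing back gives upper-triangular R with H = R @ R.T
    diag = np.diag(R)
    return R / diag, diag ** 2             # scale each column so the diagonal of U is 1

def round_to_grid(x, step):
    """Nearest rounding onto an unbounded uniform grid with spacing `step`."""
    return np.round(x / step) * step

def feedback_round(W, H, step):
    """Round columns left to right, correcting each one for the error made so far."""
    U, _ = unit_upper_factor(H)
    A = U - np.eye(H.shape[0])             # strictly upper-triangular feedback coefficients
    W_hat = np.zeros_like(W)
    for k in range(W.shape[1]):
        correction = (W[:, :k] - W_hat[:, :k]) @ A[:k, k]
        W_hat[:, k] = round_to_grid(W[:, k] + correction, step)
    return W_hat

def proxy_loss(W_hat, W, H):
    """Quadratic proxy error tr((W_hat - W) H (W_hat - W)^T)."""
    D = W_hat - W
    return float(np.trace(D @ H @ D.T))

rng = np.random.default_rng(0)
m, n, N = 32, 32, 256
# Strongly correlated synthetic inputs, so the Hessian estimate is far from diagonal.
X = rng.standard_normal((N, 4)) @ rng.standard_normal((4, n)) + 0.1 * rng.standard_normal((N, n))
H = X.T @ X / N
W = rng.standard_normal((m, n))
step = 0.5

loss_nearest = proxy_loss(round_to_grid(W, step), W, H)
loss_feedback = proxy_loss(feedback_round(W, H, step), W, H)
print("nearest rounding:", loss_nearest)
print("feedback rounding:", loss_feedback)
```

With this choice of coefficients the proxy error reduces to a sum of squared per-column rounding errors weighted by d, which is why the Hessian-aware variant does well when inputs are correlated.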

The researchers demonstrated that incoherence processing significantly improves large-model quantization, especially at higher compression rates. They achieved usable results with only two bits per weight, which they report as a first for LLM quantization. They also observed that the gap between 2-bit and 4-bit compression narrows as model size grows, which suggests that accurate 2-bit inference in LLMs may be feasible.
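For a sense of what two bits per weight means for storage, the sketch below packs four 2-bit codes into each byte and computes the raw weight footprint at different bit widths. The 70-billion-parameter figure is a hypothetical model size used only for the arithmetic, the function names are made up for this example, and real formats add some overhead (scales, codebooks, metadata) that is ignored here.

```python
import numpy as np

def weight_bytes(num_params, bits_per_weight):
    """Raw storage for the weights alone, ignoring any per-group metadata."""
    return num_params * bits_per_weight // 8

def pack_2bit(codes):
    """Pack integer codes in {0, 1, 2, 3} four to a byte (length must be a multiple of 4)."""
    c = np.asarray(codes, dtype=np.uint8).reshape(-1, 4)
    return c[:, 0] | (c[:, 1] << 2) | (c[:, 2] << 4) | (c[:, 3] << 6)

def unpack_2bit(packed):
    """Recover the original 2-bit codes from the packed bytes."""
    packed = np.asarray(packed, dtype=np.uint8)
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    return ((packed[:, None] >> shifts) & 3).reshape(-1)

rng = np.random.default_rng(0)
codes = rng.integers(0, 4, size=16)
packed = pack_2bit(codes)
restored = unpack_2bit(packed)

params = 70_000_000_000  # hypothetical model size
print("16-bit weights:", weight_bytes(params, 16) / 1e9, "GB")
print(" 4-bit weights:", weight_bytes(params, 4) / 1e9, "GB")
print(" 2-bit weights:", weight_bytes(params, 2) / 1e9, "GB")
```

Going from 16 bits to 2 bits per weight is an 8x reduction in raw weight storage, which is the kind of saving that makes the narrowing accuracy gap at larger model sizes interesting.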

The research also included a theoretical analysis of the quantization algorithm’s scalability to LLM-sized models, investigating the impact of incoherence and comparing it to other rounding techniques. They found that QuIP without incoherence processing provides a more efficient implementation of the OPTQ technique.

However, the proxy objective does not account for interactions between transformer blocks, or between layers within a block. The researchers noted that it is unknown whether the benefit of including such interactions at this scale would be worth the computational effort.

Overall, the research highlights a promising approach to making large language models cheaper to run through quantization and incoherence processing. It opens up possibilities for efficient, accurate LLM inference with reduced memory and computational requirements.