A deep dive into pruning, quantization, distillation, and other techniques to make your neural networks more efficient and easier to deploy.

Whether you’re preparing for interviews or building Machine Learning systems at your job, model compression has become a must-have skill. In the era of LLMs, where models keep getting larger, the challenge of compressing them to make them more efficient, smaller, and usable on lightweight machines has never been more relevant.

In this article, I will go through four fundamental compression techniques that every ML practitioner should understand and master: pruning, quantization, low-rank factorization, and knowledge distillation, each offering unique advantages. I will also add some minimal PyTorch code samples for each of these methods.

I hope you enjoy the article!

Pruning is probably the most intuitive compression technique. The idea is very simple: remove some of the weights of the network, either randomly or by targeting the “less important” ones. Of course, “removing” a weight in the context of neural networks simply means setting it to zero.

Let’s start with a simple heuristic: removing weights whose magnitude falls below a threshold.

\[ w'_{ij} = \begin{cases} w_{ij} & \text{if } |w_{ij}| \ge \theta_0 \\ 0 & \text{if } |w_{ij}| < \theta_0 \end{cases} \]

Of course, this is not ideal because we would need a way to find the right threshold for our problem! A more practical approach is to remove a specified proportion of the weights with the smallest magnitudes (norm) within one layer. There are two common ways of implementing pruning within one layer: unstructured pruning, which zeroes out individual weights wherever they fall in the weight matrix, and structured pruning, which removes entire rows, columns, or channels at once.

We can also use global pruning with either of the two above methods. This removes the chosen proportion of weights across multiple layers, and the removal rate may differ from layer to layer depending on how many parameters each one holds.

PyTorch makes this pretty straightforward, as the sketch below shows (by the way, you can find all code snippets in my GitHub repo).

Note: if you have taken statistics classes, you probably learned about regularization methods (L0 or L1 penalties) that also implicitly prune some weights during training. Pruning differs from these because it is applied after training, as a post-hoc compression step.

I would like to conclude this section with a quick mention of the Lottery Ticket Hypothesis, which is both an application of pruning and an interesting explanation of why removing weights can often improve a model. I recommend reading the associated paper ([7]) for more details.

The authors use the following procedure: train the network, prune the 10% of weights with the smallest magnitudes, reset the remaining weights to their original initialization values, and retrain, then repeat this prune-reset-retrain cycle. After doing this 30 times, you end up with only \(0.9^{30} \approx 4\%\) of the original parameters. And surprisingly, this pruned network can do as well as the original one.

This suggests that there is significant parameter redundancy. In other words, there exists a sub-network (“a lottery ticket”) that actually does most of the work! Pruning is one way to unveil this sub-network.
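Below is a minimal sketch of what this looks like with torch.nn.utils.prune. The toy model, layer sizes, and pruning amounts are arbitrary choices for illustration; the exact snippets used for the article’s comparison live in the GitHub repo.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A toy model; the sizes are arbitrary and only serve the illustration.
model = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)

# Local unstructured pruning: zero out the 40% smallest-magnitude (L1)
# weights of the first linear layer.
prune.l1_unstructured(model[0], name="weight", amount=0.4)

# Local structured pruning: remove 30% of the rows (output neurons) of
# the last linear layer, ranked by their L2 norm.
prune.ln_structured(model[2], name="weight", amount=0.3, n=2, dim=0)

# Global unstructured pruning: prune 50% of all the weights across both
# layers at once; layers may end up with different sparsity levels.
prune.global_unstructured(
    [(model[0], "weight"), (model[2], "weight")],
    pruning_method=prune.L1Unstructured,
    amount=0.5,
)

# Make the pruning permanent by folding the masks into the weights.
for module in (model[0], model[2]):
    prune.remove(module, "weight")

sparsity = (model[0].weight == 0).float().mean().item()
print(f"Sparsity of the first layer: {sparsity:.1%}")
```

Note that each call attaches a binary mask to the layer rather than physically removing parameters, so the memory savings only materialize if you later export the model in a sparse format or run it on hardware that exploits sparsity.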
While pruning focuses on removing parameters entirely, quantization takes a different approach: reducing the precision of each parameter.

Remember that every number in a computer is stored as a sequence of bits. A float32 value uses 32 bits, whereas an 8-bit integer (int8) uses just 8. Most deep learning models are trained using 32-bit floating-point numbers (FP32). Quantization converts these high-precision values to lower-precision formats such as 16-bit floating point (FP16), 8-bit integers (INT8), or even 4-bit representations.

The savings here are obvious: INT8 requires 75% less memory than FP32. But how do we actually perform this conversion without destroying the model’s performance?

To convert from a floating-point to an integer representation, we need to map the continuous range of values to a discrete set of integers. For INT8 quantization, we are mapping to 256 possible values (from -128 to 127).

Suppose our weights are normalized between -1.0 and 1.0 (common in deep learning):

\[ \text{scale} = \frac{\text{float\_max} - \text{float\_min}}{\text{int8\_max} - \text{int8\_min}} = \frac{1.0 - (-1.0)}{127 - (-128)} = \frac{2.0}{255} \]

Then, the quantized value is given by

\[ \text{quantized\_value} = \text{round}\!\left(\frac{\text{original\_value}}{\text{scale}}\right) + \text{zero\_point} \]

Here, zero_point = 0 because we want 0.0 to be mapped to 0. Rounding to the nearest integer then gives values between -128 and 127.

And, you guessed it, to get from integers back to floats we use the inverse operation:

\[ \text{float\_value} = (\text{integer\_value} - \text{zero\_point}) \times \text{scale} \]

Note: in practice, the scaling factor is determined from the range of the values we actually quantize.

Quantization can be applied at different stages and with different strategies; here are a few techniques worth knowing about (below, the word “activation” refers to the output values of each layer). Post-training quantization (PTQ) quantizes a model that has already been trained, while quantization-aware training (QAT) simulates the quantization during training so the model learns to be robust to it. For activations, static quantization calibrates their ranges ahead of time on a small dataset, whereas dynamic quantization computes them on the fly at inference time.

Quantization is very flexible! You can apply different precision levels to different parts of the model. For instance, you might quantize most linear layers to 8-bit for maximum speed and memory savings, while leaving critical components (e.g. attention heads, or batch-norm layers) at 16-bit or full precision. The sketch below shows the scale/zero-point arithmetic in action.
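Here is a minimal, self-contained sketch of this mapping: per-tensor affine quantization of a float tensor to int8, with the scale and zero point computed from the tensor’s own range. The tensor is random data, purely for illustration.

```python
import torch

def quantize_int8(x: torch.Tensor):
    """Map a float tensor to int8 with a per-tensor scale and zero point."""
    qmin, qmax = -128, 127
    x_min, x_max = x.min().item(), x.max().item()
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax).to(torch.int8)
    return q, scale, zero_point

def dequantize(q: torch.Tensor, scale: float, zero_point: int) -> torch.Tensor:
    """Inverse operation: map int8 values back to (approximate) floats."""
    return (q.to(torch.float32) - zero_point) * scale

# Random "weights", purely for illustration.
w = torch.randn(4, 4)
q, scale, zp = quantize_int8(w)
w_hat = dequantize(q, scale, zp)

print("int8 tensor:\n", q)
print("max absolute quantization error:", (w - w_hat).abs().max().item())
```

In practice you would typically call a built-in routine such as torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8) rather than rolling your own, but the arithmetic underneath is exactly this scale-and-round mapping.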
Now let’s talk about low-rank factorization, a method that has been popularized with the rise of LLMs.

The key observation: many weight matrices in neural networks have effective ranks much lower than their dimensions suggest. In plain English, that means there is a lot of redundancy in the parameters.

Note: if you have ever used PCA for dimensionality reduction, you have already encountered a form of low-rank approximation. PCA decomposes large matrices into products of smaller, lower-rank factors that retain as much information as possible.

Take a weight matrix W. Every real matrix can be represented using a Singular Value Decomposition (SVD):

\[ W = U \Sigma V^T \]

where \(\Sigma\) is a diagonal matrix whose singular values appear in non-increasing order. The number of positive singular values corresponds to the rank r of the matrix W.

To approximate W with a matrix of rank k < r, we keep the k largest singular values together with the corresponding first k columns of U and first k rows of \(V^T\):

\[ \begin{aligned} W_k &= U_k\,\Sigma_k\,V_k^T \\ &= \underbrace{U_k\,\Sigma_k^{1/2}}_{A\in\mathbb{R}^{m\times k}} \; \underbrace{\Sigma_k^{1/2}\,V_k^T}_{B\in\mathbb{R}^{k\times n}}. \end{aligned} \]

See how the new matrix can be written as the product of A and B: the total number of parameters is now m * k + k * n = k(m + n) instead of mn! This is a huge improvement, especially when k is much smaller than m and n.

In practice, it is equivalent to replacing a linear layer x → Wx with two consecutive ones: x → A(Bx).

We can apply low-rank factorization either before training (parameterizing each linear layer as a product of two smaller matrices from the start, which is not really a compression method but a design choice) or after training (applying a truncated SVD to the weight matrices). The second approach is by far the most common one and is implemented below.
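Below is a minimal sketch of this post-training approach: run a truncated SVD on the weight of a trained nn.Linear layer and replace it with two smaller linear layers. The layer sizes and the rank are arbitrary choices for illustration, not a recommendation.

```python
import torch
import torch.nn as nn

def low_rank_factorize(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Replace x -> Wx + b with x -> A(Bx) + b using a truncated SVD of W."""
    W = layer.weight.data            # shape (out_features, in_features) = (m, n)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)

    # Keep the top `rank` singular values and split sqrt(S) between the factors.
    sqrt_S = torch.diag(S[:rank].sqrt())
    A = U[:, :rank] @ sqrt_S         # (m, k)
    B = sqrt_S @ Vh[:rank, :]        # (k, n)

    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data = B
    second.weight.data = A
    if layer.bias is not None:
        second.bias.data = layer.bias.data
    return nn.Sequential(first, second)

# Illustration: a 512 x 1024 layer approximated with rank 32.
layer = nn.Linear(1024, 512)
compressed = low_rank_factorize(layer, rank=32)

x = torch.randn(8, 1024)
print("approximation error:", (layer(x) - compressed(x)).abs().mean().item())
print("params:", sum(p.numel() for p in layer.parameters()),
      "->", sum(p.numel() for p in compressed.parameters()))
```

Splitting \(\Sigma_k^{1/2}\) between the two factors is only a convention; folding \(\Sigma_k\) entirely into either factor gives the same product. The interesting design choice is k, which trades reconstruction error against compression.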
I think it is crucial to mention LoRA here: you have probably heard of LoRA (Low-Rank Adaptation) if you have been following LLM fine-tuning developments. Though not strictly a compression technique, LoRA has become extremely popular for adapting large language models and making fine-tuning much more efficient. The idea is simple: during fine-tuning, rather than modifying the original model weights W, LoRA freezes them and learns trainable low-rank updates:

\( W' = W + \Delta W = W + AB \)

where A and B are low-rank matrices. This allows for task-specific adaptation with just a fraction of the parameters. Even better: QLoRA takes this further by combining quantization with low-rank adaptation!

Again, this is a very flexible technique and can be applied at various stages. Usually, LoRA is applied only to specific layers (for example, the attention layers’ weight matrices).

Knowledge distillation takes a fundamentally different approach from what we have seen so far. Instead of modifying an existing model’s parameters, it transfers the “knowledge” from a large, complex model (the “teacher”) to a smaller, more efficient model (the “student”). The goal is to train the student to mimic the behavior and replicate the performance of the teacher, which is often an easier task than solving the original problem from scratch.

Let’s explain some concepts in the case of a classification problem: the hard labels are the ground-truth classes of the dataset, while the soft labels (or soft targets) are the probability distributions output by the teacher, usually softened by dividing the logits by a temperature before the softmax so that they carry more information about how the teacher ranks the classes.

In practice, it is pretty straightforward to train the student model. We combine the usual loss (standard cross-entropy on the hard labels) with the “distillation” loss (based on the teacher’s soft targets):

\( L_{\text{total}} = \alpha L_{\text{hard}} + (1 - \alpha) L_{\text{distill}} \)

The distillation loss is nothing but the KL divergence between the teacher and student distributions (you can see it as a measure of the distance between the two distributions):

\( L_{\text{distill}} = D_{KL}(q_{\text{teacher}} \,\|\, q_{\text{student}}) = \sum_i q_{\text{teacher}, i} \log \left( \frac{q_{\text{teacher}, i}}{q_{\text{student}, i}} \right) \)

As for the other methods, it is possible and encouraged to adapt this framework depending on the use case: for example, one can also compare logits and activations from intermediate layers of the student and teacher models, instead of only comparing the final outputs.

Similar to the previous techniques, there are two options: offline distillation, where a pre-trained teacher is frozen and only the student is trained, and online distillation, where teacher and student are trained together. An easy way to apply offline distillation is shown in the last code block of this article, right after the conclusion 🙂.

Thanks for reading this article! In the era of LLMs, with billions or even trillions of parameters, model compression has become a fundamental concept, essential in almost every scenario to make models more efficient and easily deployable.

But as we have seen, model compression isn’t just about reducing the model size: it’s about making thoughtful design decisions. Whether choosing between online and offline methods, compressing the entire network, or targeting specific layers or channels, each choice significantly impacts performance and usability. Most models now combine several of these techniques (check out this model, for instance). Beyond introducing you to the main methods, I hope this article also inspires you to experiment and develop your own creative solutions!

Don’t forget to check out the GitHub repository, where you’ll find all the code snippets and a side-by-side comparison of the four compression methods discussed in this article.
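And here it is, the last code block of this article: a minimal sketch of offline distillation. It assumes you already have a trained teacher; the toy architectures, the temperature, and the weighting factor alpha are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_step(student, teacher, x, y, optimizer, T=2.0, alpha=0.5):
    """One offline-distillation update: the teacher is frozen, only the student trains."""
    teacher.eval()
    with torch.no_grad():
        teacher_logits = teacher(x)
    student_logits = student(x)

    # Hard loss: standard cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, y)

    # Distillation loss: KL divergence between the softened teacher and
    # student distributions (scaled by T^2, a common convention).
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    distill_loss = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * T * T

    loss = alpha * hard_loss + (1 - alpha) * distill_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with made-up models and data, just to show the call.
teacher = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 5))
student = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 5))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

x, y = torch.randn(32, 20), torch.randint(0, 5, (32,))
print("loss:", distillation_step(student, teacher, x, y, optimizer))
```

The \(T^2\) factor keeps the gradient scale of the soft loss comparable across temperatures; it is a common convention rather than a requirement.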