Quantization is reducing the number of bits to store a parameter for a machine learning model.
Put simply, a parameter is a number the model uses to make decisions - for example, how likely it is that something will occur: if the number is < 0.5 say "goodbye", otherwise say "hello".
Now, if the parameter is a 32-bit (unsigned) integer it can take any value from 0 to 4,294,967,295 - about 4.3 billion distinct values.
If you were using this 32-bit value to represent physical objects, you could represent 4,294,967,296 objects (each object gets its own number).
However, a lot of the time in machine learning you find after training that a particular parameter doesn't need to distinguish anywhere near that many different "things". Say you were representing types of fruit with this parameter (Google says there are over 2,000 types of fruit, but let's just say there are exactly 2,000). In that case 4,294,967,296 / 2,000 ≈ 2.1 million, so there are about 2.1 million distinct values available for each fruit - which is a huge waste! Ideally we would use the smallest number that can cover 0-2,000 for this job: 2,000 distinct values only need 11 bits, since 2^11 = 2,048. A quick sketch of that arithmetic is below.
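To make the arithmetic concrete, here is a tiny Python sketch (the fruit count of 2,000 is just the illustrative figure from above) working out how many bits you would actually need:

```python
import math

# How many bits do we actually need to tell 2,000 fruits apart?
num_fruits = 2000
bits_needed = math.ceil(math.log2(num_fruits))  # 11 bits, since 2**11 = 2048

print(f"{bits_needed} bits cover {2**bits_needed} distinct values")
print(f"a 32-bit integer wastes {2**32 // num_fruits:,} values per fruit")
```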
This is where quantization comes in: the size of the number we use to store each parameter is reduced, saving memory at the cost of a small hit to model accuracy. In practice many models don't lose much accuracy from this, which tells us that the way the parameter is used inside the model doesn't really need (or take advantage of) the ability to represent so many values.
So what we do is reduce that 32-bit number to 16, 8, or 4 bits. We go from being able to represent billions of distinct values/states to maybe 16 (with 4-bit quantization), and then we benchmark the model against the larger version with 32-bit parameters - often finding that whatever training decided to use that parameter for doesn't really need such a finely grained value.
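As a rough illustration of the idea (not any particular library's implementation), here is a minimal Python/NumPy sketch of symmetric linear quantization: squeeze float32 weights onto a 4-bit integer grid, map them back, and see how little is lost.

```python
import numpy as np

def quantize(weights, num_bits=4):
    """Map float32 weights onto a small grid of integer levels."""
    levels = 2 ** (num_bits - 1) - 1          # e.g. 7 for 4-bit signed
    scale = np.max(np.abs(weights)) / levels  # one scale for the whole tensor
    q = np.clip(np.round(weights / scale), -levels, levels).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float32 weights from the integer grid."""
    return q.astype(np.float32) * scale

# Pretend these are trained 32-bit parameters of some layer.
w = np.random.randn(1000).astype(np.float32)

q, scale = quantize(w, num_bits=4)
w_hat = dequantize(q, scale)

# The 4-bit copy has at most ~16 distinct values, yet stays close to the original.
print("distinct values after 4-bit quantization:", len(np.unique(q)))
print("mean absolute error:", np.mean(np.abs(w - w_hat)))
```

Real quantization toolkits are more careful than this (per-channel scales, calibration data, sometimes quantization-aware training), but the final check is exactly the benchmarking described above: run the quantized model next to the 32-bit one and see whether the accuracy drop is acceptable.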