We use compressed data every day, whether it be music files (MP3), images (JPEG), video (MPEG-4) or mass-archive storage (ZIP). It is fascinating to watch a large amount of disk space get scrunched into a handy, tiny file. But how does this process of squeezing work?
In this three-part article, we will take you through some of the basics of data compression.
Part 1: Information:
How do we define information? How do we quantify the amount of information that any piece of data contains? In computers, we measure information in bits and bytes (or, these days, more realistically in megabytes and gigabytes).
While this is surely a good way to ascertain the amount of space the data will take to store in binary format, it fails to quantify the real worth of the data. Your 100-page project document will probably take less space than a 2-minute commercial, yet that says nothing about which is more important.
When we deal with data outside of computers, we tend to look at it in a different way. You might consider it worth remembering each and every detail of your project and fail miserably, yet you will easily remember the commercial, no matter how little it means to you! While this might seem like an unimportant factor when considering data compression, it becomes quite important when dealing with data such as audio, video, and images.
Like most things, information is relative: what is information for me may be nothing for you. Take a common example: news that you failed in Mathematics may be no news to you at all, but it might shock you to learn that you failed in Geography. Which carries more information here?
Redundancy
Consider a situation where your employee calls you every day to say they will be late for work. The call gives you no new information; you already expect a call around the same time with the same message. However, if on a particular day you get a call saying they will be on time, that call suddenly carries more meaning. You can see, then, that where there is no NEW information, where there is redundancy, there is a better and more efficient way to exchange the same data.
Instead of calling every day with the same recycled excuses, you could construct a “dictionary” of excuses, and just cite one in a conversation. A simple call, “Excuse 42”, would be enough; heck, even saying “Excuse” is redundant.
Data compression mechanisms tend to do the same thing: create a table of commonly occurring patterns, and replace each occurrence with something that takes less space. In essence, frequently occurring patterns are replaced with smaller codes, while infrequently occurring patterns might even take up more space than they initially did! However, if the algorithm is properly designed, there is a net gain in the end.
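To make this concrete, here is a toy sketch in Python (the sentence and the numeric codes are made up for illustration; this is not how any particular format does it): words that repeat are replaced with short codes from a shared dictionary.

    from collections import Counter

    # Toy dictionary substitution: repeated words get short numeric codes.
    text = ("will be late for work " * 5) + "will be on time today"
    words = text.split()

    repeated = [w for w, n in Counter(words).most_common() if n > 1]
    codes = {w: str(i) for i, w in enumerate(repeated)}

    encoded = " ".join(codes.get(w, w) for w in words)
    print(len(text), "->", len(encoded))
    # A real compressor must also store the dictionary and keep the codes
    # unambiguous, but the principle is the same: repetition becomes cheap.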
Packing Bytes
Try this: create a text file with a single sentence, and keep pasting it until the file is a few MB. This will take time! You will have to paste it thousands, maybe hundreds of thousands, of times! A better approach is to exponentially increase the amount of text you paste. Paste the sentence a couple of times, select all, and paste a couple of times again. Keep increasing the amount you copy and soon you will have a reasonably large file. If you know a programming language you could perhaps write a script. When you compress this file using any program, you will see a drastic decrease in file size, because we have a small pattern repeating continuously. In fact, no matter how large you make the input file, the compressed result will remain a tiny fraction of its size!
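If you would rather not paste by hand, a few lines of Python will do the same job. This is only a sketch using the standard zlib module (the same DEFLATE family of compression that ZIP uses); the sentence and repeat count are arbitrary.

    import zlib

    # Build a few MB of text by repeating one sentence, then compress it.
    sentence = b"The quick brown fox jumps over the lazy dog. "
    data = sentence * 100_000            # roughly 4.5 MB of input

    compressed = zlib.compress(data, 9)
    print(len(data), "bytes ->", len(compressed), "bytes")
    # The endlessly repeating pattern collapses to a tiny fraction of the input;
    # dedicated archivers with larger dictionaries shrink it even further.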
The aim of compression algorithms is to pack more information into the same amount of space, to reduce the redundancy of the data. Compressed files have very low redundancy, and it is for this reason that compressing an already compressed file will not give much benefit. You will not get any real benefit from compressing a ZIP file into a RAR, even though the original uncompressed file might have compressed better with RAR in the first place. Similarly, an uncompressed BMP image or WAV audio file might compress quite a bit, while a JPG image or MP3 file will barely benefit from further compression.
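You can see this for yourself with another small sketch: compress a reasonably redundant piece of text once, then compress the result again. The made-up vocabulary below is just an assumed stand-in for ordinary prose.

    import random
    import string
    import zlib

    # English-like filler text built from a small vocabulary of invented words,
    # so the first pass has plenty of repetition to work with.
    random.seed(0)
    vocab = ["".join(random.choices(string.ascii_lowercase, k=random.randint(3, 8)))
             for _ in range(200)]
    text = " ".join(random.choices(vocab, k=50_000)).encode()

    once = zlib.compress(text, 9)     # first pass: a solid saving
    twice = zlib.compress(once, 9)    # second pass: almost nothing left to squeeze
    print(len(text), "->", len(once), "->", len(twice))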
Every compression algorithm uses some mechanism to detect patterns, and assigns them codes based on how frequently they occur. It is this selection of patterns and the choice of their codes that makes each compression algorithm unique.
Part 2: Data Explosion:
We often have no idea how much data lies behind the content we consume. There is such an information explosion that not even your uber-cool 2TB hard disk is enough. We may not realise it, but much of the media we use today would be infeasible to store or transfer unless highly compressed.
Image Data
To see how much compression goes on behind the scenes, take a normal full-colour image in BMP format. The image takes up 24 bits per pixel, which is 3 bytes per pixel. This means that to find out the file size you multiply the number of pixels by 3 bytes. So a common 2-megapixel image will take up around 6MB uncompressed.
While this may seem rather small considering the size of hard disks today, for an image used on the internet it is exorbitant. The same image as a JPG would, in all probability, take less than a megabyte.
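As a quick check of the arithmetic (taking 2 megapixels as a round figure):

    pixels = 2_000_000
    bytes_per_pixel = 3                    # 24-bit colour

    print(pixels * bytes_per_pixel)        # 6,000,000 bytes, the "around 6MB" above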
Now a look at videos.
Video Data
A simple VCD-quality video, hardly worth a scoff in today's HD world, has a resolution of 352x288 (PAL), at 24-bit colour, playing at 25 frames per second. A single frame here would be around 352x288x3 bytes = 304128 bytes = 297 KB. Not much? Imagine that coming at you 25 times a second. That would be around 7.25 MB per second; a single VCD worth of content (around 80 minutes) would take over 34 GB to store!
On comparing that to the actual size of a VCD (around 700 MB), we can see that what we have is a compression of over 48 times. This high compression is possible even with an outdated and inefficient codec such as MPEG-1. A careful observer may notice that the 700MB on a VCD also includes audio, something we have not factored in; yet even with that included, we would only get a higher and more impressive compression ratio. Consider that today we have 1080p HD videos, with a resolution of 1920x1080, around 20 times that of a VCD.
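The VCD arithmetic above, spelled out as a quick sketch:

    width, height, fps = 352, 288, 25
    bytes_per_pixel = 3                            # 24-bit colour

    frame = width * height * bytes_per_pixel       # 304,128 bytes, ~297 KB per frame
    per_second = frame * fps                       # ~7.25 MB every second
    full_disc = per_second * 80 * 60               # 80 minutes of video

    print(per_second / 2**20, full_disc / 2**30)   # ~7.25 MB/s, ~34 GB
    print(full_disc / (700 * 2**20))               # ~50, the "over 48 times" above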
Audio Data
Audio, on the other hand, is something we have all dealt with in uncompressed form, on audio CDs or even as WAV files on our hard disks. It is comparatively easy to deal with audio in its uncompressed form, as audio files are significantly smaller than video. To compare file sizes: around a minute of uncompressed audio takes as much as 10MB, while compressed it can shrink to as little as 600 KB (AAC at 80 kbps), a ratio of as much as 17 times!
To see how we get to this figure, take a one-minute, 16-bit stereo audio sample recorded at a sampling frequency of 44.1 kHz. A 44.1 kHz sampling rate means 44,100 samples a second, each of 16 bits (2 bytes). Multiplying it all, we get 44100 x 2 = 88200 bytes ~ 86 KB for a second of audio. So for a minute of audio we are looking at approximately 5MB per channel, and since we usually have a stereo track with two channels, we have, in effect, 10MB of data per minute. Nowadays audio is recorded at sampling rates as high as 192 kHz, at 32 bits per sample, and with as many as 8 channels (an overall increase of around 34 times)!
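The same arithmetic for audio, as a quick sketch:

    sample_rate = 44_100                 # samples per second
    bytes_per_sample = 2                 # 16 bits
    channels = 2                         # stereo

    per_minute = sample_rate * bytes_per_sample * channels * 60
    print(per_minute / 2**20)            # ~10 MB per minute

    hi_res = 192_000 * 4 * 8 * 60        # 192 kHz, 32-bit, 8 channels
    print(hi_res / per_minute)           # ~34.8, the "around 34 times" above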
It becomes fairly obvious that multimedia is not something to be messed with in its uncompressed state!
Containing the explosion
Even if you ignore the exorbitantly large sizes, there is another concern that needs to be noted. With an uncompressed 1080p HD video reaching over a terabyte in size, how exactly is one to read such a volume of data in the two hours it takes to play?! Uncompressed HD video is like opening as many as 30 BMP images of 1920x1080 resolution every second, requiring a read speed of over 150 megabytes per second for the video alone.
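Checking those figures (using the 30 frames per second mentioned above):

    frame = 1920 * 1080 * 3                        # bytes per uncompressed frame
    per_second = frame * 30
    two_hours = per_second * 2 * 60 * 60

    print(per_second / 2**20)                      # ~178 MB/s of reading, video alone
    print(two_hours / 2**40)                       # ~1.2 TB for a two-hour film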
When dealing with this kind of data, even a compression ratio of three or four times is clearly not enough. While audio and images are still quite often compressed using lossless formats such as PNG for images and APE, FLAC etc. for audio, traditional lossless compression becomes unfeasible when used on video. Lossless compression of video is only of use in media companies, where the video needs to go through great amounts of editing and compositing before it is finally encoded into the format in which it will be distributed.
Additionally, in the case of audio and video, the compression format should be such that the entire file need not be decompressed just to play the media. These requirements are clearly not met by our normal compression utilities, which offer only lossless compression methods and require decompression of the entire file before it can be used. Decompression speed is also very important, since huge amounts of data have to be handled; while it is possible to come up with some innovative compression methods, it is essential that the resulting file be playable on current computers.
Besides audio and video, many other kinds of data require great volumes of storage space. Software distributed online is usually heavily packed, because increasing internet speeds are no excuse for inefficiency. The popular VLC media player, for example, is a ~17MB download in its version 1, yet after installation it expands to take up as much as 70MB! While such content needs to be compressed without loss of information, fast decompression is not as important here as it is for media playback.
Different kinds of data show redundancy in different places: an image might have an entire row or column of a single colour, a video might have a few repeated frames, text might have some words appearing more frequently than others. Each data type requires a different approach to detect such redundancies, so that it may be stored in a minimal amount of space.
Part 3: Reducing Redundancy:
Many a new computer user might have wondered, "What if I keep compressing the compressed file?" Zip the RAR which has been CAB'd and 7z'd? Won't you just get a simple, clean 1-byte file you could store on that floppy? (Young readers might not know, but way back, almost 10 years ago, many people relied on small magnetic disks which could store 1.44MB.)
As explained before, this isn't possible, since the aim of a compression algorithm is to juice out all redundancy. And the thing about redundancy is, once it's gone, it's gone.
Taking English text for example, the letter 'e' is the most common of all letters, occurring on average with a probability of ~12.7%, with the second-placed 't' lagging far behind at ~9.1%.
The thing is, though: how would you store an 'e' in less space than it currently takes? A single character requires only 1 byte of storage! This means that to save space, everywhere an 'e' is used we will have to store it in less than 1 byte.
In our normal encoding schemes, we represent each character as a sequence of bits, with each character having a unique sequence. Say 'A' is '01000001', 'B' is '01000010' and so on; this gives a code for each character. However, if every letter takes one byte, we will never be able to compress anything, since every character, no matter what its probability, takes the same amount of space.
In order to get some compression, we need a way of generating a code for each character such that the more common letters get a shorter code, and the rarer ones a longer code.
Our common 'e', for example could get a nice, single bit '0', and 't' could get a '1'.
Even with this simple change, we can analyze the impact on the compression:
Both 'e' and 't' now use a one-bit code instead of an 8-bit code. So each occurrence of 'e' or 't' is compressed to 1/8 (12.5%) of its size.
This replacement applies with a 12.7% + 9.1% frequency, i.e. 21.8% of all characters are either an 'e' or a 't'.
Therefore 21.8% of the text will be compressed to 1/8th of its size.
The final size will be (1/8) x 21.8% + 1 x 78.2% ≈ 81%.
The file will be compressed to 81% of its original size! Pretty decent for just two replacements.
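The same estimate as a two-line calculation (the frequencies are the averages quoted above):

    p_e, p_t = 0.127, 0.091          # average frequencies of 'e' and 't'
    replaced = p_e + p_t             # 21.8% of characters get a 1-bit code

    print(replaced * (1 / 8) + (1 - replaced) * 1)   # ~0.81, i.e. 81% of the original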
Of course, it is impossible to compress text quite this simply. For one, how would you discern the '0' and '1' of 'e' and 't' from the 0s and 1s which code the rest of the text? One way is to create a unique marker which 'pads' each character, something like the space that differentiates each word. This will mean there are some restrictions on how each character can be coded.
Let us look at such an example:
In this example we use a HIGH bit ('1') to mark the start of each code. This means that we CANNOT use a '1' anywhere else in a character's code, as it would be mistaken for the start of the next character.
The codes for the five most frequent characters in the English language:
Letter   Compressed Code   8-bit ASCII Code
e        1                 01100101
t        10                01110100
a        100               01100001
o        1000              01101111
i        10000             01101001
As you can see, we now have codes which are variable in length, yet unique and identifiable, with each code beginning with a '1'.
Let us take the code for 'eat' :
Compressed code: 110010 -> 6 bits
Original code: 011001010110000101110100 -> 24 bits
Using this code essentially reduces the size to a quarter of the original.
The decoder here will read in '110010' and decode it as follows:
Input first character: '1'
Sequence: '1'
Input second character: '1'
Encountered beginning of new character, decode previous sequence '1' as 'e'
Sequence: '1'
Input third character: '0'
Sequence: '10'
Input fourth character: '0'
Sequence: '100'
Input fifth character: '1'
Encountered beginning of new character, decode previous sequence '100' as 'a'
Sequence: '1'
Input sixth character: '0'
Sequence: '10'
End of input encountered, decode previous sequence '10' as 't'
Output: 'eat'
It is clear, though, that as we go further, each character's code gets longer, till we reach 'h', with a frequency of ~6.1% and a code of '10000000', which is already 8 bits. After this point, the code becomes larger than the original 8 bits, and causes an expansion in size rather than a compression. So:
'e' with 12.702% frequency, will be compressed to 1/8 its size.
't' with 09.056% frequency, will be compressed to 1/4 its size.
'a' with 08.167% frequency, will be compressed to 3/8 its size.
'o' with 07.507% frequency, will be compressed to 1/2 its size.
'i' with 06.966% frequency, will be compressed to 5/8 its size.
'n' with 06.749% frequency, will be compressed to 3/4 its size.
's' with 06.327% frequency, will be compressed to 7/8 its size.
After that there will only be an expansion. A little calculation will show you that these characters combined will be compressed to around 44.6% of their original size. This will be offset by the expansion encountered in the other characters.
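The "little calculation" referred to above is a weighted average over these seven letters, with the i-th most frequent letter costing i bits instead of 8:

    freqs = [12.702, 9.056, 8.167, 7.507, 6.966, 6.749, 6.327]   # e t a o i n s

    total = sum(freqs)
    compressed = sum(f * (i + 1) / 8 for i, f in enumerate(freqs))
    print(compressed / total)        # ~0.446, i.e. around 44.6% of their original size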
This scheme is quite limiting and inefficient, and will yield very little compression in most real-world cases. Real compression mechanisms use cleverer ways of creating a code for each character, such as:
Letter   Compressed Code
e        0
t        10
a        110
o        111
As you can see here, each code is unique, and no code is a prefix of another. The word 'eat' here will be encoded as '011010', again a 6-bit code, a compression to a quarter of the size.
This will be decoded in the following manner:
Input first character: '0'
Sequence: '0'
Matches character 'e'
Input second character: '1'
Sequence: '1'
No match
Input third character: '1'
Sequence: '11'
No match
Input fourth character: '0'
Sequence: '110'
Matched character 'a'
Input fifth character: '1'
Sequence: '1'
No match
Input sixth character: '0'
Sequence: '10'
Matched character 't'
Output: 'eat'
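The whole procedure fits in a few lines of Python. This is just a sketch of the table above, with illustrative encode/decode helpers of our own:

    # The prefix code from the table above: no code is a prefix of another.
    codes = {'e': '0', 't': '10', 'a': '110', 'o': '111'}
    decode_table = {v: k for k, v in codes.items()}

    def encode(text):
        return ''.join(codes[ch] for ch in text)

    def decode(bits):
        out, seq = [], ''
        for bit in bits:
            seq += bit
            if seq in decode_table:          # a match: emit the letter, start afresh
                out.append(decode_table[seq])
                seq = ''
        return ''.join(out)

    print(encode('eat'))     # 011010
    print(decode('011010'))  # eat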
If you compare the ASCII and compressed representations of 'eat', you will notice that the encoded result '011010' has fewer runs of repeated digits than '011001010110000101110100'. The redundancy has been squeezed out, and it is difficult to gain any further space savings by running the same compression scheme again. With a longer text you will notice that, in the compressed output, the symbols occur with an almost even distribution.
All this is great; however, these character frequencies/probabilities only hold on average. A typical text document might see decent compression with such a code, but 'Gadsby' (a novel written entirely without the letter 'e') won't get any love from such a compressor!
Compressors can overcome this by generating their own dictionary of frequencies from the data itself, and using that to assign codes. Even multi-character sequences can be assigned much shorter codes if they occur frequently enough.
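One classic way of turning measured frequencies into such a code is Huffman coding. Below is a minimal sketch (the sample sentence is arbitrary): the two least frequent groups are merged repeatedly, and every merge prepends one more bit to the codes inside it, so the rarest characters end up with the longest codes.

    import heapq
    from collections import Counter

    def huffman_codes(text):
        # Each heap entry: (frequency, tie-breaker, {character: code so far})
        heap = [(n, i, {ch: ''}) for i, (ch, n) in enumerate(Counter(text).items())]
        heapq.heapify(heap)
        tie = len(heap)
        while len(heap) > 1:
            f1, _, c1 = heapq.heappop(heap)      # the two least frequent subtrees
            f2, _, c2 = heapq.heappop(heap)
            merged = {ch: '0' + code for ch, code in c1.items()}
            merged.update({ch: '1' + code for ch, code in c2.items()})
            heapq.heappush(heap, (f1 + f2, tie, merged))
            tie += 1
        return heap[0][2]

    codes = huffman_codes("this is an example of a huffman tree")
    for ch, code in sorted(codes.items(), key=lambda kv: len(kv[1])):
        print(repr(ch), code)      # the frequent characters get the shortest codes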
Each compression mechanism has its own unique way of finding patterns and assigning codes. A compressor which 'understands' the data will compress better than one which doesn't, since it knows where to look for the patterns. This is why you are bound to get better compression of an image file with the lossless PNG format than with a standard ZIP. Yet no matter how sophisticated the compressor, and whatever kind of data it deals with, it most likely operates on these simple principles.
source: thinkdigit