This article consists of three main parts. We start with the principles of general data compression, since they apply directly to our subject. Then we focus on audio compression, or encoding. The last part is an overview of audio formats.
Data compression is a process that reduces the number of bits needed to store or transmit data. Compression is either lossless or lossy. Losslessly compressed data can be recovered, or decompressed, to exactly the original values; no information is lost in the process. Lossy compression, on the other hand, discards certain less important information, but achieves a higher compression ratio.
The volume of digitized information, such as digital video and audio signals, can be extremely large. It takes up enormous storage and would swamp even the fastest network connections during transmission. In fact, it would be difficult to run practical applications without effective compression. Data compression has therefore become a key technology in today's digital world, including communications, broadcasting, storage and multimedia entertainment.
For example, suppose we have a 100-minute high-definition movie at 1920x1080 resolution and 24 frames per second, uncompressed, with each pixel taking 3 bytes to store the red, green and blue intensities. The total storage required is 3*1920*1080*24*60*100 bytes, approximately 830 GB, or about 18 dual-layer Blu-ray discs (50 GB each). That's way too much!
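The arithmetic above is easy to verify; a quick sketch using the numbers from the example:

```python
# Uncompressed storage for a 100-minute 1920x1080 movie at 24 fps,
# with 3 bytes per pixel (red, green, blue).
bytes_per_frame = 1920 * 1080 * 3          # 6,220,800 bytes per frame
bytes_total = bytes_per_frame * 24 * 60 * 100
gib = bytes_total / 2**30                  # convert to GiB

print(f"{bytes_total} bytes is about {gib:.0f} GiB")
# At 50 GB per dual-layer Blu-ray disc:
print(f"about {bytes_total / 50e9:.1f} discs")
```

The exact total is 895,795,200,000 bytes, roughly 834 GiB.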
Of course, in video streaming compression becomes even more useful, because many consumers have strict bandwidth limitations. Play an uncompressed 4K video file and you would need on the order of a gigabyte per second. Very few drives can even sustain that, let alone send it across a network.
Data compression is a must in many applications; it makes efficient use of the available storage and bandwidth.
There are often segments that occur repeatedly in the data. They are redundant and can be removed or reduced during encoding. Let's look at a simple example. Here are two ways to represent "the number 1 followed by 100,000 zeros". If all the zeros are printed out, they fill a small book. But the number can simply be represented, or encoded, as "1 followed by 100,000 zeros". The latter is a compressed version that takes far less capacity to encode the same information.
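This idea of collapsing repeated segments is exactly what run-length encoding does. A minimal sketch (the function name and representation are illustrative, not a standard API):

```python
def rle_encode(s: str) -> list[tuple[str, int]]:
    """Collapse runs of repeated characters into (char, count) pairs."""
    runs: list[tuple[str, int]] = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((ch, 1))              # start a new run
    return runs

# "1" followed by 100,000 zeros compresses to just two runs.
big_number = "1" + "0" * 100_000
print(rle_encode(big_number))  # [('1', 1), ('0', 100000)]
```

Two small tuples encode what would otherwise take 100,001 characters to write out.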
Similarly, there is a great deal of redundancy in multimedia data. For instance,
In short, the theoretical basis of compression is information theory. From this perspective, compression removes the redundancy in the information: the definite or inferable parts are dropped, the uncertain parts are retained, and the original data is replaced by a smaller data set carrying the same, or at least comprehensible, information. It is not hard to imagine that there are many data compression methods, each based on either reversible redundancy compression or irreversible entropy compression. Redundancy compression is often used for disk files, data communications and meteorological satellite cloud images, where no loss is allowed during compression. But its compression ratio is low, only a few times the original, far from satisfying the requirements of multimedia applications.
In multimedia applications, entropy compression is used almost universally. It is lossy but achieves a much higher compression ratio. There are two main approaches to entropy compression: feature extraction and quantization. Fingerprint recognition is a typical example of the former; the latter is the more general method.
The following figure shows the general flow of the compression process. It consists of Input Symbols, Compressor, Coded Symbols, Decompressor and Output Symbols.
The original input symbols are encoded by the compressor, whose output is the encoded data, or coded symbols. Usually at a later time, the encoded data is fed to a decompressor, where it is decoded and reconstructed, and the original data is output as a sequence of symbols.
If the output data and input data are always exactly the same, then this compression scheme is called lossless, also known as a lossless encoder. Otherwise, it is a lossy compression scheme.
There are lots of well-established data compression algorithms that could significantly reduce data size. However, they do have some limitations.
Digital audio compression, a specific application of general data compression theory, allows the efficient storage and transmission of audio data. It is widely used in multimedia applications and online audio/video streaming. The various audio compression techniques offer different levels of complexity, compressed audio quality, and amount of data compression.
Sound is basically a wave that varies continuously with time. To make it easier for a computer to process, we turn it into a digital signal by sampling the audio input at regular, discrete intervals of time and quantizing the sampled values into a discrete number of evenly spaced levels. The digital audio data is thus a sequence of binary values representing the quantizer level of each audio sample, and offers many advantages: high noise immunity, stability, and reproducibility. It also allows the efficient implementation of audio processing functions such as mixing, filtering, and equalization.
Digital signals are made up of samples. To accurately reproduce a sound wave of a given frequency, we must sample its amplitude at least twice that frequency, per the Nyquist-Shannon sampling theorem. For example, to reproduce a sound with a frequency of 20,000 Hz, you have to sample it at 40,000 Hz or above (44.1 kHz for CD). Since most people lose the ability to hear sounds above 18,000 Hz by the time they are adults, the audible difference between sample rates of 44.1 kHz, 96 kHz and 192 kHz is almost negligible.
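Sampling itself is nothing more than evaluating the waveform at evenly spaced instants. A minimal sketch, assuming a pure sine tone (the function name and parameters are illustrative):

```python
import math

def sample_sine(freq_hz: float, sample_rate_hz: int, duration_s: float) -> list[float]:
    """Sample a sine wave of the given frequency at the given rate."""
    n = int(sample_rate_hz * duration_s)
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate_hz)
            for t in range(n)]

# A 20 kHz tone sampled at the CD rate of 44.1 kHz: one second of audio
# yields 44,100 samples, comfortably above the 40 kHz Nyquist rate.
samples = sample_sine(20_000, 44_100, 1.0)
print(len(samples))  # 44100
```

Sampling the same tone below 40 kHz would alias it to a lower, wrong frequency.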
The sample rate is the sampling frequency, i.e., how many amplitude values of the sound we capture per second. Each sample is represented by a number of bits. The bit depth, or sample resolution, is the number of bits per sample. It determines how accurately a sample's loudness value can be represented: a higher bit depth means more levels of loudness and a better approximation of the sample value.
Suppose we have a sine sound wave that looks like a squiggly line, the peaks and valleys of a mountain range. The bit depth determines how thinly you can slice the mountain vertically. The more slices, i.e., the higher the bit depth, the more closely the curve of the original sound wave is preserved. A lower bit depth slices the curve thickly, and you end up with something more like stairs than the smooth natural curve of peaks and valleys in an analog waveform.
A 16-bit depth can represent 65,536 levels of loudness. Therefore, if one sample is slightly louder or softer than another, the difference in their corresponding loudness levels is very small. For the same pair of samples, a low bit depth such as 8 bits (256 levels of loudness) makes the level difference much bigger, which sounds artificial, with the sound jumping between two levels.
A 24-bit depth is capable of about 16.8 million levels, differences the average human ear simply isn't precise enough to distinguish on mid-range consumer-grade hardware. But the extra headroom can be useful to audio engineers for noise control.
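The effect of bit depth on accuracy can be sketched with a toy quantizer that snaps a sample to the nearest of 2^bits evenly spaced levels (the function and scaling here are illustrative, not any codec's actual quantizer):

```python
def quantize(x: float, bit_depth: int) -> float:
    """Snap a sample in [-1.0, 1.0] to the nearest of 2**bit_depth levels."""
    levels = 2 ** bit_depth
    # Scale to integer codes 0 .. levels-1, round, then scale back.
    code = round((x + 1.0) / 2.0 * (levels - 1))
    return code / (levels - 1) * 2.0 - 1.0

x = 0.3333
for bits in (8, 16, 24):
    err = abs(quantize(x, bits) - x)
    print(f"{bits}-bit quantization error: {err:.2e}")
```

Each extra bit halves the spacing between levels, so the 16-bit error is roughly 256 times smaller than the 8-bit error.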
Normally the bit rate is the sample rate multiplied by the number of bits per sample (and the number of channels). For compressed audio data, the bit rate is much smaller since the data size is significantly reduced. For instance, a 320 kbps mp3 file has a bit rate of 320,000 bits per second, yet it decodes to audio with a sample rate of 44,100 samples per second at 16 bits per sample, which translates to a bit rate of 705.6 kb/s (44.1k*16) per channel, or 1411.2 kb/s for a stereo file (two channels). So why is the stored bit rate more than four times smaller (1411.2k / 320k is about 4.4)?
The answer is that mp3 files don't store the raw audio bit stream. Instead, the original audio data is encoded and compressed, so the same audio takes fewer bits while most of the audible information is kept. Along with this compressed data, the mp3 file carries the metadata: 44,100 samples per second and 16 bits per sample. In transmission or storage the bit rate is therefore smaller and fewer bits are needed, but when the file is decoded and played back, the original sample rate and bit depth/resolution are used to reconstruct the audio.
In summary, the bit rate is simply the total number of bits of information stored per second of sound. It's a function of the sample rate, the bit resolution, and whatever compression algorithm is used to squeeze the information into a smaller space.
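The CD-versus-mp3 arithmetic above can be checked in a few lines:

```python
# Uncompressed CD-quality PCM bit rate vs. a 320 kbps MP3.
sample_rate = 44_100   # samples per second
bit_depth = 16         # bits per sample
channels = 2           # stereo

pcm_bps = sample_rate * bit_depth * channels
print(pcm_bps)                 # 1411200 bits/s, i.e. 1411.2 kb/s
print(pcm_bps / 320_000)       # about 4.4x the 320 kbps MP3 bit rate
```

The mp3 encoder must therefore shed roughly 77% of the raw PCM bits while keeping the audio perceptually close to the original.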
There are a variety of audio encoding algorithms. Among them are Pulse Code Modulation (PCM), the A-law and μ-law algorithms, Adaptive Differential Pulse Code Modulation (ADPCM) and the Moving Picture Experts Group (MPEG) audio compression algorithms.
PCM is a method for digitally encoding an analog audio signal. The waveform is sampled at regular, equal intervals to generate a sequence of discrete amplitude values, each of which is quantized to the nearest value within a set of quantization levels. The number of levels is determined by the bit depth: a 16-bit depth means 65,536 levels. The encoded audio data is a stream of binary bits (0s and 1s) and is uncompressed.
There is a suite of PCM variants, including linear pulse-code modulation (LPCM), the A-law and μ-law algorithms, and adaptive differential pulse-code modulation (ADPCM). PCM and its variants are the standard form of digital audio in computers, compact discs, digital telephony and other digital audio applications.
Linear pulse-code modulation (LPCM) is a specific type of PCM in which the quantization levels are spaced uniformly, as opposed to being logarithmic functions of the amplitude as in the A-law and μ-law algorithms. It is so common that the term is often used interchangeably with PCM, though PCM is the more general term.
Commonly used bit depths for PCM are 8, 16, 20 or 24 bits per sample, and common sampling frequencies are 48 kHz, as used in DVD video, and 44.1 kHz, as used in CDs. Arguably, sampling rates of 192 kHz or higher may not result in a superior listening experience, given the limits of the human auditory system.
LPCM is widely employed in many audio formats, including audio CD (Compact Disc), AES3, the Au file format, raw audio, WAV, AC3 (Dolby Digital), MPEG audio, and the Audio Interchange File Format (AIFF). It is also defined as a part of the DVD (1995) and Blu-ray (2006) sound and video recording standards, and of a group of other digital video and audio storage formats.
The A-law and μ-law algorithms are variants of PCM. The transformation is essentially logarithmic in nature and allows the 8-bit output codes to cover a dynamic range equivalent to roughly 13 (A-law) or 14 (μ-law) bits of linearly quantized values. Both are 8-bit PCM schemes defined in digital telephony standards: A-law is used in Europe, μ-law in North America and Japan.
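The companding idea can be sketched with the continuous μ-law formula, F(x) = sgn(x)·ln(1 + μ|x|)/ln(1 + μ) with μ = 255 (real G.711 codecs use a piecewise-linear table approximation of this curve; the function names here are illustrative):

```python
import math

MU = 255  # mu-law parameter used in North American/Japanese telephony

def mulaw_compress(x: float) -> float:
    """mu-law companding of a sample x in [-1.0, 1.0]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mulaw_expand(y: float) -> float:
    """Inverse transform, back to the linear domain."""
    return math.copysign((math.exp(abs(y) * math.log1p(MU)) - 1) / MU, y)

# Quiet samples get proportionally more of the output range,
# which is what lets 8 bits cover ~14 bits of dynamic range.
for x in (0.01, 0.1, 1.0):
    print(f"{x:5.2f} -> {mulaw_compress(x):.3f}")
```

A linear input of 0.01 maps to roughly 0.23 of full scale, so small amplitudes are quantized much more finely than loud ones.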
The Moving Picture Experts Group (MPEG) audio encoding algorithm is part of an International Organization for Standardization (ISO) standard for high-fidelity audio compression. It can achieve transparent, perceptually lossless compression. During the development of the standard, the ISO/MPEG experts conducted a series of extensive subjective listening tests. The results indicated that even with a 6-to-1 compression ratio (stereo, 16-bit-per-sample audio sampled at 48 kHz compressed to 256 kilobits per second) and under optimal listening conditions, expert listeners were unable to distinguish between coded and original audio clips with statistical significance; see Grewin and Ryden for the details of the setup, procedures, and results of these tests.
The high performance of the MPEG audio encoding algorithm is due to the psychoacoustic model used in the encoding process. The model analyzes the input audio and reduces or entirely discards those parts of it that are either inaudible to human ears or masked by a louder sound. The encoder thus ends up with far fewer bits to represent the original audio information.
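The masking idea can be illustrated with a toy sketch. This is emphatically not the MPEG psychoacoustic model, which works on filter banks and frequency-dependent masking thresholds; here we simply compute a spectrum with a naive DFT and zero out bins far below the loudest component. The -40 dB threshold is an arbitrary illustrative choice:

```python
import cmath
import math

def dft(samples: list[float]) -> list[complex]:
    """Naive O(n^2) discrete Fourier transform (fine for a toy example)."""
    n = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n)) for k in range(n)]

def mask_quiet_bins(spectrum: list[complex], threshold_db: float = -40.0) -> list[complex]:
    """Toy 'masking' step: drop bins more than threshold_db below the
    loudest bin, as a stand-in for perceptually irrelevant content."""
    peak = max(abs(c) for c in spectrum)
    floor = peak * 10 ** (threshold_db / 20)
    return [c if abs(c) >= floor else 0j for c in spectrum]

# A loud tone in bin 5 plus a very quiet tone in bin 13: after masking,
# only the loud tone's two conjugate bins survive.
n = 64
signal = [math.sin(2 * math.pi * 5 * t / n)
          + 1e-4 * math.sin(2 * math.pi * 13 * t / n)
          for t in range(n)]
kept = sum(1 for c in mask_quiet_bins(dft(signal)) if c != 0)
print(kept)
```

Discarded bins need no code bits at all, which is where the large savings come from in a real encoder.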
There are three audio compression layers, I, II and III, used across MPEG-1, MPEG-2 and MPEG-2.5. The first two versions are official ISO standards; MPEG-2.5 is a proprietary, unofficial extension. In total there are nine combinations of version and layer, i.e., basically nine encoding formats.
MPEG-1 Layer I had limited adoption in its time and quickly became obsolete due to significant performance improvements in newer encoding algorithms. Layer I audio files typically use the extension ".mp1", or sometimes ".m1a".
MPEG-1 Layer II or MP2 provides high quality stereo sound at about 192 kbit/s. Decoding MP2 audio is computationally simple relative to MP3, AAC, etc.
MPEG-1/MPEG-2 Layer III, or MP3, is the most popular and is supported by almost every media player on desktops, smartphones and tablets. MP3 refers either to an encoding format or to a file format, depending on the context. As an encoding format, it was first defined in MPEG-1 Audio Layer III, later extended in MPEG-2 Audio Layer III to support lower sampling rates and bit rates, and further extended to even lower sampling rates and bit rates in MPEG-2.5, though the latter never became a standard. Its lossy encoding typically achieves a 75 to 95% reduction in size while preserving good quality. As a file format, it contains an elementary stream of MPEG-1 or MPEG-2 Audio encoded data.
MPEG-2 Audio enhances MPEG-1 audio by allowing the coding of audio programs with more than two channels, up to 5.1 multichannel. It also defines additional bit rates and sample rates for MPEG-1 Audio Layers I, II and III.
The last one is MPEG-4 audio: Advanced Audio Coding (AAC) and its variants, specified in MPEG-2 Part 7 and MPEG-4 Part 3. It is also a lossy encoding format. It is designed to be the successor of the MP3 format, and generally achieves better sound quality than MP3 at the same bit rate; in particular, it is clearly superior at bit rates of 128 kbps or lower. However, as the bit rate increases, the efficiency of the encoding algorithm matters less, and the intrinsic advantage AAC holds over MP3 no longer dominates audio quality. Other features include
Several file extensions are used for AAC container formats, including
AAC is the default audio format in Apple's products: iPhone, iPod and iPad. It is also used by iTunes, Nintendo consoles and Android phones.
An audio format defines an audio data layout and its associated attributes or metadata. Every audio file on your computer is created according to a specific format. Usually, "audio format" means one of two things: an audio coding format or a container format.
There are three major categories of audio file formats:
The most common uncompressed audio formats are LPCM, WAV, AIFF, AU, and BWF.
Popular formats in this category include
MP3 is the most well-known audio file format, along with AAC (Apple's iTunes format) and Ogg Vorbis. Other formats include Opus, Musepack, ATRAC and Windows Media Audio Lossy (WMA lossy).
Audio compression includes lossless and lossy algorithms. Lossless encoding gives you the best quality but needs more storage space and transmission bandwidth. Lossy encoding, on the other hand, achieves acceptable quality with a substantial reduction in size. Common lossless audio formats include LPCM, FLAC, ALAC, WavPack and Monkey's Audio; lossy audio formats include MP3, AAC and Ogg Vorbis. Lossy encoding is usually more complicated and requires more computing power. Lossless audio is preferred for raw data archives and audio editing. Both are widely used, for different purposes. MP3 is the most popular lossy audio format, supported by almost every media player, and is often the target of audio format conversion.