Algorithms
Raiting:
7

The simplest compression algorithms: RLE and LZ77


imageLong ago, when I was a naive schoolboy, I suddenly felt curious. What is a magical way that data takes up less space in an archive? I started surfing the Internet in order to find an answer. So I found many interesting articles with fairly detailed information on this topic. But all of them did not seem to me easy, the code listings were very difficult to understand.

The goal of this article is to give an idea about the simplest compression algorithms for people whose knowledge and experience so far don’t allow comprehending more professional publications. Namely, I will explain in the simple way about some of the simplest algorithms and give examples of their implementation.

Unfortunately, I will not discuss details about the implementation of coding process and things like effective search of the string occurrences. The article will give an idea only about the algorithms and ways of presenting results of their work.

RLE - compactness of uniformity

RLE algorithm is probably the simplest of all, its essence lies in the coding of iterations. In other words, we take a sequence of identical elements and pair them "number / value." For example, the string “AAAAAAAABCCCC” can be converted into an entry like "8×A, B, 4×C”. Generally, that’s all you need to know about the algorithm.

Example of implementation

Let’s suppose we have a certain set of integer coefficients, which can take values from 0 to 255. Using the logical way we have come to the conclusion that it is reasonable to keep this set as a byte array:
unsigned char data[] = {
0, 0, 0, 0, 0, 0, 4, 2, 0, 4, 4, 4, 4, 4, 4, 4,
80, 80, 80, 80, 0, 2, 2, 2, 2, 255, 255, 255, 255, 255, 0, 0
};

Most people are a much more familiar to see these data as a hex-dump:
0000: 00 00 00 00 00 00 04 02 00 04 04 04 04 04 04 04
0010: 50 50 50 50 00 02 02 02 02 FF FF FF FF FF 00 00

We thought that it is better to compress these sets. To do this, we analyzed them and found a pattern: there very often come across the subsequences, consisting of the identical elements. Of course, RLE can be very useful for this!

Let’s encode our data, using freshly knowledge: 6×0, 4, 2, 0, 7×4, 4×80, 0, 4×2, 5×255, 2×0.
It's time to present our results in readable form the computer. For this purpose, in the data stream somehow we must separate the single bytes from the encoded strings. Since the entire range of bytes are used by our data, and then just select any range of values for our goals will fail.

There are two ways out of this situation:
As an indicator of the compressed string to allocate one byte value and screen them in the event of an impact with the real data. For example, if value 255 is used for the service purposes, then at meeting of the value in the input data we are forced to write "255, 255" and after the indicator we must use maximum of 254.
To structure the encoded data indicating the number not only for repeatable, but then for the next single elements. Then we will know in advance where the data is.

The first method does not seem to be effective in this case, so we will use the second one.

So now we have two types of sequences: the chains of single elements (such as "4, 2, 0") and the chains of identical elements (such as "0, 0, 0, 0, 0, 0"). We extract in service bytes one bit for the sequence type: 0 - single elements, 1 - the same elements. Let’s take for this high bit of the byte.

We will store the length of the sequences in the remaining 7 bits, i.e. the maximum length of the encoded sequence is 127 bytes. We could separate for the service purposes two bytes, but in our case, these long sequences are extremely rare, so it's easier and more economical just to break them into shorter ones.

It turns out that we will first write the length of the sequence in the output stream, and then either one repeated value or a chain of non repeating elements of the specified length.

The first we have a couple of unused values in this situation. There cannot be zero-length sequences, so we can increase the maximum length up to 128 bytes, subtracting 1 from the length during encoding and adding when decoding. Thus, we can encode the lengths from 1 to 128 instead of the lengths from 0 to 127.

The second there are no sequences of identical elements of singular length. Therefore, we will subtract another 1 from the length value of such sequences during encoding; thereby increasing their maximum length to 129 (the maximum length of single elements chain is still equal to 128). That is, the chains of identical elements can have the length from 2 to 129.
Let’s encode our data again, but now it should be understandable form for the computer. We will write the service bytes as [T|L], where T is the type of sequence, and L is length. We will note immediately that the lengths are recorded in the changed form: at T=0 we subtract 1 from L, at T=1 we subtract 2.

[1|4], 0, [0|2], 4, 2, 0, [1|5], 4, [1|2], 80, [0|0], 0, [1|2], 2, [1|3], 255, [1|0], 0

Let's try to decode our result:
[1|4]. T=1, then the next byte will be repeated L +2 (4 + L+2 (4+2) times: 0, 0, 0, 0, 0, 0.
[0|2]. T=0, then we just read L+1 (2+1) bytes: 4, 2, 0.
[1|5]. T=1, we repeat the next byte 5+2 times: 4, 4, 4, 4, 4, 4, 4.
[1|2]. T=1, we repeat the next byte 2+2 times: 80, 80, 80, 80.
[0|0]. T=0, we read 0+1 bytes: 0.
[1|2]. T=1, we repeat 2+2 times: 2, 2, 2, 2.
[1|3]. T=1, we repeat 3+2 times: 255, 255, 255, 255, 255.
[1|0]. T=1, we repeat 0+2 times: 0, 0.

Now it is the final step: we store the result as the byte array. For example, the pair [1|4] that is packed into a byte would look like this:
image
As a result we get the following:
0000: 84 00 02 04 02 00 85 04 82 80 00 00 82 02 83 FF
0010: 80 00

So using this uncomplicated way in this example of input data, we received 18 out of 32 bytes. It is not bad at all.

Possible improvements

The algorithm effectiveness depends not only on the algorithm itself, but also on how it is implemented. Therefore, for different data there can be developed different variations of encoded data. For example, when encoding the image there can be made chains of variable length: to separate 1 bit for indication of a long chain, and if it is set to one, then the length should be stored in the next byte, too. So we give up short chains (65 elements instead of 129), but it allows using only three bytes to encode the chains of length up to 16385 elements (214 + 2)!

Additional efficiency can be achieved through the use of heuristic methods of encoding. For example, we encode using our way the following chain: “ABBA”. We get "[0|0], A, [1|0], B, [0|0], A” that is that we turned 4 bytes into 6! The more short rotating sequences of different types we have, the more redundant data is there.

LZ77 - brevity in the repetition

LZ77 is one of the simplest and most well-known in LZ algorithms family. It is named in honor of its founders: Abraham Lempel and Jacob Ziv. In the title 77 mean 1977, this is a year when an article was published, describing this algorithm.

The basic idea is to encode the same sequence of elements. That is, if in the input data occurs any chain of elements more than once, then all subsequent occurrences of it can be replaced with links to its first instance.

Like the rest algorithms, LZ77 uses a dictionary that stores previously encountered sequences. For this it applies the principle of the so-called "Sliding window". This window is a dynamic dictionary for a given algorithm, and two attributes corresponds to each element that is inside of it, they are the position in the window and the length.

Example of implementation

Now let’s try to encode something and generate some suitable string for this:

“The compression and the decompression leave an impression. Hahahahaha!”

Here's how it will look like in the memory (encoding ANSI):
0000: 54 68 65 20 63 6F 6D 70 72 65 73 73 69 6F 6E 20
0010: 61 6E 64 20 74 68 65 20 64 65 63 6F 6D 70 72 65
0020: 73 73 69 6F 6E 20 6C 65 61 76 65 20 61 6E 20 69
0030: 6D 70 72 65 73 73 69 6F 6E 2E 20 48 61 68 61 68
0040: 61 68 61 68 61 21

We have not decided on the size of the window yet, but we agree that it will be bigger that the size of the encoding string. Let's try to find all repeating symbol chains. The chain will be assumed the symbol sequence of length that is greater than one. If the chain is part of a long repeating chain, we will ignore it.

“The compression and t[he ]de[compression ]leave[ an] i[mpression]. Hah[ahahaha]!”

In order to have a greater clarity, let’s take a look at the diagram, which shows the repeating adequacy of sequences and their first entries:
image
Perhaps the only unclear thing here will be the “Hahahahaha!” sequence, because the short chain «ah» corresponds to the chain of «ahahaha» symbols. But there is nothing unusual, we have used some trick allowing the algorithm to work as described earlier RLE.

The fact is that when unpacking we will read out from a dictionary the specified number of symbols. Since the entire sequence is a periodic, i.e. the data is repeated with a certain period in it, and the symbols of the first period will be directly in front of the unpacking position, so using them we can recreate the whole chain, simply copying the symbols of the previous period to the next.
image
I guess it is clear so far. Now let’s replace the iterations that are found in the dictionary. We will write the link in the format of [P|L], where P is the position of first chain occurrence in a string, L is its length.

“The compression and t[22|3]de[5|12]leave[16|3] i[8|7]. Hah[61|7]!”

But do not forget that we are dealing with a sliding window. For a better understanding that links do not depend on the size of the window, let’s replace the absolute positions with the difference between them and the current position of the encoding.

“The compression and t[20|3]de[22|12]leave[28|3] i[42|7]. Hah[2|7]!”

Now it is sufficient to subtract P from the current position of encoding to get the absolute position in the string.

It's time to decide on the size of the window and a maximum length of the encoding phrase. Since we deal with a text, there rarely will be met the particularly long repetitive sequences. So let’s allocate 4 bits limit for 15 symbols for their length.

So now it does not depend on the window size how deeply we will search for the same chains. Since we deal with a small text, then it will be fine to add the number of bits up to two bytes: we will address the links in the range of 4096 bytes, using 12 bits.

We know that not all values can be used when dealing with RLE. Obviously, the link may have a minimum value of 1, therefore, to address back to the range 1… 4096, we have to subtract 1 from the link when encoding and when decoding we add 1 back. We do the same with the lengths of sequences: instead of 0…15 we will use a range of 2…17, as far as we don’t work with zero of the lengths, and the separate symbols are not sequences.

So, let’s imagine our encoded text considering these corrections:

“The compression and t[19|1]de[21|10]leave[27|1] i[41|5]. Hah[1|5]!”

Now, somehow we need to separate the compressed chains from the rest of the data. The most common way is to use again the structure and specify where is the compressed data, and where is not. To do this, we divide the encoded data in groups of eight elements (symbols or links), and before each of these groups we will insert a byte, where each bit corresponds to the type of element: 0 is for a symbol and 1 is for a link.

We break into the groups:
“The comp”
“ression”
“and t[19|1]de”
“[21|10]leave[27|1]”
“i[41|5]. Hah[2|5]”
“!”

Compose the groups:

“{0,0,0,0,0,0,0,0}The comp{0,0,0,0,0,0,0,0}ression {0,0,0,0,0,1,0,0}and t[19|1]de{1,0,0,0,0,0,1,0}[21|10]leave[27|1] {0,1,0,0,0,0,0,1}i[41|5]. Hah[1|5]{0}!”

Thus, if we meet at unpacking bit 0, then we simply read the symbol in the output stream, if the bit is 1, we read the link, and using link we read the sequence from the dictionary.

Now we just have to group the result into a byte array. We agree that we use the bits and the bytes in order from high to low. Let's see how the link bytes are packed using this example [19|1]:
image
Finally, our compressed stream will look like this:

0000: 00 54 68 65 20 63 6f 6d 70 00 72 65 73 73 69 6f #The comp#ressio
0010: 6e 20 04 61 6e 64 20 74 01 31 64 65 82 01 5a 6c n #and t##de###l
0020: 65 61 76 65 01 b1 20 41 69 02 97 2e 20 48 61 68 eave## #i##. Hah
0030: 00 15 00 21 ###!

Possible improvements

Generally, here everything will be fine that is described for the RLE. In particular, to demonstrate the use of heuristic approach during encoding, let’s consider the following example:

“The long goooooong. The loooooower bound.”

Let’s find the sequences for only the word “loooooower”:

“The long goooooong. The [lo][ooooo]wer bound.”

To encode such a result, we will need four bytes for the links. However, it would be more economical to do so:

“The long goooooong. The l[oooooo]wer bound.”

Then we would spend one byte less.

Conclusion

Despite its simplicity, these algorithms are still widely used in various fields of IT.

Their advantage are the simplicity and the speed, because on their principles and their combinations could be based more sophisticated and efficient algorithms.

I hope that I have explained the essence of these algorithms. Perhaps, it may help someone to understand the basics and start looking towards the more serious things.
KlauS 17 june 2012, 12:11
Vote for this post
Bring it to the Main Page
 

Comments

Leave a Reply

B
I
U
S
Help
Avaible tags
  • <b>...</b>highlighting important text on the page in bold
  • <i>..</i>highlighting important text on the page in italic
  • <u>...</u>allocated with tag <u> text shownas underlined
  • <s>...</s>allocated with tag <s> text shown as strikethrough
  • <sup>...</sup>, <sub>...</sub>text in the tag <sup> appears as a superscript, <sub> - subscript
  • <blockquote>...</blockquote>For  highlight citation, use the tag <blockquote>
  • <code lang="lang">...</code>highlighting the program code (supported by bash, cpp, cs, css, xml, html, java, javascript, lisp, lua, php, perl, python, ruby, sql, scala, text)
  • <a href="http://...">...</a>link, specify the desired Internet address in the href attribute
  • <img src="http://..." alt="text" />specify the full path of image in the src attribute