Tag Archives: Data compression

Computer Algorithms: Lossy Image Compression with Run-Length Encoding

Introduction

Run-length encoding is a data compression algorithm that helps us encode large runs of repeating items by only sending one item from the run and a counter showing how many times this item is repeated. Unfortunately this technique is useless when trying to compress natural language texts, because they don’t have long runs of repeating elements. In the other hand RLE is useful when it comes to image compression, because images happen to have long runs pixels with identical color.

As you can see on the following picture we can compress consecutive pixels by only replacing each run with one pixel from it and a counter showing how many items it contains.

Lossless RLE for Images
Although lossless RLE can be quite effective for image compression, it is still not the best approach!

In this case we can save only counters for pixels that are repeated more than once. Such the input stream “aaaabbaba” will be compressed as “[4]a[2]baba”.

Actually there are several ways run-length encoding can be used for image compression. A possible way of compressing a picture can be either row by row or column by column, as it is shown on the picture below.

Row by row or column by column compression
Row by row or column by column compression.
Continue reading Computer Algorithms: Lossy Image Compression with Run-Length Encoding

Computer Algorithms: Data Compression with Prefix Encoding

Overview

Prefix encoding, sometimes called front encoding, is yet another algorithm that tries to remove duplicated data in order to reduce its size. Its principles are simple, however this algorithm tend to be difficult to implement. To understand why, first let’s take a look of its nature.

Please, have a look on the following dictionary.

use
used
useful
usefully
usefulness
useless
uselessly
uselessness

Instead of keeping all these words in plain text or transferring all them over a network, we can compress (encode) them with prefix encoding.

Prefix Encoding

Continue reading Computer Algorithms: Data Compression with Prefix Encoding

Computer Algorithms: Data Compression with Relative Encoding

Overview

Relative encoding is another data compression algorithm. While run-length encoding, bitmap encoding and diagram and pattern substitution were trying to reduce repeating data, with relative encoding the goal is a bit different. Indeed run-length encoding was searching for long runs of repeating elements, while pattern substitution and bitmap encoding were trying to “map” where the repetitions happen to occur.

The only problem with these algorithms is that not always the input stream of data is constructed out of repeating elements. It is clear that if the input stream contains many repeating elements there must be some way of reducing them. However that doesn’t mean that we cannot compress data if there are no repetitions. It all depends on the data. Let’s say we have the following stream to compress.

1, 2, 3, 4, 5, 6, 7

We can hardly imagine how this stream of data can be compressed. The same problem may occur when trying to compress the alphabet. Indeed the alphabet letters the very base of the words so it is the minimal part for word construction and it’s hard to compress them.

Fortunately this isn’t true always. An algorithm that tryies to deal with non repeating data is relative encoding. Let’s see the following input stream – years from a given decade (the 90’s).

1991,1991,1999,1998,1991,1993,1992,1992

Here we have 39 characters and we can reduce them. A natural approach is to remove the leading “19” as we humans often do.

91,91,99,98,91,93,92,92

Now we have a shorter string, but we can go even further with keeping only the first year. All other years will as relative to this year.

91,0,8,7,0,2,1,1

Now the volume of transferred data is reduced a lot (from 39 to 16 – more than 50%). However there are some questions we need to answer first, because the stream wont be always formatted in such pretty way. How about the next character stream?

91,94,95,95,98,100,101,102,105,110

We see that the value 100 is somehow in the middle of the interval and it is handy to use it as a base value for the relative encoding. Thus the stream above will become:

-9,-6,-5,-5,-2,100,1,2,5,10

The problem is that we can’t decide which value will be the base value so easily. What if the data was dispersed in a different way.

96,97,98,99,100,101,102,103,999,1000,1001,1002

Now the value of “100” isn’t useful, because compressing the stream will get something like this:

-4,-3,-2,-1,100,1,2,3,899,900,901,902

To group the relative values around “some” base values will be far more handy.

(-4,-3,-2,-1,100,1,2,3)(-1,1000,1,2)

However to decide which value will be the base value isn’t that easy. Also the encoding format is not so trivial. In the other hand this type of encoding can be useful in som specific cases as we can see bellow.
Continue reading Computer Algorithms: Data Compression with Relative Encoding

Computer Algorithms: Data Compression with Diagram Encoding and Pattern Substitution

Overview

Two variants of run-length encoding are the diagram encoding and the pattern substitution algorithms. The diagram encoding is actually a very simple algorithm. Unlike run-length encoding, where the input stream must consists of many repeating elements, as “aaaaaaaa” for instance, which are very rare in a natural language, there are many so called “diagrams” in almost any natural language. In plain English there are some diagrams as “the”, “and”, “ing” (in the word “waiting” for example), “ a”, “ t”, “ e” and many doubled letters. Actually we can extend those diagrams by adding surrounding spaces. Thus we can encode not only “the”, but “ the “, which are 5 characters (2 spaces and 3 letters) with something shorter. In the other hand, as I said, in plain English there are two many doubled letters, which unfortunately aren’t something special for run-length encoding and the compression ratio will be small. Even worse the encoded text may happen to be longer than the input message. Let’s see some examples.

Let’s say we’ve to encode the message “successfully accomplished”, which consists of four doubled letters. However to compress it with run-length encoding we’ll need at least 8 characters, which doesn’t help us a lot.

// 8 chars replaced by 8 chars!?
input: 	"successfully accomplished"
output:	"su2ce2sfu2ly a2complished"

The problem is that if the input text contains numbers, “2” in particular, we’ve to chose an escape symbol (“@” for example), which we’ll use to mark where the encoded run begins. Thus if the input message is “2 successfully accomplished tasks”, it will be encoded as “2 su@2ce@2sfu@2ly a@2complished tasks”. Now the output message is longer!!! than the input string.

// the compressed message is longer!!!
input:	"2 successfully accomplished"
output:	"2 su@2ce@2sfu@2ly a@2complished tasks"

Again if the input stream contains the escape symbol, we have to find another one, and the problem is that it is often too difficult to find short escape symbol that doesn’t appear in the input text, without a full scan of the text. Continue reading Computer Algorithms: Data Compression with Diagram Encoding and Pattern Substitution

Computer Algorithms: Data Compression with Bitmaps

Overview

In my previous post we saw how to compress data consisting of very long runs of repeating elements. This type of compression is known as “run-length encoding” and can be very handy when transferring data with no loss. The problem is that the data must follow a specific format. Thus the string “aaaaaaaabbbbbbbb” can be compressed as “a8b8”. Now a string with length 16 can be compressed as a string with length 4, which is 25% of its initial length without loosing any information. There will be a problem in case the characters (elements) were dispersed in a different way. What would happen if the characters are the same, but they don’t form long runs? What if the string was “abababababababab”? The same length, the same characters, but we cannot use run-length encoding! Indeed using this algorithm we’ll get at best the same string.

In this case, however, we can see another fact. The string consists of too many repeating elements, although not arranged one after another. We can compress this string with a bitmap. This means that we can save the positions of the occurrences of a given element with a sequence of bits, which can be easily converted into a decimal value. In the example above the string “abababababababab” can be compressed as “1010101010101010”, which is 43690 in decimals, and even better AAAA in hexadecimal. Thus the long string can be compressed. When decompressing (decoding) the message we can convert again from decimal/hexadecimal into binary and match the occurrences of the characters. Well, the example above is too simple, but let’s say only one of the characters is repeating and the rest of the string consists of different characters like this: “abacadaeafagahai”. Then we can use bitmap only for the character “a” – “1010101010101010” and compress it as “AAAA bcdefghi”. As you can see all the example strings are exactly 16 characters and that is a limitation. To use bitmaps with variable length of the data is a bit tricky and it is not always easy (if possible) to decompress it.

Bitmap Compression
Basically bitmap compression saves the positions of an element that is repeated very often in the message!

Continue reading Computer Algorithms: Data Compression with Bitmaps