Data Compression

A comparison of compression software for transferring large amounts of data


It all started as a simple task: downloading a large amount of data over a 100 Mbit network using rsync. The question arose whether this process could be sped up. The top utility showed that encryption took no more than 10 percent of the CPU on the source server, so it was decided to try compressing the data. At that point it was unclear to me whether there would be enough CPU power to pack the data at the required speed, so the lowest compression level was set, namely the flag --compress-level=1 for rsync. It turned out that the CPU load did not exceed 65%, i.e. there was enough CPU power, while the download speed increased slightly.

After that, the question arose of analyzing which commonly available compression programs are best suited for transferring large amounts of data over the network.

Collection of source data


The first task was to collect source data for the analysis. The candidates were chosen from the world of *nix compression programs. All programs were installed from the standard Ubuntu 11.10 repositories:
Program   Version   Tested levels   Default
lzop      1.03      1, 3, 7-9       3
gzip      1.3.12    1-9             6
bzip2     1.0.5     1-9             9
xz        5.0.0     0-9             7
Here are the official websites: lzop, gzip, bzip2, and xz.

bzip2 is based on the BWT algorithm, while the rest are based on the LZ77 algorithm and its modifications.

For lzop, compression levels 2 through 6 are identical, so they were not tested separately.

The test system was a laptop with an Intel P8600 (Core 2 Duo, 2.4 GHz) and 4 GB of RAM, running Ubuntu 11.10 64-bit.

Several different data sets were tested:

The source code of a large C++ project (openfoam.com), packed in tar. It contains source code, documentation in PDF format, and some poorly compressible scientific data.
Uncompressed size: 127641600 bytes = 122 MB.

The binary distribution of a large project (paraview.org), packed in tar. It consists mainly of .so library files.
Uncompressed size: 392028160 bytes = 374 MB.

A set of binary scientific data.
Uncompressed size: 151367680 bytes = 145 MB.

And some more classic sets:

The Linux kernel headers, packed in tar.
Uncompressed size: 56657920 bytes = 55 MB.

The source code of the gcc compiler, packed in tar.
Uncompressed size: 490035200 bytes = 468 MB.

The measurement procedure was as follows: for each program and compression level, the source file was compressed into a temporary file while measuring the time; the size of the compressed file was recorded; then the file was unpacked, again measuring the time. The unpacked file was compared with the source one just in case, and then the temporary and unpacked files were deleted.

In order to eliminate the influence of disk I/O on the results, all the files were placed in /run/shm (shared memory in Linux, which can be used as a file system).

Each measurement was done three times, and the minimum of the three times was used as the result.
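For reference, here is a rough sketch in Python of what such a measurement loop could look like. This is not the exact script that was used: the file paths are placeholders, and it assumes that all four programs can read from stdin and write to stdout (gzip, bzip2, xz, and lzop all support this).

    # A sketch of the measurement loop described above. Paths are illustrative.
    import os
    import subprocess
    import time

    SRC = "/run/shm/testdata.tar"          # source file kept in shared memory
    TMP = "/run/shm/testdata.compressed"
    OUT = "/run/shm/testdata.unpacked"

    def run_once(cmd_pack, cmd_unpack):
        """Compress SRC to TMP, then decompress TMP to OUT, timing both steps."""
        t0 = time.monotonic()
        with open(SRC, "rb") as fin, open(TMP, "wb") as fout:
            subprocess.run(cmd_pack, stdin=fin, stdout=fout, check=True)
        t_pack = time.monotonic() - t0

        t0 = time.monotonic()
        with open(TMP, "rb") as fin, open(OUT, "wb") as fout:
            subprocess.run(cmd_unpack, stdin=fin, stdout=fout, check=True)
        t_unpack = time.monotonic() - t0

        size = os.path.getsize(TMP)
        # sanity check: the round trip must reproduce the original file
        assert subprocess.run(["cmp", "-s", SRC, OUT]).returncode == 0
        return t_pack, t_unpack, size

    for prog, levels in [("lzop", [1, 3, 7, 8, 9]), ("gzip", range(1, 10)),
                         ("bzip2", range(1, 10)), ("xz", range(0, 10))]:
        for lvl in levels:
            # best (minimum) of three runs, as in the text
            results = [run_once([prog, f"-{lvl}", "-c"], [prog, "-d", "-c"])
                       for _ in range(3)]
            t_pack = min(r[0] for r in results)
            t_unpack = min(r[1] for r in results)
            size = results[0][2]
            print(f"{prog} -{lvl}: pack {t_pack:.1f}s, unpack {t_unpack:.1f}s, {size} bytes")
            os.remove(TMP)
            os.remove(OUT)

Keeping the source, temporary, and unpacked files under /run/shm keeps disk I/O out of the measured times, as described above.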

The results of measurements


The results are shown as graphs with packing time on one axis and compressed file size on the other.

Fig. 1 shows the results for the gcc compiler source code:
[graph]
Fig. 2 shows the results for the ParaView binary distribution:
[graph]
Fig. 3 shows the results for the set of binary scientific data:
[graph]
There are also graphs for the kernel headers and the OpenFOAM source code.

Observations:


At levels 7-9, lzop becomes very slow without any noticeable gain in compressed size, and it loses badly to gzip. These levels are practically useless for lzop.

lzop is very fast at the default level, as expected. Level 1 does not gain much time and gives a larger file.

gzip performs predictably: the graph has a regular shape as the compression level changes.

bzip2 turned out to be not very competitive. In most cases it loses to xz in compression. The exceptions are the kernel headers and the gcc source code (purely textual data), where the two are on par. Given this, it is rather strange that bzip2 is more widespread on the Internet than xz; this is probably because the latter is still quite young and has not yet become a de facto standard.

It is also worth noting that for bzip2 the different levels change the compression ratio and the running time only slightly.

xz achieves a high compression ratio at the higher levels, but it works very slowly and requires a large amount of memory for compression (about 690 MB). The memory required for unpacking is much smaller, but it can still be a limitation for some uses.

At low levels, xz comes close to gzip -9 in running time while providing a better compression ratio.

In general, xz covers a fairly wide range of compression ratios and times.

Unpacking is quite fast for the LZ77-based programs, and as the compression level increases (within the same program) the unpacking time decreases noticeably.

bzip2 unpacks more slowly, and its unpacking time increases with the compression level.

Analysis: Scenario 1 - sequential operations


Let us consider the following scenario of transferring data over the network: first the data is packed, then it is sent over the network, and then it is unpacked on the receiving side. The total time will then be:

t = t_c + s/bw + t_d

where t_c is the packing time, s is the size of the compressed file in bytes, bw is the network bandwidth in bytes/s, and t_d is the unpacking time.
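For example, with made-up numbers: a method that packs a file down to 50 MB in 20 seconds and unpacks it in 5 seconds gives a total time of 30 seconds over a 10 MB/s link:

    # Purely illustrative numbers for t = t_c + s/bw + t_d.
    t_c, s, bw, t_d = 20.0, 50e6, 10e6, 5.0   # seconds, bytes, bytes/s, seconds
    print(t_c + s / bw + t_d)                 # 20 + 5 + 5 = 30 seconds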

This equation defines a line of constant total time on the charts above, whose slope is determined by the network bandwidth. To determine the optimal compression method for a given bandwidth, one has to find the lowest line with that slope that still passes through some point of the graph.

In other words, the optimal compression methods, and the bandwidths at which the optimum switches from one method to another, are defined by the lines tangent to our set of points on the graph. All that remains is to find them.

What we are looking for is nothing other than the convex hull. Numerous algorithms have been developed for constructing the convex hull of a set of points; I used a ready-made implementation in Python.

After constructing the convex hull, the points belonging to it are our optimal methods, and the slopes of the edges connecting them are the boundary bandwidths.
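Here is a rough sketch of this selection in Python (not the implementation mentioned above): it builds the lower-left part of the convex hull of the (time, size) points with the monotone chain method and reads the boundary bandwidths off the slopes of its edges. The method names, times, and sizes in the example are made up, not the measured values.

    def scenario1_optima(points):
        """points: list of (name, time_s, size_bytes), where time is the total
        pack + unpack time. Returns [(name, bw_low, bw_high)]: the range of
        network bandwidths (bytes/s) for which each method minimizes
        time + size / bw."""
        pts = sorted(points, key=lambda p: (p[1], p[2]))   # sort by time, then size
        hull = []                                          # lower convex hull (monotone chain)
        for name, t, s in pts:
            while len(hull) >= 2:
                _, t1, s1 = hull[-2]
                _, t2, s2 = hull[-1]
                # pop the middle point if the three points do not make a convex turn
                if (t2 - t1) * (s - s1) - (s2 - s1) * (t - t1) <= 0:
                    hull.pop()
                else:
                    break
            hull.append((name, t, s))
        # keep only the lower-left part, where the compressed size keeps decreasing
        cand = [hull[0]]
        for p in hull[1:]:
            if p[2] < cand[-1][2]:
                cand.append(p)
        # boundary bandwidth between neighbours = size difference / time difference
        result, hi = [], float("inf")
        for (n0, t0, s0), (_, t1, s1) in zip(cand, cand[1:]):
            lo = (s0 - s1) / (t1 - t0)
            result.append((n0, lo, hi))
            hi = lo
        result.append((cand[-1][0], 0.0, hi))
        return result

    # Hypothetical example: a 100 MB file; times are pack + unpack seconds.
    methods = [
        ("no compression", 0.0, 100_000_000),
        ("lzop -3",        1.0,  60_000_000),
        ("gzip -6",        6.0,  40_000_000),
        ("xz -6",         70.0,  30_000_000),
    ]
    for name, lo, hi in scenario1_optima(methods):
        print(f"{name}: optimal for bandwidths from {lo/1e6:.2f} to {hi/1e6:.2f} MB/s")

For these made-up numbers the sketch reports, for instance, that no compression is optimal above 40 MB/s and lzop -3 between 4 and 40 MB/s.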

Thus, we get the following optimal algorithms for our files:
[figure]
Analysis: Scenario 2 - streaming compression

Now let us consider another scenario: the data is packed, and as soon as a portion is packed it is sent over the network and unpacked right away on the receiving side.

Then the total transfer time is determined by the maximum of three times: packing, sending over the network, and unpacking, i.e. t = max(t_c, s/bw, t_d).

The curve of constant time for a chosen network bandwidth looks like an angle whose vertex lies on the line

t = s/bw

with two rays going down and to the left from the vertex.
The algorithm that finds all the optimal compression methods and the boundary bandwidths is much simpler in this case. First we sort the points by time and take the first optimum (this is "no compression", with zero time; no compression is optimal for an infinite network bandwidth). Then we iterate through the points in ascending order of compression time: if a point's file size is smaller than that of the previous optimum, the current method becomes the new optimum, and the boundary bandwidth equals the file size of the previous method divided by the compression time of the new method. A sketch of this procedure is given below.
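A minimal sketch of this procedure in Python, again with made-up numbers rather than the measured ones, and considering only the packing side (as in the description above):

    def streaming_optimum(points):
        """points: list of (name, compression_time_s, compressed_size_bytes),
        including ("no compression", 0.0, original_size).
        Returns [(name, boundary_bw)]: each optimal method together with the
        bandwidth (bytes/s) below which it takes over from the previous one."""
        points = sorted(points, key=lambda p: p[1])     # ascending compression time
        best_name, _, best_size = points[0]             # optimal for an infinite bandwidth
        result = [(best_name, float("inf"))]
        for name, t_c, size in points[1:]:
            if size < best_size:                        # compresses better than the current optimum
                result.append((name, best_size / t_c))  # previous size / new compression time
                best_size = size
        return result

    # Hypothetical example: a 100 MB file and made-up compression times.
    methods = [
        ("no compression", 0.0, 100_000_000),
        ("lzop -3",        0.7,  60_000_000),
        ("gzip -6",        4.0,  40_000_000),
        ("xz -6",         60.0,  30_000_000),
    ]
    for name, bw in streaming_optimum(methods):
        if bw == float("inf"):
            print(f"{name}: optimal for very fast links")
        else:
            print(f"{name}: becomes optimal below {bw/1e6:.2f} MB/s")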

Here are the results for this scenario:
[figure]

Final remarks


Everything written above applies only to scenarios involving the transfer of large amounts of data; it does not apply to other scenarios, such as sending many small messages.
This study is only a model and does not claim to answer all the questions that arise in real life. Among the things it leaves out of account:

Disk I/O.
The availability and possible use of multiple processor cores.
Other load on the processor when the server is in use at the same time, which leaves fewer resources for compression.

Google's snappy was not included in this study, for several reasons.

First, it does not have a native command-line utility (there is snzip, but I do not think it can be considered any kind of standard).
Second, it is intended for other uses.
Third, it promises to be even faster than lzop, so it would be very difficult to measure its running time correctly.

Practical examples


It is well known that one of the fastest channels for delivering large amounts of information over short distances is a bicycle courier with hard drives :)

Let us do some math. Suppose the distance is 10 kilometers. The bicycle courier makes a round trip in an hour and a half (at a normal pace, taking into account delays, loading, and unloading). Say the courier makes 6 round trips per 8-hour workday, transporting 10 disks of 1 TB each. That makes 60 TB in total per workday. This is equivalent to a round-the-clock link with a bandwidth of 60 TB / (24 * 3600 s) = 662 MB/s, which is more than 5 Gbit/s. Not bad at all.
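The arithmetic, for the record (taking 1 TB as 10^12 bytes and 1 MB as 2^20 bytes):

    # Reproducing the courier arithmetic: 6 trips x 10 disks x 1 TB per workday.
    data_per_day = 6 * 10 * 10**12            # bytes moved in one 8-hour workday
    bw = data_per_day / (24 * 3600)           # averaged over a full 24-hour day
    print(bw / 2**20)                         # ~662 MB/s
    print(bw * 8 / 10**9)                     # ~5.6 Gbit/s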

According to the results obtained, if you tried to compress data for such a link using a single core similar to the ones in my laptop, the best choice would be no compression at all.

And if we consider transferring data to a single USB 2.0 hard drive, it is similar to scenario 2 at a rate of 30 MB/s, if we take into account only the time spent on the transmitting side.

Looking at our graph, we get that lzop -3 is optimal for moving our test data sets to an external USB hard drive.

If we first compress and then copy to the external drive, this is scenario 1; then the optimum for the source code and the binary distribution is lzop -1, and for the poorly compressible binary data it is no compression at all.

Here are some additional links


A similar comparison:
http://stephane.lesimple.fr/wiki/blog/lzop_vs_compress_vs_gzip_vs_bzip2_vs_lzma_vs_lzma2-xz_benchmark_reloaded

Another comparison:
http://www.linuxjournal.com/node/8051/
Siera, 13 April 2012, 18:27