Why pigz freakin' rock(s)

I quite often need to copy large text files over the network. Usually one would go with gzip, either transparently by using scp's compression switch (scp -C) or by compressing the file before pushing it over the wire.

Turns out gzip compresses quite well, but it won't saturate your 100 Mbit line if you do something like this:

cat bigfile.txt | gzip -c | ssh me@other.side 'cat | gunzip -d > bigfile.txt'

This has the same end result as scp bigfile.txt me@other.side: but spelling the pipeline out makes it easier to understand the alternatives coming up next.

On a pretty decent machine (Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz) I could get about 60% saturation of the network link. So, what can we do? We could lower the compression level to take load off the CPU and shift it towards the network. We could also use an alternative compression algorithm such as LZO. "lzop" is a free implementation available in most common Linux distributions, so this might be the easiest way to go:

cat bigfile.txt | lzop -c | ssh me@other.side 'cat | lzop -cd > bigfile.txt'
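The first option, lowering the compression level, is easy to try locally before involving the network at all. Here is a minimal sketch, assuming a generated sample file stands in for bigfile.txt (the file and the variable names are made up for illustration, and your real data will compress differently):

```shell
# Generate a compressible sample text file (hypothetical stand-in for bigfile.txt).
sample=$(mktemp)
seq 1 200000 > "$sample"

# Size in bytes: uncompressed, at gzip's fastest level (-1) and at its default level (-6).
size_orig=$(wc -c < "$sample")
size_fast=$(gzip -1 -c < "$sample" | wc -c)
size_default=$(gzip -6 -c < "$sample" | wc -c)

echo "original: $size_orig bytes"
echo "gzip -1:  $size_fast bytes"
echo "gzip -6:  $size_default bytes"

rm -f "$sample"
```

If the faster level looks good enough on your data, put gzip -1 -c in place of gzip -c in the transfer pipeline.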

My initial tests showed that LZO's compression ratio is about 20% lower than gzip's with default settings. On the other hand, transfer time was almost cut in half.

So, how do we get 100% network saturation *AND* great compression? Pigz is a parallel gzip implementation, so instead of maxing out only one thread as gzip does, it will use all the available cores and threads your fancy server provides. Downside? No Debian stable repository packages available yet. But on the other hand: it does not even require a configure script, just one header file and two ".c" files. Most probably your remote connection to the server will take longer to refresh the console output than the compilation process itself.

So install "pv" (emerge, apt-get install, port install or whatever) and have some fun like this:

me@host:/mnt/data1/import$ cat bigfile.txt | pv | pigz -c | ssh me@otherhost 'cat | unpigz > /mnt/data1/bigfile.txt'
1.83GB 0:00:18 [95.3MB/s] [ <=> ]

Some more numbers: gzip gives me 45 MB/s and lzop 60 MB/s.
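One practical detail: pigz writes ordinary gzip format, so the receiving side does not strictly need unpigz; stock gzip -cd can read the stream. The following is a rough local check of that, assuming a generated sample file (hypothetical, like the variable names) and falling back to plain gzip when pigz is not installed:

```shell
# Pick pigz when it is installed, otherwise fall back to plain gzip.
if command -v pigz >/dev/null 2>&1; then
    compressor=pigz
else
    compressor=gzip
fi

# Generate a sample text file (hypothetical stand-in for bigfile.txt).
sample=$(mktemp)
seq 1 200000 > "$sample"

# Compress with the chosen tool, decompress with stock gzip, compare checksums.
sum_before=$(cksum < "$sample")
sum_after=$("$compressor" -c < "$sample" | gzip -cd | cksum)

echo "compressor: $compressor"
echo "before: $sum_before"
echo "after:  $sum_after"

rm -f "$sample"
```

So if only the sending machine is the CPU bottleneck, compiling pigz there alone may already be enough.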

"pv" stands for pipe viewer; it is responsible for the nice stats during the transfer. And yes, this *is* a 100 Mbit network connection pushing a text file at ~100 MB/s (pv sits in front of pigz in the pipeline, so it counts the uncompressed data). Nice, isn't it?
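For the sceptics, here is the back-of-the-envelope arithmetic behind that claim as a small sketch. It takes the nominal link speed and the rate pv reported, and ignores protocol overhead (the variable names are just for illustration):

```shell
# Figures from the pv output above; LC_ALL=C keeps the decimal point stable.
link_mbit=100      # nominal link speed in Mbit/s
pv_rate=95.3       # uncompressed throughput reported by pv, in MB/s

# 8 bits per byte: what the wire can carry in MB/s.
wire_mb=$(LC_ALL=C awk -v link="$link_mbit" 'BEGIN { printf "%.1f", link / 8 }')

# Compression ratio needed to fit pv's rate into that capacity.
ratio=$(LC_ALL=C awk -v rate="$pv_rate" -v wire="$wire_mb" 'BEGIN { printf "%.1f", rate / wire }')

echo "wire capacity: ${wire_mb} MB/s"
echo "compression ratio needed: at least ${ratio}:1"
```

In other words, the text file has to shrink to roughly an eighth of its size for these numbers to work out, which is not unusual for plain text.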

You can find pigz over here

Comments

buntklicker.de

Too many cats in the house

I'm sure it won't help much saturating your network connection, but I think

cat bigfile.txt | gzip -c | ssh me@other.side 'cat | gzip -cd > bigfile.txt'

has two cats too many.

gzip -c < bigfile.txt | ssh me@other.side 'gzip -cd > bigfile.txt'

works just as well and avoids two unnecessary processes, one on each side.

"cat" is the most over-used utility of all, because you almost never need it: Only when you actually want to concatenate stuff, or when you have no other processes. When you feel the urge to say "cat onefile | sometool", restrain yourself and say "sometool < onefile" instead. :-)

pulsar

*like*
