Thursday, February 11, 2016

To ZIP or not to ZIP, that is the (web archiving) question

This post looks at the question: Do you use uncompressed (W)ARC files? Many files on the Internet are already compressed and there is "little additional benefit gained from compressing these files again (it may even increase the size very slightly)." For other files, such as text, tremendous storage savings can be realized using compression, usually about 60% of the uncompressed size. Compression has an effect on disk or network access and on memory. But "the additional overhead of compressing a file, as it is written to disk, is trivial."
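
The point about compressing twice can be checked with a small experiment. This is only a sketch: it uses Python's zlib on synthetic text built from a made-up word list, not real (W)ARC files, so the exact sizes are illustrative and will differ from what an archive would see.

```python
import random
import zlib

# Synthetic stand-in for text content (assumption: a made-up word list,
# not data taken from a real crawl).
WORDS = ["archive", "crawl", "record", "header", "payload", "request",
         "response", "web", "page", "link", "server", "browser"]
rng = random.Random(0)
text = " ".join(rng.choice(WORDS) for _ in range(20000)).encode("ascii")

once = zlib.compress(text)    # text compresses well
twice = zlib.compress(once)   # already-compressed bytes barely change in size

print("uncompressed:    ", len(text))
print("compressed once: ", len(once))
print("compressed twice:", len(twice))
```

The first pass removes most of the redundancy, so the second pass has almost nothing left to work with and only adds its own framing overhead.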

On the access side, the bottleneck is disk access, but "compression can actually help!" It can save time and money, and performance is barely affected. One exception may be HTTP Range Requests: to serve a range from a compressed WARC record, the payload would have to be decompressed from the start until the requested item is reached. A hybrid approach may be the best solution: "compress everything except files whose content type indicates an already compressed format." This would also avoid a lot of unneeded compression and decompression.
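
The hybrid rule quoted above could look roughly like the following. This is a minimal sketch, not a WARC writer: the names `ALREADY_COMPRESSED_TYPES`, `should_compress` and `store_payload` are hypothetical, the list of content types is an assumption rather than something taken from the original post, and it gzips a bare payload instead of a full WARC record.

```python
import gzip
from typing import Tuple

# Assumption: an illustrative, incomplete list of media types whose
# payloads are usually compressed already.
ALREADY_COMPRESSED_TYPES = {
    "image/jpeg", "image/png", "image/gif",
    "application/zip", "application/gzip", "application/x-gzip",
}
# Assumption: treat all audio and video as compressed. Some formats,
# such as WAV, are not, so a real policy would need to be finer-grained.
ALREADY_COMPRESSED_PREFIXES = ("video/", "audio/")


def should_compress(content_type: str) -> bool:
    """Return False when the content type indicates a compressed format."""
    # Drop parameters such as "; charset=UTF-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type in ALREADY_COMPRESSED_TYPES:
        return False
    return not media_type.startswith(ALREADY_COMPRESSED_PREFIXES)


def store_payload(payload: bytes, content_type: str) -> Tuple[bytes, bool]:
    """Return the bytes to store and a flag saying whether they were gzipped."""
    if should_compress(content_type):
        return gzip.compress(payload), True
    return payload, False


stored, was_compressed = store_payload(b"<html>hello</html>" * 100,
                                       "text/html; charset=UTF-8")
print(was_compressed, len(stored))
```

Keeping the decision in one small function means the list of exempt content types can be tuned without touching the code that writes records.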

