Zlib, Snappy, and LZO for ORC

The default compression algorithm for ORC is Zlib, which is the best choice in most cases. ORC also provides built-in support for Snappy and LZO, so the user does not have to install native libraries for these codecs. To override the default compression algorithm, the user can set the orc.compress table property with the TBLPROPERTIES keyword when creating an ORC table, as in:

... 
STORED AS ORC TBLPROPERTIES("orc.compress"="SNAPPY")

Using Snappy for other formats

To use Snappy for other formats (e.g., SequenceFile), the user should install a native library for Snappy. The user can either include it when creating the Docker image or install it manually inside the HiveServer2, DAGAppMaster, and ContainerWorker Pods as follows:

$ yum install snappy.x86_64
$ cp /usr/lib64/libsnappy.so.1 /opt/mr3-run/mr3/mr3lib

The user should set the following configuration keys either in kubernetes/conf/hive-site.xml or inside Beeline connections:

  • hive.exec.compress.intermediate to true
  • hive.intermediate.compression.codec to org.apache.hadoop.io.compress.SnappyCodec
  • hive.intermediate.compression.type to BLOCK
  • hive.exec.compress.output to true
  • mapred.output.compression.codec to org.apache.hadoop.io.compress.SnappyCodec
  • mapred.output.compression.type to BLOCK
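
As an illustration, the keys above could be written in kubernetes/conf/hive-site.xml roughly as follows. This is a minimal sketch that assumes the entries are merged into the existing <configuration> element of the file; the values are exactly those listed above.

```xml
<!-- Compress intermediate data with Snappy -->
<property>
  <name>hive.exec.compress.intermediate</name>
  <value>true</value>
</property>
<property>
  <name>hive.intermediate.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
  <name>hive.intermediate.compression.type</name>
  <value>BLOCK</value>
</property>

<!-- Compress final query output with Snappy -->
<property>
  <name>hive.exec.compress.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
</property>
```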

Using LZO for other formats

To use LZO for other formats (e.g., SequenceFile), the user should install a native library for LZO. The user can either include it when creating the Docker image or install it manually inside the HiveServer2, DAGAppMaster, and ContainerWorker Pods as follows:

$ yum install -y lzo.x86_64 lzop.x86_64 wget
$ cp /usr/lib64/liblz4.so.1 /opt/mr3-run/hadoop/apache-hadoop/lib/native/ 
$ cp /usr/lib64/liblzo2.so.2 /opt/mr3-run/hadoop/apache-hadoop/lib/native/
$ wget https://maven.twttr.com/com/hadoop/gplcompression/hadoop-lzo/0.4.20/hadoop-lzo-0.4.20.jar
$ cp hadoop-lzo-0.4.20.jar /opt/mr3-run/mr3/mr3lib/

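Then the user should update kubernetes/conf/core-site.xml in two ways: append com.hadoop.compression.lzo.LzoCodec and com.hadoop.compression.lzo.LzopCodec to the value of the configuration key io.compression.codecs, and set the configuration key io.compression.codec.lzo.class to com.hadoop.compression.lzo.LzoCodec.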

$ vi kubernetes/conf/core-site.xml

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>

<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

The user should also set the following configuration keys either in kubernetes/conf/hive-site.xml or inside Beeline connections:

  • hive.exec.compress.output to true
  • mapred.output.compression.codec to com.hadoop.compression.lzo.LzoCodec
  • mapred.output.compression.type to BLOCK
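
As an illustration, the three keys above could be written in kubernetes/conf/hive-site.xml roughly as follows. This is a minimal sketch that assumes the entries are merged into the existing <configuration> element of the file; the values are exactly those listed above.

```xml
<!-- Compress final query output with LZO -->
<property>
  <name>hive.exec.compress.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
</property>
```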