Zlib, Snappy, and LZO for ORC
The default compression algorithm for ORC is Zlib, which is the best choice in most cases.
ORC also provides built-in support for Snappy and LZO,
so the user does not have to install native libraries.
The user can override the default compression algorithm when creating ORC tables with the TBLPROPERTIES
keyword, as in:
...
STORED AS ORC TBLPROPERTIES("orc.compress"="SNAPPY")
Using Snappy for other formats
In order to use Snappy for other formats (e.g., SequenceFile), the user should install a native library for Snappy. The user can either include it when creating the Docker image or install it manually inside HiveServer2/DAGAppMaster/ContainerWorker Pods as follows:
$ yum install snappy.x86_64
$ cp /usr/lib64/libsnappy.so.1 /opt/mr3-run/mr3/mr3lib
The user should set the following configuration keys either in kubernetes/conf/hive-site.xml
or inside Beeline connections:
hive.exec.compress.intermediate to true
hive.intermediate.compression.codec to org.apache.hadoop.io.compress.SnappyCodec
hive.intermediate.compression.type to BLOCK
hive.exec.compress.output to true
mapred.output.compression.codec to org.apache.hadoop.io.compress.SnappyCodec
mapred.output.compression.type to BLOCK
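As an illustration, the Snappy settings above could be written as the following entries in kubernetes/conf/hive-site.xml. This is only a sketch: it assumes the entries are placed inside the file's existing <configuration> element, and it uses exactly the keys and values listed above.

```xml
<!-- Sketch: assumes these entries go inside the existing <configuration> element -->

<!-- Compress intermediate data with Snappy -->
<property>
  <name>hive.exec.compress.intermediate</name>
  <value>true</value>
</property>
<property>
  <name>hive.intermediate.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
  <name>hive.intermediate.compression.type</name>
  <value>BLOCK</value>
</property>

<!-- Compress final query output with Snappy -->
<property>
  <name>hive.exec.compress.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
</property>
```

Setting the keys in hive-site.xml applies them to every connection, whereas setting them inside a Beeline connection limits them to that session.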
Using LZO for other formats
In order to use LZO for other formats (e.g., SequenceFile), the user should install a native library for LZO. The user can either include it when creating the Docker image or install it manually inside HiveServer2/DAGAppMaster/ContainerWorker Pods as follows:
$ yum install -y lzo.x86_64 lzop.x86_64 wget
$ cp /usr/lib64/liblz4.so.1 /opt/mr3-run/hadoop/apache-hadoop/lib/native/
$ cp /usr/lib64/liblzo2.so.2 /opt/mr3-run/hadoop/apache-hadoop/lib/native/
$ wget https://maven.twttr.com/com/hadoop/gplcompression/hadoop-lzo/0.4.20/hadoop-lzo-0.4.20.jar
$ cp hadoop-lzo-0.4.20.jar /opt/mr3-run/mr3/mr3lib/
Then the user should update kubernetes/conf/core-site.xml
to extend the value for the configuration key io.compression.codecs
and set the configuration key io.compression.codec.lzo.class.
$ vi kubernetes/conf/core-site.xml
<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
<name>io.compression.codec.lzo.class</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
The user should also set the following configuration keys either in kubernetes/conf/hive-site.xml
or inside Beeline connections:
hive.exec.compress.output to true
mapred.output.compression.codec to com.hadoop.compression.lzo.LzoCodec
mapred.output.compression.type to BLOCK
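For reference, the LZO output settings above might look like this in kubernetes/conf/hive-site.xml. As a sketch, it assumes the entries are added inside the file's existing <configuration> element and that the hadoop-lzo jar and native libraries have already been installed as described earlier.

```xml
<!-- Sketch: assumes these entries go inside the existing <configuration> element -->

<!-- Compress final query output with LZO -->
<property>
  <name>hive.exec.compress.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
</property>
```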