Access to S3
Configuring access to S3
A few configuration keys significantly affect performance
when accessing S3 (or S3-compatible) storage.
We recommend setting at least the following configuration keys
in core-site.xml and hive-site.xml.
Note that hive.mv.files.thread is placed in core-site.xml rather than hive-site.xml
because it is useful only for S3.
vi core-site.xml
<property>
  <name>fs.s3a.connection.maximum</name>
  <value>2000</value>
</property>
<property>
  <name>fs.s3.maxConnections</name>
  <value>2000</value>
</property>
<property>
  <name>fs.s3a.threads.max</name>
  <value>100</value>
</property>
<property>
  <name>fs.s3a.threads.core</name>
  <value>100</value>
</property>
<!-- S3 write performance -->
<property>
  <name>hive.mv.files.thread</name>
  <value>15</value>
</property>
<property>
  <name>fs.s3a.max.total.tasks</name>
  <value>5</value>
</property>
<property>
  <name>fs.s3a.blocking.executor.enabled</name>
  <value>false</value>
</property>
<!-- S3 input listing -->
<property>
  <name>mapreduce.input.fileinputformat.list-status.num-threads</name>
  <value>50</value>
</property>
vi hive-site.xml
<!-- S3 input listing -->
<property>
  <name>hive.exec.input.listing.max.threads</name>
  <value>50</value>
</property>
<!-- MSCK (Metastore Check) -->
<property>
  <name>hive.metastore.fshandler.threads</name>
  <value>30</value>
</property>
<property>
  <name>hive.msck.repair.batch.size</name>
  <value>3000</value>
</property>
<!-- dynamic partition query -->
<property>
  <name>hive.load.dynamic.partitions.thread</name>
  <value>25</value>
</property>
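To check that a configuration file actually contains these values, one can compare it against the recommended keys. The following is a minimal sketch in Python and is not part of Hive on MR3: the helper names (`read_properties`, `diff_against`) and the sample file content are hypothetical, and it assumes the usual Hadoop layout in which `<property>` elements sit under a `<configuration>` root element. Only the hive-site.xml keys listed above are covered.

```python
import xml.etree.ElementTree as ET

# Recommended values copied from the hive-site.xml snippet above.
RECOMMENDED_HIVE_SITE = {
    "hive.exec.input.listing.max.threads": "50",
    "hive.metastore.fshandler.threads": "30",
    "hive.msck.repair.batch.size": "3000",
    "hive.load.dynamic.partitions.thread": "25",
}

# Hypothetical file content, used only to demonstrate the check.
SAMPLE_HIVE_SITE = """<configuration>
  <property>
    <name>hive.exec.input.listing.max.threads</name>
    <value>50</value>
  </property>
  <property>
    <name>hive.msck.repair.batch.size</name>
    <value>0</value>
  </property>
</configuration>"""


def read_properties(xml_text):
    """Return {name: value} for every <property> element in a Hadoop-style XML document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def diff_against(recommended, actual):
    """Return {key: (actual_value_or_None, recommended_value)} for keys that are missing or differ."""
    return {
        key: (actual.get(key), wanted)
        for key, wanted in recommended.items()
        if actual.get(key) != wanted
    }


if __name__ == "__main__":
    report = diff_against(RECOMMENDED_HIVE_SITE, read_properties(SAMPLE_HIVE_SITE))
    for key, (actual, wanted) in report.items():
        print(f"{key}: found {actual!r}, recommended {wanted!r}")
```

To check a real file, read its text from disk and pass it to `read_properties` in place of `SAMPLE_HIVE_SITE`. The same approach applies to core-site.xml with a second dictionary of recommended values.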
Vectorized reading in ORC 2
As Hive on MR3 uses ORC 2,
the user can adjust configuration keys for vectorized reading from S3.
For example,
increasing fs.s3a.vectored.read.min.seek.size (default 4K)
and fs.s3a.vectored.read.max.merged.size (default 1M) typically results in
fewer S3 requests (such as s3.GetObject operations), each returning more data.
vi core-site.xml
<property>
  <name>fs.s3a.vectored.read.min.seek.size</name>
  <value>512K</value>
</property>
<property>
  <name>fs.s3a.vectored.read.max.merged.size</name>
  <value>4M</value>
</property>
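The effect of these two keys can be illustrated with a small model of range coalescing. The sketch below is a simplified approximation in Python and is not the actual Hadoop or ORC code: it assumes that two neighbouring byte ranges are merged into one request when the gap between them is smaller than the minimum seek size and the merged request does not exceed the maximum merged size. The function names and the layout (ten 64 KB chunks separated by 100 KB gaps) are made up for illustration.

```python
def parse_size(text):
    """Convert a size such as '512K' or '4M' into bytes (K = 1024, M = 1024 * 1024)."""
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    text = text.strip().upper()
    if text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)


def merge_ranges(ranges, min_seek, max_merged):
    """Coalesce (offset, length) ranges into the requests a reader would issue.

    Simplified model (an assumption, not Hadoop's implementation): a range is
    merged into the previous request when the gap is smaller than min_seek and
    the merged request stays within max_merged bytes.
    """
    merged = []  # list of (start, end) pairs, end exclusive
    for offset, length in sorted(ranges):
        end = offset + length
        if merged:
            cur_start, cur_end = merged[-1]
            gap = offset - cur_end
            if gap < min_seek and end - cur_start <= max_merged:
                merged[-1] = (cur_start, max(cur_end, end))
                continue
        merged.append((offset, end))
    return [(start, end - start) for start, end in merged]


# Hypothetical layout: ten 64 KB chunks, each followed by a 100 KB gap.
KB = 1024
CHUNKS = [(i * 164 * KB, 64 * KB) for i in range(10)]

default_requests = merge_ranges(CHUNKS, parse_size("4K"), parse_size("1M"))
tuned_requests = merge_ranges(CHUNKS, parse_size("512K"), parse_size("4M"))

if __name__ == "__main__":
    print("requests with defaults:", len(default_requests))
    print("requests with 512K/4M: ", len(tuned_requests))
```

In this made-up layout, the default values leave each chunk as a separate request, whereas the larger values collapse all ten chunks into a single request of about 1.5 MB. The single request also reads the 900 KB of gap bytes that the query does not need, which is the trade-off to keep in mind when raising these values.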