Scheduling
This page explains how to set up a scheduling policy for Hive on MR3. See DAG/Task Scheduling for an introduction.
To meet the needs of a particular environment,
the user may need to adjust the following configuration keys (all in mr3-site.xml).
- mr3.dag.queue.scheme for assigning DAGs to Task queues
- mr3.dag.priority.scheme for assigning DAG priorities
- mr3.vertex.priority.scheme for updating Vertex priorities
- mr3.taskattempt.queue.scheme for choosing a scheme for scheduling Tasks
For capacity scheduling,
two other configuration keys mr3.dag.queue.capacity.specs and mr3.dag.queue.name are used.
The user can instead use hive.mr3.dag.queue.capacity.specs and hive.mr3.dag.queue.name
in hive-site.xml,
which are mapped to mr3.dag.queue.capacity.specs and mr3.dag.queue.name, respectively.
Common settings
For mr3.vertex.priority.scheme,
the default value postorder is usually the best choice.
Setting it to normalize can be useful
for allocating a roughly (but not perfectly) fair share of cluster resources to each query.
For mr3.taskattempt.queue.scheme,
the default value indexed is usually the best choice.
When LLAP I/O is enabled,
setting it to strict is recommended,
especially in public cloud environments where reading input data can be slow.
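As a minimal sketch, the common settings above might appear in mr3-site.xml as follows. Both values are the defaults named in this section, so listing them explicitly is only for illustration.

```xml
<configuration>
  <!-- Default value; switch to "normalize" for a roughly fair share per query. -->
  <property>
    <name>mr3.vertex.priority.scheme</name>
    <value>postorder</value>
  </property>
  <!-- Default value; consider "strict" when LLAP I/O is enabled. -->
  <property>
    <name>mr3.taskattempt.queue.scheme</name>
    <value>indexed</value>
  </property>
</configuration>
```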
For batch-only environments
Since throughput is typically the primary concern for batch workloads,
set mr3.dag.queue.scheme to common and mr3.dag.priority.scheme to fifo.
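The batch-only recommendation can be sketched as the following fragment of mr3-site.xml, using only the keys and values named above.

```xml
<configuration>
  <!-- All DAGs share a single Task queue. -->
  <property>
    <name>mr3.dag.queue.scheme</name>
    <value>common</value>
  </property>
  <!-- DAGs are prioritized in submission order, favoring throughput. -->
  <property>
    <name>mr3.dag.priority.scheme</name>
    <value>fifo</value>
  </property>
</configuration>
```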
For interactive-only environments
If each query should be allocated a strictly fair share of cluster resources,
set mr3.dag.queue.scheme to individual.
In this case, mr3.dag.priority.scheme can be ignored.
If not, set mr3.dag.queue.scheme to common and follow these recommendations:
- Set mr3.dag.priority.scheme to concurrent to minimize turnaround time.
- If every user submits queries of similar characteristics, mr3.dag.priority.scheme can be set to fifo to maximize throughput.
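For the case where a strictly fair share is not required, a sketch of the corresponding fragment of mr3-site.xml is shown below. The comments note the alternatives described in this section.

```xml
<configuration>
  <!-- Use "individual" instead if each query must receive a strictly fair share;
       in that case mr3.dag.priority.scheme can be ignored. -->
  <property>
    <name>mr3.dag.queue.scheme</name>
    <value>common</value>
  </property>
  <!-- Use "fifo" instead if all users submit queries of similar characteristics. -->
  <property>
    <name>mr3.dag.priority.scheme</name>
    <value>concurrent</value>
  </property>
</configuration>
```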
For mixed environments
It is best to enable capacity scheduling,
with mr3.dag.queue.scheme set to capacity,
where batch queries are routed to a Task queue with the lowest priority.
See DAG/Task Scheduling
for examples of setting mr3.dag.queue.capacity.specs (or hive.mr3.dag.queue.capacity.specs)
to configure capacity scheduling.
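A minimal sketch of enabling capacity scheduling in mr3-site.xml is shown below. It deliberately omits mr3.dag.queue.capacity.specs, whose value format is covered in DAG/Task Scheduling and must be supplied separately.

```xml
<configuration>
  <!-- Enables capacity scheduling. The Task queues themselves are defined by
       mr3.dag.queue.capacity.specs (or hive.mr3.dag.queue.capacity.specs),
       which is not shown here. -->
  <property>
    <name>mr3.dag.queue.scheme</name>
    <value>capacity</value>
  </property>
</configuration>
```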
With capacity scheduling,
the user can set the configuration key hive.mr3.dag.queue.name to designate the Task queue
for each individual query.
In a cooperative environment where every user is allowed to use any Task queue, a single instance of HiveServer2 is sufficient.
In a more restrictive environment where each Task queue is associated with a certain level of privilege,
multiple instances of HiveServer2 are required.
The administrator should create a separate instance of HiveServer2 for each Task queue,
each configured with a fixed value for hive.mr3.dag.queue.name.
To enforce access control,
ordinary users should not be allowed to override the value of hive.mr3.dag.queue.name.
This can be easily achieved by including hive.mr3.dag.queue.name
in the value of the configuration key hive.conf.restricted.list.
Alternatively,
the user can implement a custom Hive hook that inspects the value of hive.mr3.dag.queue.name.
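The restricted-list approach can be sketched as the following fragment of hive-site.xml for one HiveServer2 instance. The queue name batch is hypothetical and should match a Task queue defined in the capacity specs. The restricted list shown here contains only the one key for illustration; setting hive.conf.restricted.list in hive-site.xml may replace the list already in effect, so in practice the key should probably be appended to the existing value.

```xml
<configuration>
  <!-- Fixed Task queue for this HiveServer2 instance.
       "batch" is a hypothetical queue name used only for illustration. -->
  <property>
    <name>hive.mr3.dag.queue.name</name>
    <value>batch</value>
  </property>
  <!-- Prevents ordinary users from overriding the queue name at runtime.
       Illustrative value only: keep the entries your deployment already
       restricts and append this key to them. -->
  <property>
    <name>hive.conf.restricted.list</name>
    <value>hive.mr3.dag.queue.name</value>
  </property>
</configuration>
```

With this in place, an attempt by a user to change hive.mr3.dag.queue.name within a session should be rejected by HiveServer2.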
For mr3.dag.priority.scheme, follow the guideline for interactive-only environments.