Scheduling
This page provides a guide on setting up a scheduling policy for Hive on MR3. See DAG/Task Scheduling for an introduction.
To meet the needs of a particular environment, the user may need to adjust the following configuration keys (all in mr3-site.xml):

- mr3.dag.queue.scheme for assigning DAGs to Task queues
- mr3.dag.priority.scheme for assigning DAG priorities
- mr3.vertex.priority.scheme for updating Vertex priorities
- mr3.taskattempt.queue.scheme for choosing a scheme for scheduling Tasks
For capacity scheduling, two other configuration keys, mr3.dag.queue.capacity.specs and mr3.dag.queue.name, are used. The user can instead set hive.mr3.dag.queue.capacity.specs and hive.mr3.dag.queue.name in hive-site.xml, which are mapped to mr3.dag.queue.capacity.specs and mr3.dag.queue.name, respectively.
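Both files follow the standard Hadoop XML configuration format. The following sketch shows how an mr3.* key is set in mr3-site.xml and how a hive.mr3.* counterpart is set in hive-site.xml; the values are illustrative and the queue name batch is hypothetical.

```xml
<!-- mr3-site.xml: sketch of one scheduling key (illustrative value) -->
<property>
  <name>mr3.taskattempt.queue.scheme</name>
  <value>indexed</value>
</property>

<!-- hive-site.xml: a hive.mr3.* counterpart, mapped to mr3.dag.queue.name -->
<property>
  <name>hive.mr3.dag.queue.name</name>
  <value>batch</value>  <!-- hypothetical queue name -->
</property>
```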
Common settings
For mr3.vertex.priority.scheme, the default value postorder is usually the best choice. Setting it to normalize can be useful for allocating a roughly (but not perfectly) fair share of cluster resources to each query.

For mr3.taskattempt.queue.scheme, the default value indexed is usually the best choice. When LLAP I/O is enabled, setting it to strict is recommended, especially in public cloud environments where reading input data can be slow.
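As an illustration, the recommended settings for a deployment with LLAP I/O enabled in a public cloud would correspond to the following mr3-site.xml fragment (a sketch, not a mandatory configuration):

```xml
<!-- mr3-site.xml: sketch for a deployment with LLAP I/O enabled -->
<property>
  <name>mr3.vertex.priority.scheme</name>
  <value>postorder</value>   <!-- default; normalize gives a roughly fair share per query -->
</property>
<property>
  <name>mr3.taskattempt.queue.scheme</name>
  <value>strict</value>      <!-- recommended with LLAP I/O; the default is indexed -->
</property>
```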
For batch-only environments
Since throughput is typically the primary concern for batch workloads, set mr3.dag.queue.scheme to common and mr3.dag.priority.scheme to fifo.
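A minimal mr3-site.xml fragment for such a batch-only cluster might look as follows (a sketch based on the recommendation above):

```xml
<!-- mr3-site.xml: sketch for a batch-only cluster -->
<property>
  <name>mr3.dag.queue.scheme</name>
  <value>common</value>   <!-- all DAGs share a single Task queue -->
</property>
<property>
  <name>mr3.dag.priority.scheme</name>
  <value>fifo</value>     <!-- first-in first-out DAG priorities to maximize throughput -->
</property>
```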
For interactive-only environments
If each query should be allocated a strictly fair share of cluster resources, set mr3.dag.queue.scheme to individual. In this case, mr3.dag.priority.scheme can be ignored.

If not, set mr3.dag.queue.scheme to common and follow these recommendations:

- Set mr3.dag.priority.scheme to concurrent to minimize turnaround time.
- If every user submits queries of similar characteristics, mr3.dag.priority.scheme can be set to fifo to maximize throughput.
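The two options above correspond to the following mr3-site.xml sketches (choose one; the values are taken from the recommendations in this section):

```xml
<!-- mr3-site.xml, option 1: a strictly fair share of cluster resources per query -->
<property>
  <name>mr3.dag.queue.scheme</name>
  <value>individual</value>
</property>

<!-- mr3-site.xml, option 2: a common Task queue with concurrent DAG priorities
     to minimize turnaround time (use fifo instead if all queries are similar) -->
<property>
  <name>mr3.dag.queue.scheme</name>
  <value>common</value>
</property>
<property>
  <name>mr3.dag.priority.scheme</name>
  <value>concurrent</value>
</property>
```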
For mixed environments
It is best to enable capacity scheduling, with mr3.dag.queue.scheme set to capacity, where batch queries are routed to a Task queue with the lowest priority. See DAG/Task Scheduling for examples of setting mr3.dag.queue.capacity.specs (or hive.mr3.dag.queue.capacity.specs) to configure capacity scheduling.

With capacity scheduling, the user can set the configuration key hive.mr3.dag.queue.name to designate the Task queue for each individual query.
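For example, assuming a Task queue named interactive has been defined in mr3.dag.queue.capacity.specs (the name is hypothetical), a query can be routed to it from a Beeline session as follows, provided that hive.mr3.dag.queue.name is not restricted (see below):

```sql
-- Beeline: route subsequent queries in this session to the hypothetical Task queue "interactive"
set hive.mr3.dag.queue.name=interactive;
```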
In a cooperative environment where every user is allowed to use any Task queue, a single instance of HiveServer2 is sufficient.
In a more restrictive environment where each Task queue is associated with a certain level of privilege, multiple instances of HiveServer2 are required. The administrator should create a separate instance of HiveServer2 for each Task queue, each configured with a fixed value for hive.mr3.dag.queue.name.
To enforce access control, ordinary users should not be allowed to override the value of hive.mr3.dag.queue.name. This can be easily achieved by including hive.mr3.dag.queue.name in the value of the configuration key hive.conf.restricted.list.
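A sketch of the relevant hive-site.xml fragment for such a dedicated HiveServer2 instance is shown below; the queue name etl is hypothetical, and hive.mr3.dag.queue.name should be appended to any existing value of hive.conf.restricted.list rather than replacing it.

```xml
<!-- hive-site.xml for a HiveServer2 instance dedicated to the hypothetical queue "etl" -->
<property>
  <name>hive.mr3.dag.queue.name</name>
  <value>etl</value>
</property>

<!-- prevent ordinary users from overriding the queue name in their sessions;
     append to the existing value of hive.conf.restricted.list if one is already set -->
<property>
  <name>hive.conf.restricted.list</name>
  <value>hive.mr3.dag.queue.name</value>
</property>
```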
Alternatively, the user can implement a custom Hive hook that inspects the value of hive.mr3.dag.queue.name.
For mr3.dag.priority.scheme, follow the guideline for interactive-only environments.