Skip to main content

Configuring Tez Runtime

The behavior of Tez runtime is specified by the configuration file tez-site.xml in the classpath. MR3 inherits many configuration keys for Tez runtime from original Tez. For example, tez.runtime.io.sort.mb specifies the amount of memory required for sorting the output.

MR3 also introduces additional configuration keys which are specific to new features of MR3, and may interpret existing configuration keys in a different way.

Below we describe important configuration keys for Tez runtime as well as new configuration keys introduced in MR3.

VertexManager

NameDefault valueDescription
tez.shuffle-vertex-manager.enable.auto-parallelfalsetrue: enable auto parallelism for ShuffleVertexManager. false: disable auto parallelism. For more details, see Auto Parallelism.
tez.shuffle-vertex-manager.auto-parallel.min.num.tasks20Minimum number of Tasks to trigger auto parallelism. For example, if the value is set to 20, only those Vertexes with at least 20 Tasks are considered for auto parallelism. The user can effectively disable auto parallelism by setting this configuration key to a large value.
tez.shuffle-vertex-manager.auto-parallel.max.reduction.percentage10Percentage of Tasks that can be kept after applying auto parallelism. For example, if the value is set to 10, the number of Tasks can be reduced by up to 100 - 10 = 90 percent, thereby leaving 10 percent of Tasks.
tez.shuffle-vertex-manager.use-stats-auto-parallelismfalsetrue: analyze input statistics when applying auto parallelism. false: do not use input statistics.
tez.shuffle.vertex.manager.auto.parallelism.min.percent20Lower limit when normalizing input statistics. For example, if the value is set to 20, input statistics are normalized between 20 and 100. That is, an input size of zero is normalized to 20 while the maximum input size is mapped to 100.

Runtime

NameDefault valueDescription
tez.runtime.use.free.memory.fetched.inputfalsetrue: If the size of free memory exceeds the size of memory allocated to a single Task, fetchers use MemoryFetchedInput (for unordered data) and InMemoryMapOutput and (for ordered data) instead of spilling to local disks. false: Fetchers do not consider the size of free memory.
tez.runtime.use.free.memory.writer.outputfalsetrue: If the size of free memory exceeds the size of memory allocated to a single Task, tasks store their output in memory instead of writing to local disks. false: Tasks write their output to local disk. If set to true, set hive.mr3.delete.vertex.local.directory to true in hive-site.xml.
tez.runtime.pipelined-shuffle.enabledfalsetrue: Use pipelined shuffling. false: Do not use pipelined shuffling. If set to true/false, another configuration key tez.runtime.enable.final-merge.in.output is automatically set to false/true, respectively. Using speculative execution with pipelined shuffling is not recommended.
tez.runtime.pipelined.sorter.use.soft.referencefalsetrue: use soft references for ByteBuffers allocated in PipelinedSorter. These soft references are reused across TaskAttempts running in the same ContainerWorker. false: do not use soft references.
tez.runtime.transfer.data-via-events.enabledtruetrue: Embed unordered data directly in messages (of type DataMovementEvent). false: Do not embed. Effective only for Vertexes with a single output partition.
tez.runtime.transfer.data-via-events.max-size2048Maximum size of unordered data that can be directly embedded in messages and thus avoids creating fetchers.

Shuffle handlers

NameDefault valueDescription
tez.shuffle.connection-keep-alive.enablefalsetrue: keep connections alive for reuse. false: do not reuse.
tez.shuffle.max.threads0Number of threads for each shuffle handler. The default value of 0 sets the number of threads to 2 * the number of cores.
tez.shuffle.listen.queue.size128Size of the listening queue. Can be set to the value in /proc/sys/net/core/somaxconn.
tez.shuffle.port13563port number for shuffle handlers. If a ContainerWorker fails to secure a port number for a shuffle handler, it chooses a random port number instead.
tez.shuffle.mapoutput-info.meta.cache.size1000Size of meta data of the output of mappers.

ShuffleServer and fetchers

NameDefault valueDescription
tez.am.shuffle.auxiliary-service.idmapreduce_shuffleService ID for the external shuffle service. Set to tez_shuffle to use MR3 shuffle handlers. Must be set to tez_shuffle on Kubernetes and in standalone mode.
tez.shuffle.skip.verify.requestfalsetrue: MR3 shuffle handlers skip checking the validity of shuffle requests. false: MR3 shuffle handlers check the validity of shuffle requests. Effective only for MR3 shuffle handlers (with tez_shuffle).
tez.runtime.shuffle.keep-alive.enabledfalsetrue: keep connections alive for reuse in fetchers. false: do not reuse.
tez.runtime.shuffle.connect.timeout27500Maximum time in milliseconds for trying to connect to the shuffle service or the built-in shuffle handler before reporting fetch-failures. For more details, see Fault Tolerance.
tez.runtime.shuffle.parallel.copies20Maximum number of fetchers per LogicalInput. Note that a single RuntimeTask can create several LogicalInputs.
tez.runtime.shuffle.total.parallel.copies40Maximum number of fetchers per ContainerWorker.
tez.runtime.shuffle.fetch.max.task.output.at.once20Maximum number of task output files to fetch per fetch request. A large value can cause HTTP 400 errors.
tez.runtime.shuffle.max.input.hostports10000Maximum number of host-port combinations to cache for shuffling (to prevent memory-leak in public clouds with autoscaling).
tez.runtime.shuffle.ranges.schemefirstfirst: ShuffleServer selects randomly LogicalInput for shuffling. max (experimental): ShuffleServer selects LogicalInput with the most number of pending inputs.
tez.runtime.optimize.local.fetchtruetrue: Unordered data stored on local disks is directly read. false: Unordered data stored on local disks is read via fetchers. Automatically set to false when using memory-to-memory shuffling.
tez.runtime.optimize.local.fetch.orderedtruetrue: Ordered data stored on local disks is directly read. false: Ordered data stored on local disks is read via fetchers. Automatically set to false when using memory-to-memory shuffling.

Speculative fetching

NameDefault valueDescription
tez.runtime.shuffle.speculative.fetch.wait.millis30000Elapsed time threshold for a fetcher before triggering speculative fetching.
tez.runtime.shuffle.stuck.fetcher.threshold.millis3000Elapsed time threshold for a fetcher before triggering backpressure handling and blocking further connections to the shuffle handler.
tez.runtime.shuffle.stuck.fetcher.release.millis15000Elapsed time threshold after which backpressure handling is lifted, resuming the creation of fetchers that contact the previously blocked shuffle handler.
tez.runtime.shuffle.max.speculative.fetch.attempts2Maximum number of speculative fetchers for each fetch attempt.

Celeborn

The following configuration keys are effective when tez.celeborn.enabled is set to true and MR3 uses Celeborn as remote shuffle service. A configuration key of the form tez.celeborn.XXX.YYY is automatically converted to celeborn.XXX.YYY and passed to the Celeborn client.

NameDefault valueDescription
tez.celeborn.XXX.YYYConverted to celeborn.XXX.YYY to be read by Celeborn.
tez.runtime.celeborn.fetch.split.threshold1073741824Maximum size of data (in bytes) that a fetcher can receive from Celeborn workers. The default value is 1GB.
tez.runtime.celeborn.unordered.fetch.spill.enabledtruetrue: Reducers first write the output of mappers on local disks before processing. false: Reducers directly process the output of mappers fetched via unordered edges without writing to local disks.
tez.runtime.celeborn.client.fetch.throwsFetchFailuretruetrue: Throw Exceptions and thus triggers Task/Vertex reruns whenever fetch failures occur. false: Do not throw Exceptions.