Performance Evaluation of Trino, Spark, and Hive on MR3
Introduction
In this article, we evaluate the performance of the following systems.
- Trino 418 (released on May 17, 2023)
- Spark 3.4.0 (released on Apr 13, 2023)
- Hive 3.1.3 on MR3 1.7 (released on May 15, 2023)
Trino and Hive on MR3 use Java 17, while Spark uses Java 8. For Hive on MR3, we also report the result of using Java 8. We do not use the fault-tolerant execution of Trino, whereas both Spark and Hive on MR3 are fault tolerant by design.
Experiment Setup
Cluster
For experiments, we use a cluster consisting of 1 master and 12 worker nodes with the following properties:
- 2 x Intel(R) Xeon(R) E5-2640 v4 @ 2.40GHz
- 256GB of memory
- 1 x 300GB HDD, 6 x 1TB SSDs
- 10 Gigabit network connection
In total, the amount of memory of worker nodes is 12 * 256GB = 3072GB. The cluster runs HDP (Hortonworks Data Platform) 3.1.4 and uses HDFS replication factor of 3.
TPC-DS benchmark
We use a variant of the TPC-DS benchmark introduced in the previous article which replaces an existing LIMIT clause with a new SELECT clause so that different results from the same query translate to different numbers of rows. The reader can find the set of modified TPC-DS queries in the GitHub repository.
The scale factor for the TPC-DS benchmark is 10TB.
For best performance, we use ORC for Trino and Hive on MR3, and Parquet for Spark.
Configuration
For all the experiments, the amount of memory allocated to JVM in each worker node (for Trino workers, Spark executors, and MR3 workers) is the same.
For Trino, we use a JVM option -Xmx196G
and choose the following configuration
after performance tuning:
memory.heap-headroom-per-node=58GB
query.max-total-memory=1680GB
query.max-memory-per-node=120GB
query.max-memory=1440GB
For Spark, we choose configuration parameters after performance tuning using a TPC-DS dataset of 1TB. For the amount of memory allocated to Spark executors, we use the following configuration:
spark.executor.memory=196g
spark.executor.memoryOverhead=20g
We use Spark shuffle service and enable advanced features such as dynamic partition pruning and adaptive query execution.
For Hive on MR3, we use the default configuration in the MR3 distribution except for the amount of memory allocated to workers.
Test
We sequentially submit 99 queries from the TPC-DS benchmark. We report the total running time, the geometric mean of running times, and the running time of each individual query.
In order to check the correctness, we report the number of rows from each query. If the result contains a single row, we report the sum of all numerical values in it.
Raw data of the experiment results
For the reader's perusal, we attach the table containing the raw data of the experiment results. Here is a link to [Google Docs].
Analysis
#1. Query completion
Spark and Hive on MR3 successfully complete all the queries,
whereas Trino fails to complete query 72
with an error message Query exceeded per-node memory limit of 120GB
.
This is a typical case in which Trino (or any MPP system without fault tolerance)
fails to complete a query because of lack of memory for holding intermediate data.
#2. Correctness
The three systems agree on the result with the following exceptions:
For query 21, Trino returns 3432 rows whereas Spark and Hive on MR3 return 3070 rows.- For both of the subqueries of query 23, Trino returns no rows whereas Spark and Hive on MR3 returns a single row with the same contents.
As all the systems under comparison have been developed independently,
we can conclude (with high confidence) that
Trino returns wrong answers on query 21 and query 23.
[ Update (June 3, 2023). The previous SQL script for query 21 included an integer division. In the case of Trino, the result of an integer division is always an integer. After casting the first operand to double, Trino returns the correct result of 3070 rows. Thank Alfred Davidson for reporting the error in the SQL script. ]
#3. Total running time
In terms of the total running time, Hive on MR3 is as fast as Trino.
- Trino finishes all the queries in 7424 seconds.
- Spark finishes all the queries in 19669 seconds.
- Hive on MR3 finishes all the queries in 7415 seconds.
Trino and Hive on MR3 far outperform Spark. Although Spark uses Java 8 in our experiment, running Spark with Java 17 is unlikely to yield a different conclusion because of a large gap in performance.
For fair comparison, one could use the fault-tolerant execution of Trino. In our experiment, however, Trino with fault-tolerant execution becomes much slower and even unstable. For example, when we use HDFS for an exchange manager, the first four queries of the TPC-DS benchmark produce the following results:
- Query 1 takes 35.225 seconds to complete (from 12.613 seconds).
- Query 2 takes 79.479 seconds to complete (from 29.143 seconds).
- Query 3 takes 169.337 seconds to complete (from 4.934 seconds).
- Query 4 fails after 396.986 seconds.
#4. Response time
In terms of the geometric mean of running times, Trino responds 21 percent faster than Hive on MR3 and 42 percent faster than Spark.
- On average, Trino finishes each query in 21.75 seconds.
- On average, Spark finishes each query in 37.76 seconds.
- On average, Hive on MR3 finishes each query in 27.68 seconds.
#5. Hive on MR3 using Java 17 vs Java 8
By switching from Java 8 to Java 17 (both using G1 GC), Hive on MR3 achieves about 8 percent of reduction in the total running time.
- With Java 17, the total running time is 7415 seconds.
- With Java 8, the total running time is 8074 seconds.
Conclusion
Since its initial release, Hive on MR3 has achieved significant improvements in performance, stability, and features. Despite being fault tolerant, Hive on MR3 runs almost as fast as Trino on the TPC-DS benchmark without failing or returning wrong results. Operating Hive on MR3 is also easy, whether on Hadoop or on Kubernetes. In fact, Hive on MR3 can run in any type of cluster by exploiting standalone mode (similarly to Trino and Spark standalone mode).