Evaluating the Performance and Scalability o fan Apache Spark and Hadoop Cluster in a Low-Cost Environment
Main Article Content
Abstract
This article presents the design, configuration, and implementation of a distributed computing cluster using Apache Spark and Hadoop on Ubuntu Server 24.04.1 LTS. The architecture consists of a master node and multiple slave nodes connected to a local network via Ethernet. The installation, configuration, and performance testing process with PySpark are detailed. The results demonstrate that, while a local configuration is more efficient for small datasets (<100 MB), the distributed cluster offers significant improvements for data volumes greater than 1 GB, validating its scalability and viability for resource-constrained educational and research environments.
Article Details

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
When an author creates an article and publishes it in a journal, the copyright passes to the journal as part of the publishing agreement. Therefore, the journal becomes the owner of the rights to reproduce, distribute and sell the article. The author retains some rights, such as the right to be recognized as the creator of the article and the right to use the article for his or her own scholarly or research purposes, unless otherwise agreed in the publication agreement.
How to Cite
References
J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Communica-tions of the ACM, vol. 51, no. 1, pp. 107-113, Jan. 2008.
M. Zaharia et al., "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing," NSDI '12: Proceedings of the 9th USE-NIX Conference on Networked Systems Design and Im-plementation, pp. 15-28, Apr. 2012.
M. Zaharia et al., "Apache Spark: A Unified Engine for Big Data Processing," Communications of the ACM, vol. 59, no. 11, pp. 56-65, Nov. 2016.
K. Shvachko et al., "The Hadoop Distributed File System," IEEE 26th Symposium on Mass Storage Sys-tems and Technologies, pp. 1-10, May 2010.
M. Zaharia et al., "Spark: Cluster Computing with Working Sets," HotCloud '10: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, pp. 10-10, Jun. 2010.
S. Ghemawat, H. Gobioff, and S. T. Leung, "The Google File System," ACM SIGOPS Operating Systems Review, vol. 37, no. 5, pp. 29-43, Oct. 2003.
R. Lämmel, "Google's MapReduce Programming Model — Revisited," Science of Computer Program-ming, vol. 70, no. 1, pp. 1-30, Jan. 2008.
J. Ekanayake et al., "Twister: A Runtime for Iterative MapReduce," HPDC '10: Proceedings of the 19th ACM International Symposium on High Performance Distribut-ed Computing, pp. 810-818, Jun. 2010.
T. White, Hadoop: The Definitive Guide, 4th ed. Sebas-topol, CA: O'Reilly Media, 2015.
C. Ranger et al., "Evaluating MapReduce for Multi-core and Multiprocessor Systems," HPCA '07: Pro-ceedings of the 13th International Symposium on High Performance Computer Architecture, pp. 13-24, Feb. 2007.
V. K. Vavilapalli et al., "Apache Hadoop YARN: Yet Another Resource Negotiator," SoCC '13: Proceedings of the 4th Annual Symposium on Cloud Computing, pp. 1-16, Oct. 2013.
Ubuntu Documentation Team, "Ubuntu Server Guide," Canonical Ltd., 2024. [Online]. Available: https://ubuntu.com/server/docs
S. Ryza et al., Advanced Analytics with Spark: Patterns for Learning from Data at Scale, 1st ed. Sebastopol, CA: O'Reilly Media, 2015.
A. S. Tanenbaum and D. J. Wetherall, Computer Net-works, 5th ed. Boston, MA: Pearson, 2011.
P. Boncz et al., "Breaking the Memory Wall in MonetDB," Communications of the ACM, vol. 51, no. 12, pp. 77-85, Dec. 2008.
Apache Software Foundation, "Apache Hadoop Doc-umentation," 2024. [Online]. Available: https://hadoop.apache.org/docs/stable/
T. Condie et al., "MapReduce Online," NSDI '10: Pro-ceedings of the 7th USENIX Conference on Networked Systems Design and Implementation, pp. 21-21, Apr. 2010.
Apache Software Foundation, "Apache Spark Docu-mentation," 2024. [Online]. Available: https://spark.apache.org/docs/latest/
M. Isard et al., "Dryad: Distributed Data-parallel Programs from Sequential Building Blocks," EuroSys '07: Proceedings of the 2nd ACM SIGOPS/EuroSys Eu-ropean Conference on Computer Systems, pp. 59-72, Mar. 2007.
H. Karau et al., Learning Spark: Lightning-Fast Big Da-ta Analysis, 1st ed. Sebastopol, CA: O'Reilly Media, 2015.