Evaluating the Performance and Scalability of an Apache Spark and Hadoop Cluster in a Low-Cost Environment

Natalie Cruz Tumba
Alex Lancho Ramos
Henry Leon Hurtado
Rafael Ricardo Quispe Merma

Abstract

This article presents the design, configuration, and implementation of a distributed computing cluster built with Apache Spark and Hadoop on Ubuntu Server 24.04.1 LTS. The architecture consists of one master node and multiple worker nodes connected over a local Ethernet network. The article details the installation and configuration steps and the performance tests run with PySpark. The results show that a local (single-machine) configuration is more efficient for small datasets (under 100 MB), whereas the distributed cluster offers significant improvements for data volumes greater than 1 GB. This supports the cluster's scalability and its viability for resource-constrained educational and research environments.
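The local-versus-cluster comparison described in the abstract can be illustrated with a minimal timing sketch. This is not the authors' benchmark: the helper names `time_job` and `make_spark_job`, the master URL `spark://master:7077` (a standalone-mode address, whereas the actual cluster may run on YARN), the input path, and the group-and-count workload are all assumptions chosen for illustration.

```python
import time
from statistics import median


def time_job(job, repeats=3):
    """Run job() `repeats` times and return the median wall-clock time in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        job()
        samples.append(time.perf_counter() - start)
    return median(samples)


def make_spark_job(master_url, input_path):
    """Return a zero-argument job that runs a simple aggregation under `master_url`.

    Hypothetical workload: read a CSV file and count rows per value of its
    first column, which forces a full scan and a shuffle.
    """
    def job():
        # Imported lazily so the timing helper works without PySpark installed.
        from pyspark.sql import SparkSession

        spark = (
            SparkSession.builder
            .master(master_url)
            .appName("local-vs-cluster-benchmark")
            .getOrCreate()
        )
        try:
            df = spark.read.csv(input_path, header=True, inferSchema=True)
            df.groupBy(df.columns[0]).count().collect()
        finally:
            # Stop the session so the next run starts from a clean state.
            spark.stop()

    return job


# Example usage (assumed host name and path; adjust to the real cluster):
# for label, master in [("local", "local[*]"), ("cluster", "spark://master:7077")]:
#     seconds = time_job(make_spark_job(master, "hdfs:///data/sample.csv"))
#     print(f"{label}: {seconds:.2f} s")
```

Each measured run includes session startup, which is part of why a local configuration can win on small inputs: fixed scheduling and network overhead dominates until the dataset is large enough for parallelism to pay off.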

Article Details

How to Cite
Evaluating the Performance and Scalability of an Apache Spark and Hadoop Cluster in a Low-Cost Environment. (2025). C&T Riqchary Science and Technology Research Magazine, 7(2), 49-53. https://doi.org/10.57166/riqchary.v7.n2.2025.6
Section
Articles
