Evaluating the Performance and Scalability of an Apache Spark and Hadoop Cluster in a Low-Cost Environment

Natalie Cruz Tumba; Alex Lancho Ramos; Henry Leon Hurtado; Rafael Ricardo Quispe Merma; Evelyn Naida Luque Ochoa

doi:10.57166/riqchary.v7.n2.2025.6

Published: Aug 19, 2025

DOI: https://doi.org/10.57166/riqchary.v7.n2.2025.6

Keywords:

Apache Spark, data processing, Hadoop, distributed cluster, distributed computing, low-cost cluster

Natalie Cruz Tumba

Micaela Bastidas National University of Apurímac

https://orcid.org/0009-0006-3110-2087

Alex Lancho Ramos

Micaela Bastidas National University of Apurímac

https://orcid.org/0009-0008-5493-397X

Henry Leon Hurtado

Micaela Bastidas National University of Apurímac

https://orcid.org/0009-0001-8216-1232

Rafael Ricardo Quispe Merma

Micaela Bastidas National University of Apurímac

https://orcid.org/0000-0002-8980-4560

Evelyn Naida Luque Ochoa

https://orcid.org/0000-0002-8386-9806

Abstract

This article presents the design, configuration, and implementation of a distributed computing cluster running Apache Spark and Hadoop on Ubuntu Server 24.04.1 LTS. The architecture consists of one master node and multiple slave nodes connected over a local Ethernet network. The article details the installation and configuration process and the performance tests run with PySpark. The results show that, while a local configuration is more efficient for small datasets (<100 MB), the distributed cluster offers significant improvements for data volumes greater than 1 GB, supporting its scalability and viability for resource-constrained educational and research environments.
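The local-versus-cluster comparison described in the abstract can be illustrated with a small timing harness. This is only a sketch under stated assumptions, not the authors' benchmark: the abstract does not name the workload, so a word count is used as a stand-in, and the helper names (`time_job`, `speedup`, `make_spark_job`) are hypothetical. The master URL and input path are supplied by the caller.

```python
import statistics
import time


def time_job(job, repeats=3):
    """Run job() `repeats` times and return the median wall-clock seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        job()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(local_seconds, cluster_seconds):
    """Ratio of local to cluster time; a value above 1 means the cluster was faster."""
    return local_seconds / cluster_seconds


def make_spark_job(master_url, input_path):
    """Return a zero-argument callable that runs a word count on `master_url`.

    Assumption: word count stands in for the paper's unspecified workload.
    pyspark is imported lazily, so the timing helpers work without it.
    Note that each run also pays the session start-up cost.
    """
    def job():
        from pyspark.sql import SparkSession
        from pyspark.sql import functions as F

        spark = (
            SparkSession.builder
            .master(master_url)
            .appName("benchmark-sketch")
            .getOrCreate()
        )
        try:
            lines = spark.read.text(input_path)
            words = lines.select(
                F.explode(F.split(F.col("value"), r"\s+")).alias("word")
            )
            # The final count() is an action, so it forces the job to execute.
            words.groupBy("word").count().count()
        finally:
            spark.stop()

    return job
```

To compare the two modes, one would time `make_spark_job("local[*]", path)` against `make_spark_job("spark://<master-host>:7077", path)` for the same input and pass both medians to `speedup`. The host name is a placeholder, and 7077 is the default port of a Spark standalone master. Repeating this for inputs below 100 MB and above 1 GB would reproduce the kind of comparison the abstract reports.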

How to Cite

Evaluating the Performance and Scalability of an Apache Spark and Hadoop Cluster in a Low-Cost Environment. (2025). C&T Riqchary Science and Technology Research Magazine, 7(2), 49-53. https://doi.org/10.57166/riqchary.v7.n2.2025.6

Issue

Vol. 7 No. 2 (2025): COINCITEC 2025

Section

Articles

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

When an author creates an article and publishes it in a journal, the copyright passes to the journal as part of the publishing agreement. Therefore, the journal becomes the owner of the rights to reproduce, distribute and sell the article. The author retains some rights, such as the right to be recognized as the creator of the article and the right to use the article for his or her own scholarly or research purposes, unless otherwise agreed in the publication agreement.
