Evaluating the Performance and Scalability of an Apache Spark and Hadoop Cluster in a Low-Cost Environment

Natalie Cruz Tumba
Alex Lancho Ramos
Henry Leon Hurtado
Rafael Ricardo Quispe Merma

Abstract

This article presents the design, configuration, and implementation of a distributed computing cluster using Apache Spark and Hadoop on Ubuntu Server 24.04.1 LTS. The architecture consists of one master node and multiple worker nodes connected over a local Ethernet network. The installation and configuration process is described in detail, along with performance tests run with PySpark. The results show that, although a local configuration is more efficient for small datasets (<100 MB), the distributed cluster offers significant improvements for data volumes above 1 GB, supporting its scalability and its viability for educational and research environments with limited resources.
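The local-versus-cluster comparison described in the abstract could be timed with a small harness along the following lines. This is only a sketch under stated assumptions, not the authors' benchmark: the helper names (`time_job`, `speedup`, `make_spark_job`), the use of the median over three runs, and the CSV row count as the workload are all hypothetical choices made for illustration. The Spark job builder imports PySpark lazily, so the timing helpers work without it.

```python
import statistics
import time
from typing import Callable


def time_job(job: Callable[[], object], repeats: int = 3) -> float:
    """Return the median wall-clock seconds over `repeats` runs of `job`."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        job()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


def speedup(local_seconds: float, cluster_seconds: float) -> float:
    """Ratio of local time to cluster time; values above 1 favor the cluster."""
    return local_seconds / cluster_seconds


def make_spark_job(master_url: str, path: str) -> Callable[[], object]:
    """Build a job that counts the rows of a CSV file under the given master.

    Hypothetical workload: the article does not specify its test queries.
    """
    def job() -> int:
        # Imported here so the timing helpers above do not require PySpark.
        from pyspark.sql import SparkSession

        spark = (
            SparkSession.builder
            .master(master_url)
            .appName("bench")
            .getOrCreate()
        )
        try:
            return spark.read.csv(path, header=True).count()
        finally:
            spark.stop()

    return job
```

With PySpark installed, `time_job(make_spark_job("local[*]", path))` could be compared against the same call using the cluster's `spark://host:port` master URL (host, port, and `path` being placeholders), and the two medians passed to `speedup`.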

Article details

How to cite
Evaluación del Rendimiento y la Escalabilidad de un Clúster Apache Spark y Hadoop en un Entorno de Bajo Costo. (2025). C&T Riqchary Revista De investigación En Ciencia Y tecnología, 7(2), 49-53. https://doi.org/10.57166/riqchary.v7.n2.2025.6
Section
Articles

References

J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Communications of the ACM, vol. 51, no. 1, pp. 107-113, Jan. 2008.

M. Zaharia et al., "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing," NSDI '12: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, pp. 15-28, Apr. 2012.

M. Zaharia et al., "Apache Spark: A Unified Engine for Big Data Processing," Communications of the ACM, vol. 59, no. 11, pp. 56-65, Nov. 2016.

K. Shvachko et al., "The Hadoop Distributed File System," IEEE 26th Symposium on Mass Storage Systems and Technologies, pp. 1-10, May 2010.

M. Zaharia et al., "Spark: Cluster Computing with Working Sets," HotCloud '10: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, pp. 10-10, Jun. 2010.

S. Ghemawat, H. Gobioff, and S. T. Leung, "The Google File System," ACM SIGOPS Operating Systems Review, vol. 37, no. 5, pp. 29-43, Oct. 2003.

R. Lämmel, "Google's MapReduce Programming Model — Revisited," Science of Computer Programming, vol. 70, no. 1, pp. 1-30, Jan. 2008.

J. Ekanayake et al., "Twister: A Runtime for Iterative MapReduce," HPDC '10: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, pp. 810-818, Jun. 2010.

T. White, Hadoop: The Definitive Guide, 4th ed. Sebastopol, CA: O'Reilly Media, 2015.

C. Ranger et al., "Evaluating MapReduce for Multi-core and Multiprocessor Systems," HPCA '07: Proceedings of the 13th International Symposium on High Performance Computer Architecture, pp. 13-24, Feb. 2007.

V. K. Vavilapalli et al., "Apache Hadoop YARN: Yet Another Resource Negotiator," SoCC '13: Proceedings of the 4th Annual Symposium on Cloud Computing, pp. 1-16, Oct. 2013.

Ubuntu Documentation Team, "Ubuntu Server Guide," Canonical Ltd., 2024. [Online]. Available: https://ubuntu.com/server/docs

S. Ryza et al., Advanced Analytics with Spark: Patterns for Learning from Data at Scale, 1st ed. Sebastopol, CA: O'Reilly Media, 2015.

A. S. Tanenbaum and D. J. Wetherall, Computer Networks, 5th ed. Boston, MA: Pearson, 2011.

P. Boncz et al., "Breaking the Memory Wall in MonetDB," Communications of the ACM, vol. 51, no. 12, pp. 77-85, Dec. 2008.

Apache Software Foundation, "Apache Hadoop Documentation," 2024. [Online]. Available: https://hadoop.apache.org/docs/stable/

T. Condie et al., "MapReduce Online," NSDI '10: Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation, pp. 21-21, Apr. 2010.

Apache Software Foundation, "Apache Spark Documentation," 2024. [Online]. Available: https://spark.apache.org/docs/latest/

M. Isard et al., "Dryad: Distributed Data-parallel Programs from Sequential Building Blocks," EuroSys '07: Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems, pp. 59-72, Mar. 2007.

H. Karau et al., Learning Spark: Lightning-Fast Big Data Analysis, 1st ed. Sebastopol, CA: O'Reilly Media, 2015.