Comparison of Compact Generative Models for Automatic Question Answering in Spanish via Retrieval-Augmented Generation
Abstract
This study compares five compact generative models (≤ 8 billion parameters) for Spanish question answering under a retrieval-augmented generation (RAG) pipeline executed locally. We assess response quality using F1, BLEU-4, and an external semantic judge (LLM-Judge), alongside efficiency indicators (P95 latency, memory footprint, and GPU/CPU usage). Results show that Mistral 7B achieves the highest average F1 and semantic scores, whereas OpenHermes 7B attains nearly identical accuracy with the lowest memory footprint. Zephyr 7B-β performs well on very long documents, and Phi-3 Mini minimizes tail latency under adverse conditions. A Pareto analysis of F1 against RAM usage identifies Mistral 7B and OpenHermes 7B as non-dominated solutions, which yields practical selection guidelines that depend on the operational goal (maximum accuracy vs. resource efficiency). The paper contributes a reproducible Spanish-language comparison under RAG and actionable criteria for local deployments.
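The non-dominance criterion behind the Pareto analysis can be illustrated with a minimal sketch. It assumes each model is summarized by two numbers, an average F1 (higher is better) and a RAM footprint in GB (lower is better). The function name `pareto_front`, the labels `model_a` to `model_d`, and all values are hypothetical placeholders for illustration, not results from the study.

```python
from typing import Dict, List, Tuple


def pareto_front(points: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the names of points that no other point dominates.

    Each value is (f1, ram_gb). A point is dominated when some other point
    is at least as good on both axes (F1 >= and RAM <=) and strictly
    better on at least one of them.
    """
    front = []
    for name, (f1, ram) in points.items():
        dominated = any(
            (o_f1 >= f1 and o_ram <= ram) and (o_f1 > f1 or o_ram < ram)
            for other, (o_f1, o_ram) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical placeholder values, NOT measurements from the paper.
candidates = {
    "model_a": (0.62, 6.0),  # best F1, moderate RAM
    "model_b": (0.61, 4.5),  # slightly lower F1, lowest RAM
    "model_c": (0.55, 5.0),  # worse than model_b on both axes
    "model_d": (0.50, 7.0),  # worse than model_a on both axes
}

print(pareto_front(candidates))
```

With these placeholder numbers, `model_a` and `model_b` remain on the front: one offers the highest F1 and the other the lowest memory use, so neither can replace the other without giving something up.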
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
When an author creates an article and publishes it in a journal, the copyright passes to the journal as part of the publishing agreement. Therefore, the journal becomes the owner of the rights to reproduce, distribute and sell the article. The author retains some rights, such as the right to be recognized as the creator of the article and the right to use the article for his or her own scholarly or research purposes, unless otherwise agreed in the publication agreement.