Analysis of Apache Hadoop Architecture in Supporting Large-Scale Data Processing

Teuku Nabil Muhammad Dhuha
Asrianda
Muhammad Fikry

Abstract

The rapid development of information technology has driven exponential growth in the data generated across sectors such as healthcare services, social media, information systems, and other digital activities. This growth has given rise to the concept of big data: datasets that conventional data processing technologies cannot handle optimally. Distributed computing platforms are therefore required to store and process data efficiently at this scale. Apache Hadoop is a widely used big data technology because its distributed architecture supports scalability, parallel processing, and fault tolerance. This study analyzes the architecture of Apache Hadoop and explains the role of each component in supporting large-scale data processing. The research method is a qualitative literature study based on a review of books, scientific articles, and related publications on Hadoop. The results indicate that Hadoop consists of three main components: the Hadoop Distributed File System (HDFS), which provides distributed storage; MapReduce, a programming model for parallel data processing; and Yet Another Resource Negotiator (YARN), which handles cluster resource management and scheduling. Together, these components enable Hadoop to manage large-scale data reliably across distributed nodes. However, Hadoop's batch-based processing model makes it less suitable for real-time processing, so complementary technologies should be considered according to application requirements.
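The MapReduce programming model named in the abstract can be illustrated with a minimal single-process sketch. This is not Hadoop code and does not use any Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are hypothetical, chosen only to show how a word count is expressed as a map step, a grouping step, and a reduce step.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(records):
    # On a real cluster the map calls could run on different nodes in parallel;
    # in this illustrative sketch they simply run one after another.
    mapped = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data needs big clusters", "data moves to clusters"]
    print(word_count(sample))
```

Because each map call depends only on its own record and each reduce call only on the values for its own key, the work can be split across machines, which is the property that lets the model scale.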

Article Details

How to Cite
Muhammad Dhuha, T. N., Asrianda, & Muhammad Fikry. (2025). Analysis of Apache Hadoop Architecture in Supporting Large-Scale Data Processing. Jurnal Informasi Dan Teknologi, 1-6. https://doi.org/10.60083/jidt.vi0.711
Author Biographies

Teuku Nabil Muhammad Dhuha, Universitas Malikussaleh

Department of Informatics

Asrianda, Universitas Malikussaleh

Department of Informatics

Muhammad Fikry, Universitas Malikussaleh

Department of Informatics
