Research on General File Storage Scheme Based on Hadoop
DOI: 10.23977/CNCI2020077
Author(s)
Huimin Liu and Huijie Liu
Corresponding Author
Huimin Liu
ABSTRACT
HDFS, Hadoop's underlying distributed file storage system, is designed for large files. When it stores and serves a large number of small files, read and write rates drop and the NameNode's memory load becomes excessive. This paper therefore optimizes HDFS for files of differing sizes. First, historical access logs are analyzed and correlations between files are mined with the Apriori algorithm to build a correlation probability model; a correlation-based directed-graph merging algorithm is then proposed. To address the reduced access rates caused by merging small files, a prefetching strategy and a heat-based LRU replacement strategy are introduced. Experimental results show that this scheme effectively reduces the NameNode's metadata volume, improves memory utilization, and improves file read and write performance.
KEYWORDS
Merging algorithm; correlation; general file storage; cache replacement strategy
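The heat-based LRU replacement strategy mentioned in the abstract can be sketched as a cache that tracks an access-count "heat" per entry and evicts the coldest entry, breaking ties by recency. This is a minimal illustrative sketch under assumed semantics (heat = access count, eviction prefers low heat then least-recent), not the authors' implementation; the class and method names are hypothetical.

```python
from collections import OrderedDict

class HeatLRUCache:
    """Sketch of a heat-weighted LRU cache (assumed semantics, not the paper's code).

    Each entry carries a heat counter (number of accesses). On overflow,
    the entry with the lowest heat is evicted; ties are broken in favor
    of the least recently used entry.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        # key -> (value, heat); OrderedDict keeps least-recent first
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None
        value, heat = self.entries.pop(key)
        self.entries[key] = (value, heat + 1)  # bump heat, move to MRU end
        return value

    def put(self, key, value):
        if key in self.entries:
            _, heat = self.entries.pop(key)
            self.entries[key] = (value, heat + 1)
            return
        if len(self.entries) >= self.capacity:
            # Evict the coldest entry; min() scans in recency order,
            # so ties resolve to the least recently used key.
            victim = min(self.entries, key=lambda k: self.entries[k][1])
            del self.entries[victim]
        self.entries[key] = (value, 1)
```

A frequently re-read merged block accumulates heat and survives eviction even when it has not been touched most recently, which is the behavior the abstract attributes to the high-heat replacement policy.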