Research on Hive Integration Application of Data Warehouse Based on Big Data Platform
DOI: 10.23977/ICAMCS2022.033
Author(s)
Haixia Wang, Shuguang Cui
Corresponding Author
Shuguang Cui
ABSTRACT
Hive queries have high latency for two reasons: Hive has no indexes, so every query must scan the whole table, and each HQL statement is converted into a MapReduce program, which adds execution delay. A conventional database has lower latency by comparison, but when the data scale is very large, Hive's parallel computing shows its advantages. This paper studies the application of Hive integration in a data warehouse built on a big data platform. Based on the analysis and design of a Hive-based online learning data warehouse, it puts forward a concrete implementation scheme for that warehouse, which provides high scalability by drawing on the virtualization technology of the university cloud platform. Because source data differ in origin and format, it is generally necessary to customize data extraction and conversion tools, write data cleaning programs, check the consistency of the data, and load the complete data into the data warehouse environment on a regular schedule. Alongside the ETL implementation, and because all tables in Hive are stored in date partitions, this paper also implements a script that deletes fixed partitions and is executed through Shell commands.
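The partition-deletion step can be illustrated with a minimal sketch. It is not the authors' script: the table name `ods_learning_log`, the partition column `dt`, and both helper functions are hypothetical, chosen only to show how date-partitioned Hive tables might be cleaned up with HiveQL `ALTER TABLE ... DROP IF EXISTS PARTITION` statements passed to the Hive CLI via `hive -e`.

```python
from datetime import date, timedelta
import shlex


def drop_partition_statements(tables, start, days, partition_key="dt"):
    """Build one DROP PARTITION statement per table and per day.

    `tables` and `partition_key` are illustrative assumptions; the paper
    only states that all tables are partitioned by date.
    """
    statements = []
    for offset in range(days):
        day = (start + timedelta(days=offset)).isoformat()
        for table in tables:
            statements.append(
                f"ALTER TABLE {table} DROP IF EXISTS "
                f"PARTITION ({partition_key}='{day}');"
            )
    return statements


def hive_command(statements):
    """Wrap the statements in a single shell command for the Hive CLI."""
    return "hive -e " + shlex.quote(" ".join(statements))


if __name__ == "__main__":
    stmts = drop_partition_statements(["ods_learning_log"], date(2022, 1, 1), 2)
    # Print the command instead of running it, so the sketch is safe to try.
    print(hive_command(stmts))
```

Generating the statements separately from the shell command keeps the date arithmetic testable without a Hive installation; the printed command could then be run from a Shell script or a scheduler.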
KEYWORDS
Big data; Data warehouse; Hive