Connecting to Hive 3.1.2 with the Hive CLI and querying the Hive table mapped from the Hudi table, the following exception shows up:
hive (flk_hive)> select * from status_h2h limit 10;
22/10/24 15:22:07 INFO conf.HiveConf: Using the default value passed in for log id: 0f8a42a6-8195-413a-90dc-a31f7f96f1f0
22/10/24 15:22:07 INFO session.SessionState: Updating thread name to 0f8a42a6-8195-413a-90dc-a31f7f96f1f0 main
22/10/24 15:22:07 INFO ql.Driver: Compiling command(queryId=hadoop_20221024152207_133658b2-28c5-4a69-9b2f-b4b2ce99994a): select * from status_h2h limit 10
22/10/24 15:22:07 INFO ql.Driver: Concurrency mode is disabled, not creating a lock manager
22/10/24 15:22:07 INFO parse.SemanticAnalyzer: Starting Semantic Analysis
22/10/24 15:22:07 INFO parse.SemanticAnalyzer: Completed phase 1 of Semantic Analysis
22/10/24 15:22:07 INFO parse.SemanticAnalyzer: Get metadata for source tables
FAILED: RuntimeException java.lang.ClassNotFoundException: org.apache.hudi.hadoop.HoodieParquetInputFormat
22/10/24 15:22:08 ERROR ql.Driver: FAILED: RuntimeException java.lang.ClassNotFoundException: org.apache.hudi.hadoop.HoodieParquetInputFormat
java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hudi.hadoop.HoodieParquetInputFormat
    at org.apache.hadoop.hive.ql.metadata.Table.getInputFormatClass(Table.java:324)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:2191)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:2075)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genResolvedParseTree(SemanticAnalyzer.java:12033)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:12129)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:11676)
    at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:285)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:659)
    at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1826)
    at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:1773)
    at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:1768)
    at org.apache.hadoop.hive.ql.reexec.ReExecDriver.compileAndRespond(ReExecDriver.java:126)
    at org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:214)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:239)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:188)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:402)
    at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:821)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:683)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:313)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:227)
Caused by: java.lang.ClassNotFoundException: org.apache.hudi.hadoop.HoodieParquetInputFormat
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.hadoop.hive.ql.metadata.Table.getInputFormatClass(Table.java:321)
    ... 24 more
From the error message Caused by: java.lang.ClassNotFoundException: org.apache.hudi.hadoop.HoodieParquetInputFormat, we can infer that the exception is caused by a missing Jar on Hive's classpath.
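To confirm that the class really is absent before changing anything, a quick search of the jars Hive loads can help. A minimal sketch, assuming $HIVE_HOME points at the Hive installation and that lib/ and auxlib/ are the directories Hive loads jars from (adjust the paths to the actual cluster layout):

# Assumption: HIVE_HOME points at the Hive installation; adjust for your environment.
for jar in $HIVE_HOME/lib/*.jar $HIVE_HOME/auxlib/*.jar; do
  if unzip -l "$jar" 2>/dev/null | grep -q 'org/apache/hudi/hadoop/HoodieParquetInputFormat.class'; then
    echo "found in $jar"
  fi
done

If nothing is printed, no jar on the path provides the InputFormat and the fix below applies.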
Looking at the Hudi 0.10.0 Hive integration documentation (documentation link), it says the hudi-hadoop-mr-bundle Jar has to be placed under $HIVE_HOME/auxlib. If the auxlib directory does not exist, create it (watch out for ownership and permission issues).
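The copy step itself is short; a minimal sketch, assuming the CDH-style layout shown later in this post and that the bundle has already been built from the Hudi source tree (the paths and the hive service user here are assumptions, adjust them to your environment):

# Assumed paths: CDH-style HIVE_HOME and the default Hudi build output location.
export HIVE_HOME=/opt/cloudera/parcels/CDH/lib/hive
mkdir -p $HIVE_HOME/auxlib    # create auxlib if it does not exist yet
cp packaging/hudi-hadoop-mr-bundle/target/hudi-hadoop-mr-bundle-0.10.0.jar $HIVE_HOME/auxlib/
chown hive:hive $HIVE_HOME/auxlib/hudi-hadoop-mr-bundle-0.10.0.jar    # assumed service user; use whatever user Hive runs as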
Note: to integrate Hudi with Hive 3.1.2, the Hudi bundle has to be recompiled, passing the -Pflink-bundle-shade-hive3 profile at build time.
The Hive dependency targeted by the build is org.apache.hive 3.1.2 (core).
mvn clean install -DskipTests -Dmaven.test.skip=true -DskipITs -Dcheckstyle.skip=true -Drat.skip=true -Dhadoop.version=3.0.0-cdh6.3.2 -Pflink-bundle-shade-hive3 -Dscala-2.12 -Pspark-shade-unbundle-avro
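Once the build finishes, it is worth confirming that the resulting bundle actually contains the class Hive could not load. A rough check, assuming the default output location inside the Hudi source tree (path is illustrative):

# Illustrative path inside the Hudi source tree after the build.
ls -l packaging/hudi-hadoop-mr-bundle/target/hudi-hadoop-mr-bundle-0.10.0.jar
unzip -l packaging/hudi-hadoop-mr-bundle/target/hudi-hadoop-mr-bundle-0.10.0.jar | grep HoodieParquetInputFormat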
[root@p0-tklcdh-nn03 auxlib]# pwd
/opt/cloudera/parcels/CDH/lib/hive/auxlib
[root@p0-tklcdh-nn03 auxlib]# ls -l
total 16892
-rw-r--r-- 1 appadmin appadmin 17294810 Oct 24 16:41 hudi-hadoop-mr-bundle-0.10.0.jar
Special note: the build parameters and the Hive version must be specified correctly; otherwise the resulting Jar differs (note the different file sizes below) and all sorts of strange problems can show up.
# Size of hudi-hadoop-mr-bundle-0.10.0.jar built for Hive 2
[hadoop@p0-tklfrna-tklrna-device02 auxlib]$ ls -l ../../../jars/hudi-hadoop-mr-bundle-0.10.0.jar
-rw-r--r-- 1 root root 17289727 Mar 28 2022 ../../../jars/hudi-hadoop-mr-bundle-0.10.0.jar

# Size of hudi-hadoop-mr-bundle-0.10.0.jar built for Hive 3
[root@p0-tklcdh-nn03 auxlib]# ls -l
total 16892
-rw-r--r-- 1 appadmin appadmin 17294810 Oct 24 16:41 hudi-hadoop-mr-bundle-0.10.0.jar
With the Hudi dependency in place under $HIVE_HOME/auxlib, restart the Hive Metastore and HiveServer2 and run the query again from the Hive CLI:
22/10/24 17:12:39 INFO ql.Driver: Executing command(queryId=root_20221024171238_96b1962b-b9b4-44b2-a554-2c7055fdf253): select * from status_h2h limit 10
22/10/24 17:12:39 INFO ql.Driver: Completed executing command(queryId=root_20221024171238_96b1962b-b9b4-44b2-a554-2c7055fdf253); Time taken: 0.001 seconds
OK
22/10/24 17:12:39 INFO ql.Driver: OK
22/10/24 17:12:39 INFO ql.Driver: Concurrency mode is disabled, not creating a lock manager
22/10/24 17:12:39 INFO utils.HoodieInputFormatUtils: Reading hoodie metadata from path hdfs://10.132.62.2/hudi/flk_hudi/status_hudi
22/10/24 17:12:39 INFO table.HoodieTableMetaClient: Loading HoodieTableMetaClient from hdfs://10.132.62.2/hudi/flk_hudi/status_hudi
22/10/24 17:12:39 INFO table.HoodieTableConfig: Loading table properties from hdfs://10.132.62.2/hudi/flk_hudi/status_hudi/.hoodie/hoodie.properties
22/10/24 17:12:39 INFO table.HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from hdfs://10.132.62.2/hudi/flk_hudi/status_hudi
22/10/24 17:12:39 INFO utils.HoodieInputFormatUtils: Found a total of 1 groups
22/10/24 17:12:40 INFO timeline.HoodieActiveTimeline: Loaded instants upto : Option{val=[==>20221024171048577__commit__INFLIGHT]}
22/10/24 17:12:40 INFO view.FileSystemViewManager: Creating InMemory based view for basePath hdfs://10.132.62.2/hudi/flk_hudi/status_hudi
22/10/24 17:12:40 INFO view.AbstractTableFileSystemView: Took 4 ms to read 0 instants, 0 replaced file groups
22/10/24 17:12:40 INFO util.ClusteringUtils: Found 0 files in pending clustering operations
22/10/24 17:12:40 INFO view.AbstractTableFileSystemView: Building file system view for partition ()
22/10/24 17:12:40 INFO view.AbstractTableFileSystemView: addFilesToView: NumFiles=10297, NumFileGroups=16, FileGroupsCreationTime=449, StoreTimeTaken=2
22/10/24 17:12:40 INFO utils.HoodieInputFormatUtils: Total paths to process after hoodie filter 16
22/10/24 17:12:40 INFO view.AbstractTableFileSystemView: Took 1 ms to read 0 instants, 0 replaced file groups
22/10/24 17:12:40 INFO util.ClusteringUtils: Found 0 files in pending clustering operations
22/10/24 17:12:40 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
22/10/24 17:12:41 INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 16680 records.
22/10/24 17:12:41 INFO hadoop.InternalParquetRecordReader: at row 0. reading next block
22/10/24 17:12:41 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
22/10/24 17:12:41 INFO compress.CodecPool: Got brand-new decompressor [.gz]
22/10/24 17:12:41 INFO hadoop.InternalParquetRecordReader: block read in memory in 42 ms. row count = 16680
Done
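As a side note, if restarting the Hive Metastore and HiveServer2 is not convenient right away, Hive's ADD JAR statement can load the bundle for the current session only. This is a temporary, session-scoped workaround (it may not cover every query path), not a replacement for the auxlib setup; the path below is the one used above:

hive (flk_hive)> ADD JAR /opt/cloudera/parcels/CDH/lib/hive/auxlib/hudi-hadoop-mr-bundle-0.10.0.jar;
hive (flk_hive)> select * from status_h2h limit 10;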