Code of Conduct
- I agree to follow this project's Code of Conduct
Search before asking
- I have searched in the issues and found no similar issues.
Describe the bug
set spark.sql.catalog.hive_catalog=org.apache.kyuubi.spark.connector.hive.HiveTableCatalog;
use hive_catalog;
drop table if exists test_part_table;
create table test_part_table(
word string,
num bigint
)partitioned by(dt string) stored as orc;
drop table if exists test_part_table_tmp;
create table test_part_table_tmp(
word string,
num bigint,
dt string
);
insert into test_part_table_tmp (word,num,dt) values('1',1,'1111'),('2',2,'2222'),('3',4,'1111');
insert overwrite table test_part_table partition (dt) select word,num,dt from test_part_table_tmp;
org.apache.hadoop.fs.FileAlreadyExistsException: /warehouse/tablespace/hive/test_part_table/.hive-staging_hive_2026-02-26_12-41-55_305_5577159179436818095-1/-ext-10000/_temporary/0/_temporary/attempt_202602261241555610893446772809343_0000_m_000000_0/dt=1111/part-00000-6a1697f8-a24a-40dd-b926-6fd6634c0323.c000 for client 192.168.1.57 already exists
at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.startFile(FSDirWriteFileOp.java:389)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2732)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2625)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:807)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:496)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:621)
at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1094)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1017)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3048)
- Locate the bug code
org.apache.kyuubi.spark.connector.hive.write.FileWriterFactory, line 51
During a dynamic-partition write, a single task thread writes rows for multiple partitions: whenever the partition value changes, Spark closes the current writer and creates a new one. If the task later encounters a partition it has already written, a writer for that partition is created again with the same output file name, and the ORC and Parquet writers do not allow the target file to already exist.
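The mechanism above can be sketched with a small, hypothetical model (this is not the actual Kyuubi FileWriterFactory code; names and signatures are illustrative only): a file name derived solely from the task id collides when a writer is re-created for a partition the task has already visited, whereas appending a fresh unique suffix each time a writer is created avoids the collision.

```scala
// Hypothetical sketch of the file-naming problem; not the real Kyuubi code.
import java.util.UUID
import scala.collection.mutable

object WriterNameSketch {
  // Deterministic name derived only from the task id: re-creating a writer
  // for the same partition yields the same path, which triggers
  // FileAlreadyExistsException on HDFS.
  def fixedName(partition: String, taskId: Int): String =
    s"dt=$partition/part-$taskId.orc"

  // One possible fix: include a fresh unique suffix per writer creation,
  // so every writer gets a distinct path.
  def uniqueName(partition: String, taskId: Int): String =
    s"dt=$partition/part-$taskId-${UUID.randomUUID()}.orc"

  def main(args: Array[String]): Unit = {
    // Rows arrive unsorted by partition, so the task revisits dt=1111,
    // mirroring the repro above.
    val partitionsSeen = Seq("1111", "2222", "1111")
    val created = mutable.Set[String]()
    for (dt <- partitionsSeen) {
      val path = fixedName(dt, taskId = 0)
      if (!created.add(path))
        println(s"collision: $path already exists")
    }
  }
}
```

Running the sketch reports a collision on the second visit to `dt=1111`; switching to `uniqueName` would produce a distinct path for every writer and no collision.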
Another question
create table test_table(
word string,
num bigint
)stored as orc;
insert into test_table values('1',1111);
select * from test_table;
1 1111
insert into test_table values('2',1111);
select * from test_table;
2 1111
1 1111
Within a batch job, Spark may read the same Hive table multiple times, and every read should return the same data.
Affects Version(s)
1.10.3
Kyuubi Server Log Output
Kyuubi Engine Log Output
Kyuubi Server Configurations
Kyuubi Engine Configurations
Additional context
No response
Are you willing to submit PR?
- Yes. I would be willing to submit a PR with guidance from the Kyuubi community to fix.
- No. I cannot submit a PR at this time.