
hive-connector two issues #7335

@CrazyBeeline

Description

Code of Conduct

Search before asking

  • I have searched in the issues and found no similar issues.

Describe the bug

set spark.sql.catalog.hive_catalog org.apache.kyuubi.spark.connector.hive.HiveTableCatalog
use hive_catalog;
drop table test_part_table;
create table test_part_table(
word string,
num bigint 
)partitioned by(dt string) stored as orc;

drop table test_part_table_tmp;
create table test_part_table_tmp(
word string,
num bigint,
dt string
);
insert into test_part_table_tmp (word,num,dt) values('1',1,'1111'),('2',2,'2222'),('3',4,'1111');
insert overwrite table test_part_table partition (dt) select word,num,dt from test_part_table_tmp;
org.apache.hadoop.fs.FileAlreadyExistsException: /warehouse/tablespace/hive/test_part_table/.hive-staging_hive_2026-02-26_12-41-55_305_5577159179436818095-1/-ext-10000/_temporary/0/_temporary/attempt_202602261241555610893446772809343_0000_m_000000_0/dt=1111/part-00000-6a1697f8-a24a-40dd-b926-6fd6634c0323.c000 for client 192.168.1.57 already exists
	at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.startFile(FSDirWriteFileOp.java:389)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2732)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2625)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:807)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:496)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:621)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1094)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1017)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3048)
  1. Bug location: org.apache.kyuubi.spark.connector.hive.write.FileWriterFactory, line 51
     [screenshot of the code omitted]

A single task thread may write rows for multiple partitions. Each time the partition value changes, Spark closes the current writer and creates a new one; if the task later encounters a partition it has already written, the factory creates a second writer for that partition with the same file name. The ORC and Parquet writers do not allow writing to an existing file, which produces the FileAlreadyExistsException above.
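The fix direction can be sketched as follows. This is a minimal, self-contained illustration, not the actual Kyuubi FileWriterFactory API: the class name, `newWriterPath` method, and the per-task counter are all hypothetical. The point is that when a task comes back to a partition it already wrote, the factory must hand out a fresh file name instead of reusing the previous one, because the create call (like HDFS `startFile`) fails if the file exists.

```java
import java.nio.file.*;
import java.util.UUID;

// Hypothetical sketch of a partitioned writer factory. Each call for a
// partition yields a distinct file name, so revisiting dt=1111 after
// writing dt=2222 does not collide with the first file.
class PartitionedWriterFactory {
    private final Path stagingDir;
    private int counter = 0; // per-task writer counter (assumption)

    PartitionedWriterFactory(Path stagingDir) {
        this.stagingDir = stagingDir;
    }

    // Returns a fresh, unique output path for the given partition value.
    Path newWriterPath(String partition) throws Exception {
        Path dir = stagingDir.resolve(partition);
        Files.createDirectories(dir);
        // Unique suffix per writer instance avoids reusing the same
        // part-00000-<uuid>.c000 name for the same partition.
        String name = String.format("part-00000-%s.c%03d",
                UUID.randomUUID(), counter++);
        Path file = dir.resolve(name);
        // createFile mimics HDFS startFile: it throws
        // FileAlreadyExistsException if the file already exists.
        Files.createFile(file);
        return file;
    }
}

public class Main {
    public static void main(String[] args) throws Exception {
        Path staging = Files.createTempDirectory("staging");
        PartitionedWriterFactory f = new PartitionedWriterFactory(staging);
        // The task revisits partition dt=1111 after writing dt=2222.
        Path first = f.newWriterPath("dt=1111");
        f.newWriterPath("dt=2222");
        Path second = f.newWriterPath("dt=1111"); // would throw if reused
        System.out.println(!first.equals(second)); // prints true
    }
}
```

With a fixed name per partition, the second `newWriterPath("dt=1111")` call would throw exactly the exception shown in the stack trace above.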

Another issue

create table test_table(
word string,
num bigint 
)stored as orc;

insert into test_table values('1',1111);

select * from test_table;

1	1111

insert into test_table values('2',1111);

select * from test_table;

2	1111
1	1111

In batch processing, Spark may need to read the same Hive data multiple times within one job, and those repeated reads should return identical data.
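The expected behavior can be sketched as follows. This is a minimal illustration under assumptions, not the connector's actual Scan implementation: the `SnapshotScan` class and its file-per-line layout are hypothetical. The idea is that the scan captures the table's file list once at planning time, so re-reading the same scan returns the same rows even if a concurrent insert adds new files to the table directory.

```java
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch of snapshot-at-planning semantics: the file list
// is captured once when the scan is built, making repeated reads of
// the same scan return identical data.
class SnapshotScan {
    private final List<Path> files; // immutable snapshot of table files

    SnapshotScan(Path tableDir) throws Exception {
        try (Stream<Path> s = Files.list(tableDir)) {
            this.files = s.sorted().collect(Collectors.toUnmodifiableList());
        }
    }

    // Reads only the files captured in the snapshot.
    List<String> read() throws Exception {
        List<String> rows = new ArrayList<>();
        for (Path p : files) rows.addAll(Files.readAllLines(p));
        return rows;
    }
}

public class Main {
    public static void main(String[] args) throws Exception {
        Path table = Files.createTempDirectory("test_table");
        Files.write(table.resolve("part-0"), List.of("1\t1111"));
        SnapshotScan scan = new SnapshotScan(table); // plan the scan once
        List<String> first = scan.read();
        // A concurrent insert adds a new file after planning.
        Files.write(table.resolve("part-1"), List.of("2\t1111"));
        List<String> second = scan.read();
        System.out.println(first.equals(second)); // prints true
    }
}
```

Without such a snapshot, each read re-lists the table directory and can observe rows inserted between reads, which breaks the consistency that batch jobs expect.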

Affects Version(s)

1.10.3

Kyuubi Server Log Output


Kyuubi Engine Log Output


Kyuubi Server Configurations


Kyuubi Engine Configurations


Additional context

No response

Are you willing to submit PR?

  • Yes. I would be willing to submit a PR with guidance from the Kyuubi community to fix.
  • No. I cannot submit a PR at this time.
