
Bug Report: cluster: Possible errors occurred with LSH. #27

@sail-amir

Description

Hi,
I ran Deckard to detect clones on a dataset of 47k source files. However, after a day of execution it failed with an error. In the following, you can find the contents of the different log files.

cluster_vdb_50_4_g9_2.50998_30_100000

Clustering 'vectors/vdb_50_4_g9_2.50998_30_100000' 6.513064 ...
/home/local/SAIL/amir/tasks/RQ2/RQ2.2/Deckard/src/lsh/bin/enumBuckets -R 6.513064 -M 7600000000 -b 2 -A -f vectors/vdb_50_4_g9_2.50998_30_100000 -c -p vectors/vdb_50_4_g9_2.50998_30_100000.param > clusters/cluster_vdb_50_4_g9_2.50998_30_100000
Warning: output all clones. Takes more time...
Warning: will compute parameters
Error: the structure supports at most 2097151 points (3238525 were specified).

real 2m58.162s
user 2m50.464s
sys 0m7.492s
cluster: Possible errors occurred with LSH. Check log: times/cluster_vdb_50_4_g9_2.50998_30_100000
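For reference, 2097151 is 2^21 - 1, which suggests the LSH structure packs point indices into 21 bits; if that is the case, raising -M cannot get past this limit once a single group holds more than ~2.1M vectors. A quick way to check a group against the limit (a sketch, assuming one vector per line in the group file):

# Sketch: compare a group's size against the LSH point limit.
# Assumes one vector per line in the group file.
limit=$(( (1 << 21) - 1 ))   # 2097151, the limit from the error above
count=$(wc -l < vectors/vdb_50_4_g9_2.50998_30_100000)
echo "group has $count vectors; LSH supports at most $limit"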

paramsetting_50_4_0.79_30

paramsetting: 50 4 0.79 ...Looking for optimal parameters by Clustering 'vectors/vdb_50_4_g9_2.50998_30_100000' 6.513064 ...
/home/local/SAIL/amir/tasks/RQ2/RQ2.2/Deckard/src/lsh/bin/enumBuckets -R 6.513064 -M 7600000000 -b 2 -A -f vectors/vdb_50_4_g9_2.50998_30_100000 -c -p vectors/vdb_50_4_g9_2.50998_30_100000.param > clusters/cluster_vdb_50_4_g9_2.50998_30_100000
cluster: Possible errors occurred with LSH. Check log: times/cluster_vdb_50_4_g9_2.50998_30_100000
Error: paramsetting failure...exit.

grouping_50_4_2.50998_30

grouping: vectors/vdb_50_4 with distance=2.50998...Total 7602630 vectors read in; 11282415 vectors dispatched into 57 ranges (actual groups may be many fewer).

real 410m12.610s
user 6m43.592s
sys 26m6.544s
Done grouping 50 4 2.50998. See groups in vectors/vdb_50_4_g[0-9]_2.50998_30
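To see up front which ranges exceed the LSH point limit, the group files could be counted before clustering (again a sketch, assuming one vector per line; the glob follows the file names in the log above):

# Sketch: list vector counts per group file, largest first; any group
# above 2097151 vectors will trip the LSH error.
for f in vectors/vdb_50_4_g*_2.50998_30*; do
	printf '%9d  %s\n' "$(wc -l < "$f")" "$f"
done | sort -rn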

Note that I have sufficient memory for the execution, so I added two more conditions to the memory-limit setting in both the vecquery and vertical-param-batch files. I increased the memory limit because my vector file is larger than 2 GB, and the availability of memory is not a problem for me. The conditions now look like this:

# dumb (not flexible) memory limit setting
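# ($3 from wc is the byte count of "$vdb", so mem starts out as the file
# size in MiB, rounded; the branches below then reassign it as the byte
# limit passed to enumBuckets via -M, as seen in the log above.)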
mem=`wc "$vdb" | awk '{printf("%.0f", $3/1024/1024+0.5)}'`
if [ $mem -lt 2 ]; then
	mem=10000000
elif [ $mem -lt 5 ]; then
	mem=20000000
elif [ $mem -lt 10 ]; then
	mem=30000000
elif [ $mem -lt 20 ]; then
	mem=60000000
elif [ $mem -lt 50 ]; then
	mem=150000000
elif [ $mem -lt 100 ]; then
	mem=300000000
elif [ $mem -lt 200 ]; then
	mem=600000000
elif [ $mem -lt 500 ]; then
	mem=900000000
elif [ $mem -lt 1024 ]; then
	mem=1900000000
elif [ $mem -lt 2048 ]; then
	mem=3800000000
elif [ $mem -lt 4096 ]; then  # this condition is added by me
	mem=7600000000
elif [ $mem -lt 8192 ]; then  # this condition is added by me
	mem=15200000000
else
	echo "Error: Size of $vdb > 8G. I don't want to do it before you think of any optimization." | tee -a "$TIME_DIR/cluster_${vfile}"
	exit 1;
fi
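As an aside, rather than extending the ladder branch by branch, the byte limit could be derived from the file size directly. A minimal sketch, where the factor of 2 is my reading of the roughly 2x file-size ratio the ladder uses for large files, not a value from the original script:

# Sketch: scale the LSH memory limit (-M, in bytes) with the size of
# the vector file instead of using a fixed ladder of thresholds.
bytes=$(wc -c < "$vdb")
mem=$(( bytes * 2 ))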

The Deckard parameters are set to the following values:

  • MIN_TOKENS='50'
  • STRIDE='4'
  • SIMILARITY='0.79'
  • MAX_PROCS='40'

I have attached the log files. Please help me mitigate this problem; I need your tool for my experiments.
deckard log.zip
