Description
Hi,
I ran Deckard to detect clones on a dataset of 47k source files, but after a day of execution it failed with an error. Below are the contents of the different log files.
cluster_vdb_50_4_g9_2.50998_30_100000:

```
Clustering 'vectors/vdb_50_4_g9_2.50998_30_100000' 6.513064 ...
/home/local/SAIL/amir/tasks/RQ2/RQ2.2/Deckard/src/lsh/bin/enumBuckets -R 6.513064 -M 7600000000 -b 2 -A -f vectors/vdb_50_4_g9_2.50998_30_100000 -c -p vectors/vdb_50_4_g9_2.50998_30_100000.param > clusters/cluster_vdb_50_4_g9_2.50998_30_100000
Warning: output all clones. Takes more time...
Warning: will compute parameters
Error: the structure supports at most 2097151 points (3238525 were specified).

real    2m58.162s
user    2m50.464s
sys     0m7.492s
cluster: Possible errors occurred with LSH. Check log: times/cluster_vdb_50_4_g9_2.50998_30_100000
```
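For reference, 2097151 is exactly 2^21 - 1, so I suspect the limit comes from a fixed 21-bit point index hard-coded in the LSH implementation rather than from the -M memory bound (this is my guess; I have not checked the LSH source). The group being clustered here has 3238525 points, which exceeds that cap, so raising the memory limit alone cannot fix it:

```sh
# the cap in the error message is exactly 2^21 - 1
echo $(( (1 << 21) - 1 ))   # prints 2097151
```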
paramsetting_50_4_0.79_30:

```
paramsetting: 50 4 0.79 ...Looking for optimal parameters by Clustering 'vectors/vdb_50_4_g9_2.50998_30_100000' 6.513064 ...
/home/local/SAIL/amir/tasks/RQ2/RQ2.2/Deckard/src/lsh/bin/enumBuckets -R 6.513064 -M 7600000000 -b 2 -A -f vectors/vdb_50_4_g9_2.50998_30_100000 -c -p vectors/vdb_50_4_g9_2.50998_30_100000.param > clusters/cluster_vdb_50_4_g9_2.50998_30_100000
cluster: Possible errors occurred with LSH. Check log: times/cluster_vdb_50_4_g9_2.50998_30_100000
Error: paramsetting failure...exit.
```
grouping_50_4_2.50998_30:

```
grouping: vectors/vdb_50_4 with distance=2.50998...Total 7602630 vectors read in; 11282415 vectors dispatched into 57 ranges (actual groups may be many fewer).

real    410m12.610s
user    6m43.592s
sys     26m6.544s
Done grouping 50 4 2.50998. See groups in vectors/vdb_50_4_g[0-9]_2.50998_30
```
Note that I have sufficient memory for the execution, so I added two extra conditions to the memory-limit setting in both the vecquery and vertical-param-batch scripts. I increased the limit because my vector files are larger than 2G, and memory availability is not a problem on my machine. The conditions now look like this:
```sh
# dumb (not flexible) memory limit setting
mem=`wc "$vdb" | awk '{printf("%.0f", $3/1024/1024+0.5)}'`
if [ $mem -lt 2 ]; then
  mem=10000000
elif [ $mem -lt 5 ]; then
  mem=20000000
elif [ $mem -lt 10 ]; then
  mem=30000000
elif [ $mem -lt 20 ]; then
  mem=60000000
elif [ $mem -lt 50 ]; then
  mem=150000000
elif [ $mem -lt 100 ]; then
  mem=300000000
elif [ $mem -lt 200 ]; then
  mem=600000000
elif [ $mem -lt 500 ]; then
  mem=900000000
elif [ $mem -lt 1024 ]; then
  mem=1900000000
elif [ $mem -lt 2048 ]; then
  mem=3800000000
elif [ $mem -lt 4096 ]; then # this condition is added by me
  mem=7600000000
elif [ $mem -lt 8192 ]; then # this condition is added by me
  mem=15200000000
else
  echo "Error: Size of $vdb > 8G. I don't want to do it before you think of any optimization." | tee -a "$TIME_DIR/cluster_${vfile}"
  exit 1;
fi
```
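For example, a group file between 2G and 4G now falls into the first added branch, which matches the -M 7600000000 seen in the cluster log above (the file size itself is my inference from that flag):

```sh
# hypothetical check of which bucket the group file from the logs falls into
vdb=vectors/vdb_50_4_g9_2.50998_30_100000
mem=`wc "$vdb" | awk '{printf("%.0f", $3/1024/1024+0.5)}'`  # file size in MB, rounded
echo "$mem"  # 2048 <= mem < 4096 here selects mem=7600000000, passed to enumBuckets as -M
```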
The Deckard parameters are set to the following values:
- MIN_TOKENS='50'
- STRIDE='4'
- SIMILARITY='0.79'
- MAX_PROCS='40'
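For completeness, these are plain shell variables in Deckard's configuration script, i.e. (the exact file path may differ between versions):

```sh
# settings in the Deckard config script (path is my assumption)
MIN_TOKENS='50'
STRIDE='4'
SIMILARITY='0.79'
MAX_PROCS='40'
```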
I have attached the log files. Please help me mitigate this problem; I need your tool for my experiments.
deckard log.zip