Description
I am using Deckard with the following configuration:
MIN_TOKENS='20 30 50' # can be a sequence of integers
STRIDE='inf' # can be a sequence of integers
SIMILARITY='1' # can be a sequence of values <= 1
I set STRIDE to inf because:
- With an infinite stride, vector merging is disabled (reference).
- If stride is set to infinity, only non-overlapping and syntactically complete pieces of code (e.g., a complete if statement or a complete for statement) are considered for clones (reference).
After running Deckard, I noticed that some clusters in post_cluster_* contain code blocks with vastly different lengths. For example, in the following cluster one block spans 1057 lines while the other spans only 110:
000000042 dist:0.0 FILE src/findbugs/src/java/edu/umd/cs/findbugs/SortedBugCollection.java LINE:92:1057 NODE_KIND:0 nVARs:1 NUM_NODE:277 TBID:517 TEID:536
000000008 dist:0.0 FILE src/spotbugs/spotbugs/src/main/java/edu/umd/cs/findbugs/classfile/impl/JrtfsCodeBase.java LINE:62:110 NODE_KIND:0 nVARs:1 NUM_NODE:277 TBID:294 TEID:313
Since dist:0.0 suggests they are considered identical clones, I would like to understand why blocks of such different lengths are grouped together.
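To check how widespread this is, the cluster lines can be parsed and the block lengths compared. The following is a minimal sketch, not part of Deckard: the helper names `parse_cluster_line` and `length_ratio` are hypothetical, and it assumes that the second number in `LINE:<a>:<b>` is the block length in lines, which is how I am reading the output above. It also assumes file paths contain no spaces.

```python
import re

# Assumption: each cluster line contains "FILE <path> LINE:<start>:<length>".
# The meaning of the second LINE number (length in lines) is my reading of
# the output, not something confirmed by the Deckard documentation.
LINE_RE = re.compile(r"FILE (?P<file>\S+) LINE:(?P<start>\d+):(?P<length>\d+)")


def parse_cluster_line(line):
    """Return (file, start, length) for one cluster line, or None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return m.group("file"), int(m.group("start")), int(m.group("length"))


def length_ratio(lines):
    """Return longest/shortest block length in a cluster, or None if not computable."""
    lengths = [p[2] for p in map(parse_cluster_line, lines) if p is not None]
    if not lengths or min(lengths) == 0:
        return None
    return max(lengths) / min(lengths)


# The two lines from the cluster shown above.
cluster = [
    "000000042 dist:0.0 FILE src/findbugs/src/java/edu/umd/cs/findbugs/SortedBugCollection.java "
    "LINE:92:1057 NODE_KIND:0 nVARs:1 NUM_NODE:277 TBID:517 TEID:536",
    "000000008 dist:0.0 FILE src/spotbugs/spotbugs/src/main/java/edu/umd/cs/findbugs/classfile/impl/JrtfsCodeBase.java "
    "LINE:62:110 NODE_KIND:0 nVARs:1 NUM_NODE:277 TBID:294 TEID:313",
]

ratio = length_ratio(cluster)
print(f"length ratio: {ratio:.2f}")
```

Running this over every cluster in post_cluster_* and sorting by the ratio would show whether the example above is an outlier or typical.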
Here are the original files in the above cluster:
files.zip
I also tried setting stride=2, but the large variation in block lengths within clusters still persists.
Questions
- How does Deckard determine similarity when block lengths vary significantly?
- Could this be due to my configuration, or is it expected behavior?
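For context on the first question: as far as I understand, Deckard compares vectors of syntax-tree node counts rather than line counts, and both blocks above report NUM_NODE:277. The sketch below illustrates that general idea only. It uses Python's `ast` module as a stand-in, is not Deckard's implementation, and the function name `node_count_vector` is hypothetical. It shows two fragments with different line counts that yield identical node-count vectors; whether that is what happens in my cluster is exactly what I am unsure about.

```python
import ast
from collections import Counter


def node_count_vector(source: str) -> Counter:
    """Count AST node types in the source; a rough stand-in for a characteristic vector."""
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))


# Same statement, written compactly and spread out with blank lines and comments.
compact = "if a > b:\n    x = a\nelse:\n    x = b\n"
spread = "if a > b:\n\n    # comment\n\n    x = a\n\nelse:\n\n    # comment\n    x = b\n"

v_compact = node_count_vector(compact)
v_spread = node_count_vector(spread)

# Sum of absolute differences between the two count vectors.
distance = sum(abs(v_compact[k] - v_spread[k]) for k in set(v_compact) | set(v_spread))

print(len(compact.splitlines()), len(spread.splitlines()), distance)
```

If this is the mechanism at work, a dist of 0.0 would say nothing about how many lines each block occupies, but a tenfold difference in line count for the same node count still seems surprising to me.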