Large Variation in Block Lengths Within Clusters in post_cluster_*

## Description
I am using Deckard with the following configuration:
```
MIN_TOKENS='20 30 50'  # can be a sequence of integers
STRIDE='inf'  # can be a sequence of integers
SIMILARITY='1'  # can be a sequence of values <= 1
```

I set `STRIDE` to `inf` for two reasons:
- An infinite stride means that vector merging is disabled. [reference](https://ink.library.smu.edu.sg/cgi/viewcontent.cgi?article=2010&context=sis_research)
- With an infinite stride, only non-overlapping and syntactically complete pieces of code (e.g., a complete `if` statement or a complete `for` statement) are considered as clone candidates. [reference](https://ink.library.smu.edu.sg/cgi/viewcontent.cgi?article=1941&context=sis_research)

After running Deckard, I noticed that **some clusters in `post_cluster_*` contain code blocks with vastly different lengths**. For example, the following cluster includes one block with 1057 lines while the other has only 110 lines:
```
000000042	dist:0.0	FILE src/findbugs/src/java/edu/umd/cs/findbugs/SortedBugCollection.java LINE:92:1057 NODE_KIND:0 nVARs:1 NUM_NODE:277 TBID:517 TEID:536
000000008	dist:0.0	FILE src/spotbugs/spotbugs/src/main/java/edu/umd/cs/findbugs/classfile/impl/JrtfsCodeBase.java LINE:62:110 NODE_KIND:0 nVARs:1 NUM_NODE:277 TBID:294 TEID:313
```
Since `dist:0.0` suggests the two blocks are considered identical clones, I would like to understand why blocks of such different lengths are grouped together.
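To make the mismatch easier to spot across a whole `post_cluster_*` file, here is a minimal sketch of a script that flags clusters whose blocks differ widely in line count. It rests on several assumptions inferred only from the two lines above, not from Deckard's documentation: that `LINE:a:b` means start line `a` and `b` lines, that `TBID`/`TEID` are inclusive begin/end token IDs, and that clusters are separated by blank lines. The names `parse_clusters`, `length_ratio`, `suspicious_clusters`, and the `max_ratio` threshold are hypothetical.

```python
import re
from dataclasses import dataclass

# Field layout inferred from the example cluster lines (an assumption).
LINE_RE = re.compile(
    r"^\d+\s+dist:\S+\s+FILE\s+(?P<path>.+?)\s+"
    r"LINE:(?P<start>\d+):(?P<count>\d+)\s+.*?"
    r"NUM_NODE:(?P<nodes>\d+)\s+TBID:(?P<tbid>\d+)\s+TEID:(?P<teid>\d+)"
)


@dataclass
class Block:
    path: str
    start_line: int
    line_count: int   # second number in LINE:a:b (assumed to be a line count)
    num_node: int
    token_span: int   # TEID - TBID + 1 (assumes inclusive token IDs)


def parse_block(line):
    """Parse one cluster entry; return None for lines that do not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return Block(
        path=m.group("path"),
        start_line=int(m.group("start")),
        line_count=int(m.group("count")),
        num_node=int(m.group("nodes")),
        token_span=int(m.group("teid")) - int(m.group("tbid")) + 1,
    )


def parse_clusters(text):
    """Group consecutive entries into clusters; any other line ends a cluster."""
    clusters, current = [], []
    for line in text.splitlines():
        block = parse_block(line)
        if block is not None:
            current.append(block)
        elif current:
            clusters.append(current)
            current = []
    if current:
        clusters.append(current)
    return clusters


def length_ratio(cluster):
    """Ratio of the largest to the smallest line count in a cluster."""
    counts = [b.line_count for b in cluster]
    return max(counts) / max(1, min(counts))


def suspicious_clusters(clusters, max_ratio=3.0):
    """Return clusters whose line counts differ by more than max_ratio."""
    return [c for c in clusters if length_ratio(c) > max_ratio]
```

Under those assumptions, the cluster above has a line-count ratio of roughly 9.6, while both blocks have a token span of 20 and the same `NUM_NODE` value of 277.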

Here are the original files in the above cluster:
[files.zip](https://github.com/user-attachments/files/18948389/files.zip)

I also tried setting `STRIDE='2'`, but the large variation in block lengths within clusters persists.

## Questions
1. How does Deckard determine similarity when block lengths vary significantly?
2. Could this be due to my configuration, or is it expected behavior?
