Large Variation in Block Lengths Within Clusters in post_cluster_* #36

@Momo-Not-Emo

Description

I am using Deckard with the following configuration:

MIN_TOKENS='20 30 50'  # can be a sequence of integers
STRIDE='inf'  # can be a sequence of integers
SIMILARITY='1'  # can be a sequence of values <= 1

I set STRIDE to inf because:

  • An infinite stride means that vector merging is disabled. (reference)
  • If the stride is set to infinity, only non-overlapping and syntactically complete pieces of code (e.g., a complete if statement or a complete for statement) are considered for clones. (reference)

After running Deckard, I noticed that some clusters in post_cluster_* contain code blocks of vastly different lengths. For example, the following cluster contains one block spanning 1057 lines and another spanning only 110 lines:

000000042	dist:0.0	FILE src/findbugs/src/java/edu/umd/cs/findbugs/SortedBugCollection.java LINE:92:1057 NODE_KIND:0 nVARs:1 NUM_NODE:277 TBID:517 TEID:536
000000008	dist:0.0	FILE src/spotbugs/spotbugs/src/main/java/edu/umd/cs/findbugs/classfile/impl/JrtfsCodeBase.java LINE:62:110 NODE_KIND:0 nVARs:1 NUM_NODE:277 TBID:294 TEID:313

Since dist:0.0 suggests they are considered identical clones, I would like to understand why blocks of such different lengths are grouped together.
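
To make the mismatch easier to quantify across all clusters, here is a minimal Python sketch that parses cluster lines and compares the blocks. It is not part of Deckard. The regular expression is inferred only from the two lines quoted above, and it assumes that the second number in LINE:a:b is a line count (as read in this issue) and that TBID and TEID are begin/end token IDs. The names parse_block, line_count_ratio, and Block are hypothetical.

```python
import re
from dataclasses import dataclass

# Layout inferred from the two cluster lines quoted in this issue.
# Assumption: file paths contain no whitespace.
LINE_RE = re.compile(
    r"^(?P<id>\d+)\s+dist:(?P<dist>[\d.]+)\s+FILE\s+(?P<file>\S+)\s+"
    r"LINE:(?P<start>\d+):(?P<count>\d+)\s+.*?"
    r"NUM_NODE:(?P<nodes>\d+)\s+TBID:(?P<tbid>\d+)\s+TEID:(?P<teid>\d+)"
)


@dataclass
class Block:
    file: str
    start_line: int
    line_count: int   # assumption: second number of LINE:a:b is a line count
    num_node: int
    token_span: int   # assumption: TEID - TBID is the token distance


def parse_block(line):
    """Parse one cluster line; return None if it does not match the layout."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return Block(
        file=m.group("file"),
        start_line=int(m.group("start")),
        line_count=int(m.group("count")),
        num_node=int(m.group("nodes")),
        token_span=int(m.group("teid")) - int(m.group("tbid")),
    )


def line_count_ratio(blocks):
    """Ratio of the longest to the shortest block in a cluster, by line count."""
    counts = [b.line_count for b in blocks]
    return max(counts) / max(1, min(counts))


SAMPLE = [
    "000000042\tdist:0.0\tFILE src/findbugs/src/java/edu/umd/cs/findbugs/"
    "SortedBugCollection.java LINE:92:1057 NODE_KIND:0 nVARs:1 NUM_NODE:277 "
    "TBID:517 TEID:536",
    "000000008\tdist:0.0\tFILE src/spotbugs/spotbugs/src/main/java/edu/umd/cs/"
    "findbugs/classfile/impl/JrtfsCodeBase.java LINE:62:110 NODE_KIND:0 nVARs:1 "
    "NUM_NODE:277 TBID:294 TEID:313",
]

blocks = [parse_block(line) for line in SAMPLE]

if __name__ == "__main__":
    for b in blocks:
        print(b.file, b.line_count, b.num_node, b.token_span)
    print("line-count ratio:", round(line_count_ratio(blocks), 2))
```

Under these assumptions, the two blocks differ by roughly a factor of ten in reported line count, yet share the same NUM_NODE value and the same TEID - TBID difference. That contrast may be worth keeping in mind when interpreting the LINE field.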

Here are the original files in the above cluster:
files.zip

I also tried setting STRIDE='2', but the large variation in block lengths within clusters persisted.

Questions

  1. How does Deckard determine similarity when block lengths vary significantly?
  2. Could this be due to my configuration, or is it expected behavior?
