TeamBenign/codeEmbedding

Code Embedding and Visualization Assignment

This assignment explores code embedding techniques using CodeBERT to analyze and visualize the semantic relationships between code snippets.

Overview

In this assignment, we:

  1. Sample 5 pairs of code snippets (10 total)
  2. Generate embeddings for these code snippets using CodeBERT
  3. Visualize the embeddings using t-SNE dimensionality reduction
  4. Analyze the relationships between semantically equivalent and non-equivalent code pairs

Setup and Installation

Requirements

  • Python 3.8 or higher
  • Required packages are listed in requirements.txt

Installation

Create and activate a virtual environment:

python3 -m venv .venv
source .venv/bin/activate

Install the required dependencies using pip:

pip install -r requirements.txt

Execution Environment

This notebook was developed and tested in VS Code. It is also compatible with Google Colab.

Files Included

  • codeEmbedding.ipynb: The main Jupyter notebook containing all code and analysis
  • requirements.txt: List of required Python packages
  • readME.md: Description of the project and setup instructions
  • code_embeddings_tsne.png: t-SNE visualization of code embeddings
  • code_pair_distances.png: Bar chart visualization of distances between code pairs

Implementation Details

1. Code Snippet Selection

The implementation randomly samples 5 pairs of code snippets from the test set of the CodeXGLUE dataset:

  • Pairs 1 and 5: Semantically equivalent implementations (same functionality, different code)
  • Pairs 2-4: Non-equivalent implementations (different algorithms and functionalities)
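
The selection step can be sketched with simple random sampling. The pool below is a hypothetical stand-in for the CodeXGLUE test set (not the actual data), with each entry holding two snippets and an equivalence label:

```python
import random

# Hypothetical stand-in for the CodeXGLUE test set: each entry is
# (code_a, code_b, label), where label 1 marks a semantically equivalent pair.
pool = [
    ("def double(x):\n    return x * 2", "def twice(y):\n    return y + y", 1),
    ("def square(x):\n    return x ** 2", "def rev(s):\n    return s[::-1]", 0),
    ("def total(xs):\n    return sum(xs)", "def mx(xs):\n    return max(xs)", 0),
]

random.seed(42)  # fixed seed so the sampled pairs are reproducible
# The assignment samples 5 pairs; the toy pool here has fewer, hence min().
pairs = random.sample(pool, k=min(5, len(pool)))
```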

2. Code Embedding

We use the CodeBERT model from Hugging Face to generate 768-dimensional embeddings for each code snippet. These embeddings capture the semantic meaning of the code rather than just syntactic similarities.
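
A minimal sketch of this step, assuming the Hugging Face transformers package and the public microsoft/codebert-base checkpoint; here the [CLS] token's final hidden state serves as the snippet embedding (mean pooling over tokens is a common alternative):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")
model.eval()

def embed(code: str) -> torch.Tensor:
    """Return a 768-dimensional embedding for one code snippet."""
    inputs = tokenizer(code, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # [CLS] token (position 0) of the last hidden layer as the snippet vector
    return outputs.last_hidden_state[0, 0]

emb = embed("def add(a, b):\n    return a + b")
print(emb.shape)  # torch.Size([768])
```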

3. Visualization

t-SNE (t-distributed Stochastic Neighbor Embedding) is used to reduce the 768-dimensional embeddings to 2D space for visualization. The visualization shows:

  • The relative positions of code snippets in the embedding space
  • Pairs connected by solid lines (semantically equivalent) or dashed lines (non-equivalent)
  • Color-coded pairs for clear identification

A separate bar chart visualization shows the distances between code pairs in the t-SNE space.
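
The two visualizations can be sketched as follows. Random vectors stand in for the actual 10 CodeBERT embeddings, and the even/odd row layout (rows 2i and 2i+1 forming pair i) is an assumption about how the snippets are ordered:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10, 768))  # stand-in for the 10 CodeBERT vectors

# t-SNE perplexity must be below the sample count (10), so use a small value.
points = TSNE(n_components=2, perplexity=3, init="pca",
              random_state=0).fit_transform(embeddings)

# Distance between the two snippets of each pair in the 2D t-SNE space,
# assuming rows 2i and 2i+1 hold the two members of pair i; these values
# feed the bar chart.
pair_distances = np.linalg.norm(points[0::2] - points[1::2], axis=1)
print(points.shape, pair_distances.shape)  # (10, 2) (5,)
```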

Key Findings

Based on our t-SNE visualization of the CodeBERT embeddings for the 5 pairs of code snippets, we can make several observations:

  1. Semantic Equivalence Relationship: Our tested code pairs show an unusual result: the average t-SNE distance for semantically equivalent pairs (60.80) is larger than for non-equivalent pairs (30.05). This anomaly is driven almost entirely by pair 1, which deviates the most. Setting that outlier aside, CodeBERT effectively captures the semantic meaning of code rather than just syntactic similarity.

  2. Algorithm Differentiation: CodeBERT can distinguish between algorithms with different computational complexities and approaches, even when they share some syntactic structure. Pair 1 is the exception: although both snippets involve file I/O and use similar buffering techniques, their core functionality differs. One performs ZIP extraction, while the other performs simple file copying.

  3. Limitations: While the model generally performs well, there might be cases where the distance in embedding space doesn't perfectly correlate with semantic equivalence. This could be due to various factors including the limited context window of the model, or nuanced differences in the implementations that affect the embedding.

These findings demonstrate that CodeBERT embeddings, when visualized using t-SNE, can effectively represent the semantic relationships between code snippets, making it a valuable tool for code similarity analysis and clone detection tasks.

Conclusion

CodeBERT embeddings, when visualized with t-SNE, provide valuable insights into the semantic relationships between code snippets. This approach shows promise for applications like code clone detection, plagiarism checking, and automated code understanding.

The embedding space demonstrates meaningful clustering of similar code implementations while separating fundamentally different algorithms, indicating that the model captures both semantic and functional aspects of code.
