This assignment explores code embedding techniques using CodeBERT to analyze and visualize the semantic relationships between code snippets.
In this assignment, we:
- Sample 5 pairs of code snippets (10 total)
- Generate embeddings for these code snippets using CodeBERT
- Visualize the embeddings using t-SNE dimensionality reduction
- Analyze the relationships between semantically equivalent and non-equivalent code pairs
- Python 3.8 or higher
- Required packages are listed in requirements.txt
Create and activate a virtual environment:

    python3 -m venv .venv
    source .venv/bin/activate

Install the required dependencies using pip:

    pip install -r requirements.txt
This notebook was developed and tested in VS Code. It is also compatible with Google Colab.
- codeEmbedding.ipynb: The main Jupyter notebook containing all code and analysis
- requirements.txt: List of required Python packages
- readME.md: Description of the project and setup instructions
- code_embeddings_tsne.png: t-SNE visualization of code embeddings
- code_pair_distances.png: Bar chart visualization of distances between code pairs
The implementation uses 5 randomly sampled pairs of code snippets from the test set of the CodeXGLUE dataset (see the sampling sketch after this list):
- Pairs 1 and 5: Semantically equivalent implementations (same functionality, different code)
- Pairs 2-4: Non-equivalent implementations (different algorithms and functionalities)
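The exact sampling code is in the notebook; the sketch below illustrates one way such pairs could be drawn with the Hugging Face datasets library. The dataset identifier `code_x_glue_cc_clone_detection_big_clone_bench`, the `test` split, and the field names `func1`, `func2`, and `label` are assumptions about the CodeXGLUE clone-detection (BigCloneBench) layout, not a description of the notebook's actual loading code.

```python
# Sketch only: sample 2 equivalent and 3 non-equivalent code pairs from the
# CodeXGLUE clone-detection test split. Dataset name, split, and field names
# are assumptions about the BigCloneBench task layout.
import random

from datasets import load_dataset

ds = load_dataset("code_x_glue_cc_clone_detection_big_clone_bench", split="test")

random.seed(42)                      # fix the seed so the sampled pairs are reproducible
subset = ds.select(range(2000))      # scan only a small slice to keep this quick
equivalent = [ex for ex in subset if ex["label"]]
non_equivalent = [ex for ex in subset if not ex["label"]]

eq_sample = random.sample(equivalent, 2)
neq_sample = random.sample(non_equivalent, 3)

# Order the pairs so that pairs 1 and 5 are the equivalent ones, matching the write-up.
pairs = [eq_sample[0]] + neq_sample + [eq_sample[1]]
snippets = [code for ex in pairs for code in (ex["func1"], ex["func2"])]  # 10 snippets
```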
We use the CodeBERT model from Hugging Face to generate 768-dimensional embeddings for each code snippet. These embeddings capture the semantic meaning of the code rather than just syntactic similarities.
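A minimal sketch of this embedding step is shown below, assuming `snippets` holds the 10 code strings from the sampling step above. Using the final-layer [CLS] vector as the 768-dimensional embedding is an assumption; the notebook may pool tokens differently (e.g. mean pooling).

```python
# Sketch of embedding generation with microsoft/codebert-base.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")
model.eval()

def embed(code: str) -> torch.Tensor:
    """Return a 768-dimensional CodeBERT embedding for one code snippet."""
    inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the [CLS] token of the last hidden layer as the snippet embedding.
    return outputs.last_hidden_state[:, 0, :].squeeze(0)

embeddings = torch.stack([embed(s) for s in snippets])  # shape: (10, 768)
```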
t-SNE (t-distributed Stochastic Neighbor Embedding) is used to reduce the 768-dimensional embeddings to 2D space for visualization. The visualization shows:
- The relative positions of code snippets in the embedding space
- Connected pairs with solid lines (semantically equivalent) or dashed lines (not equivalent)
- Color-coded pairs for clear identification
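A sketch of the projection and pair plot, assuming `embeddings` is the (10, 768) tensor from the previous step; t-SNE perplexity must be smaller than the number of samples (10 here), and the notebook's exact colors and styling may differ.

```python
# Sketch of the t-SNE projection and connected-pair scatter plot.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

X = embeddings.numpy()  # (10, 768) CodeBERT embeddings
points = TSNE(n_components=2, perplexity=5, random_state=42).fit_transform(X)

equivalent_pairs = {0, 4}             # pairs 1 and 5 (0-indexed)
colors = plt.cm.tab10(np.arange(5))   # one color per pair

for i in range(5):
    # Snippets 2*i and 2*i + 1 belong to pair i + 1.
    a, b = points[2 * i], points[2 * i + 1]
    style = "-" if i in equivalent_pairs else "--"  # solid = equivalent, dashed = not
    plt.plot([a[0], b[0]], [a[1], b[1]], style, color=colors[i])
    plt.scatter([a[0], b[0]], [a[1], b[1]], color=colors[i], label=f"Pair {i + 1}")

plt.legend()
plt.title("t-SNE of CodeBERT code embeddings")
plt.savefig("code_embeddings_tsne.png", dpi=150)
```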
A separate bar chart visualization shows the distances between code pairs in the t-SNE space.
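The bar chart can be produced from the same 2D points; computing the Euclidean distance between each pair's two projected points is an assumption about how the reported distances were obtained.

```python
# Sketch of the pair-distance bar chart in t-SNE space.
import matplotlib.pyplot as plt
import numpy as np

# Euclidean distance between the two t-SNE points of each pair.
distances = [float(np.linalg.norm(points[2 * i] - points[2 * i + 1])) for i in range(5)]
labels = [f"Pair {i + 1}" for i in range(5)]
bar_colors = ["tab:green" if i in {0, 4} else "tab:red" for i in range(5)]

plt.figure()
plt.bar(labels, distances, color=bar_colors)
plt.ylabel("Distance in t-SNE space")
plt.title("Distances between code pairs (green = equivalent, red = not)")
plt.savefig("code_pair_distances.png", dpi=150)
```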
Based on our t-SNE visualization of the CodeBERT embeddings for the 5 pairs of code snippets, we can make several observations:
- Semantic Equivalence Relationship: We observe an unusual pattern in the tested code pairs: the average t-SNE distances for semantically equivalent and non-equivalent pairs are 60.80 and 30.05, respectively. This inversion is driven largely by pair 1, which deviates the most. Apart from this outlier, CodeBERT effectively captures the semantic meaning of code rather than just syntactic similarities.
- Algorithm Differentiation: CodeBERT can distinguish between algorithms with different computational complexities and approaches, even when they share some syntactic structures, with pair 1 as the exception. While both snippets in pair 1 involve file I/O and use similar buffering techniques, their core functionality differs: one performs ZIP extraction, the other simple file copying.
- Limitations: While the model generally performs well, there are cases where distance in the embedding space does not correlate perfectly with semantic equivalence. This could stem from several factors, including the model's limited context window or nuanced differences between implementations that affect the embedding.
These findings demonstrate that CodeBERT embeddings, when visualized using t-SNE, can effectively represent the semantic relationships between code snippets, making it a valuable tool for code similarity analysis and clone detection tasks.
CodeBERT embeddings, when visualized with t-SNE, provide valuable insights into the semantic relationships between code snippets. This approach shows promise for applications like code clone detection, plagiarism checking, and automated code understanding.
The embedding space demonstrates meaningful clustering of similar code implementations while separating fundamentally different algorithms, indicating that the model captures both semantic and functional aspects of code.
- CodeBERT: https://huggingface.co/microsoft/codebert-base
- t-SNE: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
- CodeXGLUE benchmark: https://github.com/microsoft/CodeXGLUE