This assignment explores code embedding techniques using CodeBERT to analyze and visualize the semantic relationships between code snippets.
In this assignment, we:
- Sample 5 pairs of code snippets (10 total)
- Generate embeddings for these code snippets using CodeBERT
- Visualize the embeddings using t-SNE dimensionality reduction
- Analyze the relationships between semantically equivalent and non-equivalent code pairs
- Python 3.8 or higher
- Required packages are listed in requirements.txt
Create and activate a virtual environment:

    python3 -m venv .venv
    source .venv/bin/activate

Install the required dependencies using pip:

    pip install -r requirements.txt
This notebook was developed and tested in VS Code. It is also compatible with Google Colab.
- codeEmbedding.ipynb: The main Jupyter notebook containing all code and analysis
- requirements.txt: List of required Python packages
- readME.md: Description of the project and setup instructions
- code_embeddings_tsne.png: t-SNE visualization of code embeddings
- code_pair_distances.png: Bar chart visualization of distances between code pairs
The implementation uses 5 randomly sampled pairs of code snippets from the test set of the CodeXGLUE dataset (see the sampling sketch after this list):
- Pairs 1 and 5: Semantically equivalent implementations (same functionality, different code)
- Pairs 2-4: Non-equivalent implementations (different algorithms and functionalities)
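The exact sampling code is in the notebook; the sketch below illustrates one way such pairs could be drawn with the Hugging Face datasets library. The dataset identifier `code_x_glue_cc_clone_detection_big_clone_bench`, the `test` split, and the field names `func1`, `func2`, and `label` are assumptions about the CodeXGLUE clone-detection (BigCloneBench) layout, not a description of the notebook's actual loading code.

```python
# Sketch only: sample 2 equivalent and 3 non-equivalent code pairs from the
# CodeXGLUE clone-detection test split. Dataset name, split, and field names
# are assumptions about the BigCloneBench task layout.
import random

from datasets import load_dataset

ds = load_dataset("code_x_glue_cc_clone_detection_big_clone_bench", split="test")

random.seed(42)                      # fix the seed so the sampled pairs are reproducible
subset = ds.select(range(2000))      # scan only a small slice to keep this quick
equivalent = [ex for ex in subset if ex["label"]]
non_equivalent = [ex for ex in subset if not ex["label"]]

eq_sample = random.sample(equivalent, 2)
neq_sample = random.sample(non_equivalent, 3)

# Order the pairs so that pairs 1 and 5 are the equivalent ones, matching the write-up.
pairs = [eq_sample[0]] + neq_sample + [eq_sample[1]]
snippets = [code for ex in pairs for code in (ex["func1"], ex["func2"])]  # 10 snippets
```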
We use the CodeBERT model from Hugging Face to generate 768-dimensional embeddings for each code snippet. These embeddings capture the semantic meaning of the code rather than just syntactic similarities.
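A minimal sketch of this embedding step is shown below, assuming `snippets` holds the 10 code strings from the sampling step above. Using the final-layer [CLS] vector as the 768-dimensional embedding is an assumption; the notebook may pool tokens differently (e.g. mean pooling).

```python
# Sketch of embedding generation with microsoft/codebert-base.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")
model.eval()

def embed(code: str) -> torch.Tensor:
    """Return a 768-dimensional CodeBERT embedding for one code snippet."""
    inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the [CLS] token of the last hidden layer as the snippet embedding.
    return outputs.last_hidden_state[:, 0, :].squeeze(0)

embeddings = torch.stack([embed(s) for s in snippets])  # shape: (10, 768)
```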
t-SNE (t-distributed Stochastic Neighbor Embedding) is used to reduce the 768-dimensional embeddings to 2D space for visualization. The visualization shows:
- The relative positions of code snippets in the embedding space
- Connected pairs with solid lines (semantically equivalent) or dashed lines (not equivalent)
- Color-coded pairs for clear identification
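A sketch of the projection and pair plot, assuming `embeddings` is the (10, 768) tensor from the previous step; t-SNE perplexity must be smaller than the number of samples (10 here), and the notebook's exact colors and styling may differ.

```python
# Sketch of the t-SNE projection and connected-pair scatter plot.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

X = embeddings.numpy()  # (10, 768) CodeBERT embeddings
points = TSNE(n_components=2, perplexity=5, random_state=42).fit_transform(X)

equivalent_pairs = {0, 4}             # pairs 1 and 5 (0-indexed)
colors = plt.cm.tab10(np.arange(5))   # one color per pair

for i in range(5):
    # Snippets 2*i and 2*i + 1 belong to pair i + 1.
    a, b = points[2 * i], points[2 * i + 1]
    style = "-" if i in equivalent_pairs else "--"  # solid = equivalent, dashed = not
    plt.plot([a[0], b[0]], [a[1], b[1]], style, color=colors[i])
    plt.scatter([a[0], b[0]], [a[1], b[1]], color=colors[i], label=f"Pair {i + 1}")

plt.legend()
plt.title("t-SNE of CodeBERT code embeddings")
plt.savefig("code_embeddings_tsne.png", dpi=150)
```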
A separate bar chart visualization shows the distances between code pairs in the t-SNE space.
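The bar chart can be produced from the same 2D points; computing the Euclidean distance between each pair's two projected points is an assumption about how the reported distances were obtained.

```python
# Sketch of the pair-distance bar chart in t-SNE space.
import matplotlib.pyplot as plt
import numpy as np

# Euclidean distance between the two t-SNE points of each pair.
distances = [float(np.linalg.norm(points[2 * i] - points[2 * i + 1])) for i in range(5)]
labels = [f"Pair {i + 1}" for i in range(5)]
bar_colors = ["tab:green" if i in {0, 4} else "tab:red" for i in range(5)]

plt.figure()
plt.bar(labels, distances, color=bar_colors)
plt.ylabel("Distance in t-SNE space")
plt.title("Distances between code pairs (green = equivalent, red = not)")
plt.savefig("code_pair_distances.png", dpi=150)
```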
Based on our t-SNE visualization of the CodeBERT embeddings for the 5 pairs of code snippets, we can make several observations:
- Semantic Equivalence Relationship: We observe an unusual pattern in the tested code pairs: the average t-SNE distances for semantically equivalent and non-equivalent pairs are 60.80 and 30.05, respectively. This inversion is driven largely by pair 1, which deviates the most. Apart from this outlier, CodeBERT effectively captures the semantic meaning of code rather than just syntactic similarities.
- Algorithm Differentiation: CodeBERT can distinguish between algorithms with different computational complexities and approaches, even when they share some syntactic structures, with pair 1 as the exception. While both snippets in pair 1 involve file I/O and use similar buffering techniques, their core functionality differs: one performs ZIP extraction, the other simple file copying.
- Limitations: While the model generally performs well, there are cases where distance in the embedding space does not correlate perfectly with semantic equivalence. This could stem from several factors, including the model's limited context window or nuanced differences between implementations that affect the embedding.
These findings demonstrate that CodeBERT embeddings, when visualized using t-SNE, can effectively represent the semantic relationships between code snippets, making it a valuable tool for code similarity analysis and clone detection tasks.
CodeBERT embeddings, when visualized with t-SNE, provide valuable insights into the semantic relationships between code snippets. This approach shows promise for applications like code clone detection, plagiarism checking, and automated code understanding.
The embedding space demonstrates meaningful clustering of similar code implementations while separating fundamentally different algorithms, indicating that the model captures both semantic and functional aspects of code.
- CodeBERT: https://huggingface.co/microsoft/codebert-base
- t-SNE: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
- CodeXGLUE benchmark: https://github.com/microsoft/CodeXGLUE