SimEngine

Overview

SimEngine computes similarity scores between two lists of strings using several similarity metrics, including cosine similarity, Jaccard similarity, Jensen-Shannon divergence, and Jaccard similarity with edit distance.

Project Structure

- metrics.py: Contains multiple similarity and distance metrics:
  - Cosine Similarity
  - Jaccard Similarity
  - Jensen-Shannon Divergence and Distance
  - Jaccard Similarity with Edit Distance
- similarity_engine.py: Central file containing the SimilarityEngine class, which processes inputs and computes similarity scores.
- utils.py: Utility functions, including:
  - SimilarityDict dataclass
  - Batch data generator
  - Functions to save results to Excel
- preprocessing.py: Handles data preprocessing with preprocessors such as:
  - HardPreprocessor
  - TFIDFPreprocessor
  - ArabertPreprocessor
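
For readers unfamiliar with the metrics named under metrics.py, here is a minimal standalone sketch of three of them in plain Python. It is not the code from metrics.py: the function names, signatures, and the conventions for empty or zero inputs are assumptions made for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)


def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union of two token sets."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(set_a & set_b) / len(set_a | set_b)


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) of two discrete distributions."""
    # Mixture distribution: the pointwise average of p and q.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with x_i == 0 contribute nothing.
        return sum(xi * math.log2(xi / yi) for xi, yi in zip(x, y) if xi > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def jensen_shannon_distance(p, q):
    """Square root of the divergence, which is a proper distance metric."""
    return math.sqrt(jensen_shannon_divergence(p, q))


if __name__ == "__main__":
    tokens_1 = "this is a sample string".split()
    tokens_2 = "a different sample string".split()
    print(jaccard_similarity(tokens_1, tokens_2))  # 3 shared tokens of 6 total -> 0.5
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors -> 0.0
```

With log base 2, the Jensen-Shannon divergence lies between 0 (identical distributions) and 1 (distributions with no overlap), which makes it easy to compare against the other scores.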

Getting Started

Setup: Install the dependencies listed in requirements.txt:
```
pip install -r requirements.txt
```
Usage:

3.1. Quick Usage

    from SimEngine.models.embedding import EmbeddingInterface
    from SimEngine.models.ner import NERInterface
    from SimEngine.similarity_engine import SimilarityEngine

    list1 = ["This is a sample string.", "Another example."]
    list2 = ["A different sample string.", "Yet another example."]
    
    # Initialize the embedding models
    embedding = EmbeddingInterface()
    
    # Initialize the NER models
    ner = NERInterface()   
    
    # Initialize the similarity engine
    engine = SimilarityEngine(embedding_interface=embedding, ner_interface=ner)
    
    # Fit the similarity engine
    sim_dict = engine.fit(x1=list1, x2=list2)

3.2. Detailed Usage

    from sklearn.feature_extraction.text import TfidfVectorizer

    from SimEngine.models.embedding import CAMeL, AraBERTv2, MARBERT, FastTextArabicEmbedder, TFIDFEmbedder, EmbeddingInterface
    from SimEngine.models.ner import Hatmimoha, NERInterface
    from SimEngine.preprocessing import TFIDFPreprocessor, ArabertPreprocessor, HardPreprocessor
    from SimEngine.similarity_engine import SimilarityEngine

    list1 = ["This is a sample string.", "Another example."]
    list2 = ["A different sample string.", "Yet another example."]

    # Prepare the word weights for FastTextArabicEmbedder:
    # fit TF-IDF on both lists and map each vocabulary term to its IDF value
    tf_idf = TfidfVectorizer()
    tf_idf = tf_idf.fit(list1 + list2)
    word_weights = dict(zip(tf_idf.get_feature_names_out(), tf_idf.idf_))

    # Initialize the embedding models
    embedding = EmbeddingInterface(
        embedding_model=[
            CAMeL(pooling_strategy='mean'),
            FastTextArabicEmbedder(word_weights=word_weights, pooling_strategy='max'),
        ],
        similarity_metric='cosine',
        weight=0.85,
    )

    # Initialize the NER models
    ner = NERInterface(
        ner_model=Hatmimoha(),
        weight=0.15,
        similarity_metric='jaccard_edit',
    )

    # Initialize the similarity engine
    engine = SimilarityEngine(
        embedding_interface=embedding,          # Embedding models to use
        ner_interface=ner,                      # NER models to use
        preprocessing=[TFIDFPreprocessor()],    # Preprocessing techniques to use
        threshold=0.80,                         # Min similarity score to consider
        top_k=10,                               # Return top k similar entries
    )

    # Fit the similarity engine on the two lists
    sim_dict = engine.fit(x1=list1, x2=list2)
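
The weight, threshold, and top_k parameters above suggest that the embedding and NER scores are blended and the results then filtered. This README does not show how SimilarityEngine does this internally, so the following is only a hypothetical sketch of one plausible reading: a weighted sum of the two scores, a minimum-score cutoff, then a top-k selection. The helper names combine_scores and select_matches are invented here and are not part of SimEngine.

```python
def combine_scores(embedding_score, ner_score, embedding_weight=0.85, ner_weight=0.15):
    """Blend two similarity scores with a weighted sum (assumed behaviour)."""
    return embedding_weight * embedding_score + ner_weight * ner_score


def select_matches(scored_candidates, threshold=0.80, top_k=10):
    """Keep (candidate, score) pairs at or above threshold, best first, at most top_k."""
    kept = [(cand, score) for cand, score in scored_candidates if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


if __name__ == "__main__":
    # Hypothetical per-candidate (embedding score, NER score) pairs.
    raw = {
        "A different sample string.": (0.92, 0.60),
        "Yet another example.": (0.40, 0.10),
    }
    blended = [(text, combine_scores(emb, ner)) for text, (emb, ner) in raw.items()]
    for text, score in select_matches(blended):
        print(f"{score:.3f}  {text}")
```

Under this reading, weights that sum to 1 keep the blended score in the same range as its inputs, so the threshold stays comparable across configurations.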
