To get the best results, use the following data pipeline:

1. **Extract Text from Files:** Run `build_text_csv.py` to create `data/train_text.csv` with the actual text and labels.

   ```
   python build_text_csv.py
   ```

2. **Data Augmentation:** Run `augment_data.py` to generate `data/train_augmented.csv` with additional training samples.

   ```
   python augment_data.py
   ```

3. **Configure Training:** In `config.py`, set `TRAIN_CSV` to `'data/train_augmented.csv'` to train with the augmented data.

4. **Train and Predict:** Run `main.py` to train the model and make predictions on the test set.

   ```
   python main.py
   ```

This pipeline ensures your model is trained on the most diverse and complete data available.
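The steps above can be chained into a single convenience driver. This is a hypothetical sketch, not part of the repository; it only assumes the three scripts live in the working directory:

```python
import subprocess
import sys

# The three pipeline stages, in the order described above.
PIPELINE = ["build_text_csv.py", "augment_data.py", "main.py"]

def pipeline_commands(python=sys.executable):
    """Return the commands the pipeline would run, in order."""
    return [[python, script] for script in PIPELINE]

def run_pipeline():
    for cmd in pipeline_commands():
        # check=True stops the pipeline on the first failing stage.
        subprocess.run(cmd, check=True)
```

Running the stages separately, as shown above, works just as well; the driver merely saves retyping the commands.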
ImpostorHunt is a machine learning project designed to detect fake texts in a dataset using a fine-tuned BERT model. For each data sample, the system receives two texts: one real and one fake. The model is trained to distinguish between them and predict which text is real.
- Fine-tunes a BERT model for text pair classification.
- Automatically skips training if a saved model is found.
- Loads and predicts on all test samples in the `data/test` directory.
- Provides a summary of predictions, including counts and average confidence.
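The "skips training" behavior can be implemented with a simple existence check. A minimal sketch, assuming the model is saved with Hugging Face's `save_pretrained` (which writes a `config.json`) into `bert_model_save/`:

```python
import os

MODEL_DIR = "bert_model_save"  # the save directory used by this project

def needs_training(model_dir=MODEL_DIR):
    """True if no saved model is found, so training should run.

    A directory written by save_pretrained() contains a config.json;
    its presence is used as the marker that a trained model exists.
    """
    return not os.path.isfile(os.path.join(model_dir, "config.json"))
```

The exact check in `model_utils.py` may differ; the point is that a cheap filesystem test gates the expensive training step.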
- `main.py`: Main script for training, evaluation, and prediction.
- `model_trainer.py`: Contains the BERT training and prediction logic.
- `data_loader.py`: Handles data loading and preprocessing.
- `config.py`: Configuration for paths, model, and training parameters.
- `model_utils.py`: Utility functions for model management and result summarization.
- `data/`: Contains training and test data.
- `bert_model_save/`: Directory where the trained model and tokenizer are saved.
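Of these files, `config.py` centralizes the settings. A minimal illustrative sketch — only `TRAIN_CSV` is named in this README; every other setting here is an assumption:

```python
# config.py -- illustrative sketch. TRAIN_CSV is the one setting this
# README documents; the remaining names and values are assumptions.
TRAIN_CSV = "data/train_augmented.csv"  # or "data/train_text.csv"
TEST_DIR = "data/test"
MODEL_SAVE_DIR = "bert_model_save"
MODEL_NAME = "bert-base-uncased"

MAX_LENGTH = 512     # BERT's maximum sequence length
BATCH_SIZE = 16
EPOCHS = 3
LEARNING_RATE = 2e-5
```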
- Place your training data in `data/train` and test data in `data/test` (each article in its own subfolder with `file_1.txt` and `file_2.txt`).
- (Optional but recommended) Run `build_text_csv.py` to extract text from files and create `data/train_text.csv`: `python build_text_csv.py`
- (Optional but recommended) Run `augment_data.py` to generate augmented training data in `data/train_augmented.csv`: `python augment_data.py`
- Ensure `config.py` points to the correct training CSV (e.g., `train_augmented.csv` for best results).
- Run `main.py` to train (if needed) and predict on all test samples: `python main.py`
- The script will print predictions and a summary of results.
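As an illustration of the extraction step, walking the `data/train` layout described above could look roughly like this. The function name and CSV columns are assumptions, not the script's actual interface, and label handling is omitted:

```python
import csv
import os

def build_text_csv(train_dir="data/train", out_csv="data/train_text.csv"):
    """Sketch of the extraction step: read each article's file_1.txt /
    file_2.txt pair and write one CSV row per article. The real
    build_text_csv.py also attaches labels, which are omitted here."""
    rows = []
    for article in sorted(os.listdir(train_dir)):
        folder = os.path.join(train_dir, article)
        if not os.path.isdir(folder):
            continue

        def read(name):
            with open(os.path.join(folder, name), encoding="utf-8") as f:
                return f.read().strip()

        rows.append({"id": article,
                     "text_1": read("file_1.txt"),
                     "text_2": read("file_2.txt")})

    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "text_1", "text_2"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```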
- Python 3.8+
- PyTorch
- Transformers
- tqdm
- pandas
- nltk
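These dependencies correspond to a `requirements.txt` along the following lines (version pins are left out here and are worth adding; note that PyTorch is published on PyPI as `torch`):

```
torch
transformers
tqdm
pandas
nltk
```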
Install dependencies with:
```
pip install -r requirements.txt
```

To suppress verbose warnings from the Hugging Face Transformers library (e.g., about overflowing tokens), add this to the top of your main script:

```python
from transformers import logging
logging.set_verbosity_error()
```

Example output:

```
article_0001: file_1.txt is REAL, file_2.txt is FAKE (Confidence: 0.9876)
article_0002: file_2.txt is REAL, file_1.txt is FAKE (Confidence: 0.9123)
...
Summary: 100 total, 55 real, 45 fake, avg confidence: 0.95
```
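The summary line could be computed from the per-article predictions roughly as follows. Both the tuple format and the meaning of the real/fake counts (articles where `file_1.txt` vs. `file_2.txt` was judged real) are assumptions about how `model_utils.py` works:

```python
def summarize(predictions):
    """Sketch of the result summary.

    predictions: list of (article_id, real_file, confidence) tuples,
    where real_file names the file the model judged to be REAL.
    """
    total = len(predictions)
    file1_real = sum(1 for _, real, _ in predictions if real == "file_1.txt")
    avg_conf = sum(c for _, _, c in predictions) / total if total else 0.0
    return (f"Summary: {total} total, {file1_real} real, "
            f"{total - file1_real} fake, avg confidence: {avg_conf:.2f}")
```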
For more details, see the code and comments in each file.