To get the best results, use the following data pipeline:

1. **Extract Text from Files:** Run `build_text_csv.py` to create `data/train_text.csv` with the actual text and labels.

   ```
   python build_text_csv.py
   ```

2. **Data Augmentation:** Run `augment_data.py` to generate `data/train_augmented.csv` with additional training samples.

   ```
   python augment_data.py
   ```

3. **Configure Training:** In `config.py`, set `TRAIN_CSV` to `'data/train_augmented.csv'` to train with the augmented data.

4. **Train and Predict:** Run `main.py` to train the model and make predictions on the test set.

   ```
   python main.py
   ```

This pipeline ensures your model is trained on the most diverse and complete data available.
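The steps above can be chained into a single convenience driver. This is a hypothetical sketch, not part of the repository; it only assumes the three scripts live in the working directory:

```python
import subprocess
import sys

# The three pipeline stages, in the order described above.
PIPELINE = ["build_text_csv.py", "augment_data.py", "main.py"]

def pipeline_commands(python=sys.executable):
    """Return the commands the pipeline would run, in order."""
    return [[python, script] for script in PIPELINE]

def run_pipeline():
    for cmd in pipeline_commands():
        # check=True stops the pipeline on the first failing stage.
        subprocess.run(cmd, check=True)
```

Running the stages separately, as shown above, works just as well; the driver merely saves retyping the commands.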
ImpostorHunt is a machine learning project designed to detect fake texts in a dataset using a fine-tuned BERT model. For each data sample, the system receives two texts: one real and one fake. The model is trained to distinguish between them and predict which text is real.
- Fine-tunes a BERT model for text pair classification.
- Automatically skips training if a saved model is found.
- Loads and predicts on all test samples in the `data/test` directory.
- Provides a summary of predictions, including counts and average confidence.
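The "skips training" behavior can be implemented with a simple existence check. A minimal sketch, assuming the model is saved with Hugging Face's `save_pretrained` (which writes a `config.json`) into `bert_model_save/`:

```python
import os

MODEL_DIR = "bert_model_save"  # the save directory used by this project

def needs_training(model_dir=MODEL_DIR):
    """True if no saved model is found, so training should run.

    A directory written by save_pretrained() contains a config.json;
    its presence is used as the marker that a trained model exists.
    """
    return not os.path.isfile(os.path.join(model_dir, "config.json"))
```

The exact check in `model_utils.py` may differ; the point is that a cheap filesystem test gates the expensive training step.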
- `main.py`: Main script for training, evaluation, and prediction.
- `model_trainer.py`: Contains the BERT training and prediction logic.
- `data_loader.py`: Handles data loading and preprocessing.
- `config.py`: Configuration for paths, model, and training parameters.
- `model_utils.py`: Utility functions for model management and result summarization.
- `data/`: Contains training and test data.
- `bert_model_save/`: Directory where the trained model and tokenizer are saved.
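Of these files, `config.py` centralizes the settings. A minimal illustrative sketch — only `TRAIN_CSV` is named in this README; every other setting here is an assumption:

```python
# config.py -- illustrative sketch. TRAIN_CSV is the one setting this
# README documents; the remaining names and values are assumptions.
TRAIN_CSV = "data/train_augmented.csv"  # or "data/train_text.csv"
TEST_DIR = "data/test"
MODEL_SAVE_DIR = "bert_model_save"
MODEL_NAME = "bert-base-uncased"

MAX_LENGTH = 512     # BERT's maximum sequence length
BATCH_SIZE = 16
EPOCHS = 3
LEARNING_RATE = 2e-5
```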
- Place your training data in `data/train` and test data in `data/test` (each article in its own subfolder with `file_1.txt` and `file_2.txt`).
- (Optional but recommended) Run `build_text_csv.py` to extract text from files and create `data/train_text.csv`: `python build_text_csv.py`
- (Optional but recommended) Run `augment_data.py` to generate augmented training data in `data/train_augmented.csv`: `python augment_data.py`
- Ensure `config.py` points to the correct training CSV (e.g., `train_augmented.csv` for best results).
- Run `main.py` to train (if needed) and predict on all test samples: `python main.py`
- The script will print predictions and a summary of results.
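As an illustration of the extraction step, walking the `data/train` layout described above could look roughly like this. The function name and CSV columns are assumptions, not the script's actual interface, and label handling is omitted:

```python
import csv
import os

def build_text_csv(train_dir="data/train", out_csv="data/train_text.csv"):
    """Sketch of the extraction step: read each article's file_1.txt /
    file_2.txt pair and write one CSV row per article. The real
    build_text_csv.py also attaches labels, which are omitted here."""
    rows = []
    for article in sorted(os.listdir(train_dir)):
        folder = os.path.join(train_dir, article)
        if not os.path.isdir(folder):
            continue

        def read(name):
            with open(os.path.join(folder, name), encoding="utf-8") as f:
                return f.read().strip()

        rows.append({"id": article,
                     "text_1": read("file_1.txt"),
                     "text_2": read("file_2.txt")})

    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "text_1", "text_2"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```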
- Python 3.8+
- PyTorch
- Transformers
- tqdm
- pandas
- nltk
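These dependencies correspond to a `requirements.txt` along the following lines (version pins are left out here and are worth adding; note that PyTorch is published on PyPI as `torch`):

```
torch
transformers
tqdm
pandas
nltk
```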
Install dependencies with:
```
pip install -r requirements.txt
```

To suppress verbose warnings from the Hugging Face Transformers library (e.g., about overflowing tokens), add this to the top of your main script:

```python
from transformers import logging
logging.set_verbosity_error()
```

Example output:

```
article_0001: file_1.txt is REAL, file_2.txt is FAKE (Confidence: 0.9876)
article_0002: file_2.txt is REAL, file_1.txt is FAKE (Confidence: 0.9123)
...
Summary: 100 total, 55 real, 45 fake, avg confidence: 0.95
```
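The summary line could be computed from the per-article predictions roughly as follows. Both the tuple format and the meaning of the real/fake counts (articles where `file_1.txt` vs. `file_2.txt` was judged real) are assumptions about how `model_utils.py` works:

```python
def summarize(predictions):
    """Sketch of the result summary.

    predictions: list of (article_id, real_file, confidence) tuples,
    where real_file names the file the model judged to be REAL.
    """
    total = len(predictions)
    file1_real = sum(1 for _, real, _ in predictions if real == "file_1.txt")
    avg_conf = sum(c for _, _, c in predictions) / total if total else 0.0
    return (f"Summary: {total} total, {file1_real} real, "
            f"{total - file1_real} fake, avg confidence: {avg_conf:.2f}")
```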
For more details, see the code and comments in each file.