This project trains a BERT-based classifier to detect fake news using HuggingFace Transformers and PyTorch.
- Fake.csv – Dataset of fake news articles
- True.csv – Dataset of real news articles
- main.py – Main training and evaluation script
- data_analysis.ipynb – Notebook for dataset exploration and visualization
- bert-base-uncased/ – (Optional) Local BERT model directory (or use HuggingFace download)
Make sure you have Python 3.7+ and install the following packages:
pip install transformers datasets scikit-learn pandas numpy torch nltk

You also need to download NLTK stopwords:
import nltk
nltk.download('stopwords')
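
The README does not say how the stopword list is used, so the following is only a minimal sketch of one common use: dropping stopwords from article text before analysis. The function name `remove_stopwords` and the tokenization rule are assumptions for illustration, not taken from main.py or the notebook.

```python
import re


def remove_stopwords(text, stop_words):
    """Lowercase the text, split it into simple word tokens, and drop stopwords.

    Hypothetical helper: the regex tokenizer here is an assumption, not the
    project's actual preprocessing.
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return " ".join(tok for tok in tokens if tok not in stop_words)


# Once the NLTK data has been downloaded, the stopword set would typically be:
#   from nltk.corpus import stopwords
#   stop_words = set(stopwords.words("english"))
```

Passing the stopword set as an argument keeps the helper usable with any list, including a small hand-written one for quick checks.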
Run data_analysis.ipynb to explore and visualize the dataset. It performs the following:
- Loads and merges Fake.csv and True.csv
- Assigns labels (0 = Fake, 1 = Real)
- Samples 3000 articles for faster experimentation
- Visualizes class distribution:
  - Fake vs Real label balance (relatively balanced)
  - Distribution of subject categories by label (unbalanced, not used as feature)
- Observes that the date field contains some noisy or invalid strings (e.g., URLs), so date is excluded as a feature

The analysis helps confirm that only the text field is suitable as a classification input.
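
The loading, labeling, and sampling steps above can be sketched roughly as follows. This is an illustration under assumptions, not the notebook's actual code: the function name `build_dataset`, the `seed` value, and the choice to sample 3000 articles in total (rather than per class) are all hypothetical.

```python
import pandas as pd


def build_dataset(fake_df, true_df, n_samples=3000, seed=42):
    """Label, merge, and subsample the two article tables.

    Hypothetical helper: n_samples is treated as a total across both
    classes, and seed is an arbitrary value chosen for reproducibility.
    """
    fake = fake_df.assign(label=0)  # 0 = Fake
    real = true_df.assign(label=1)  # 1 = Real
    merged = pd.concat([fake, real], ignore_index=True)

    # Never request more rows than exist.
    n = min(n_samples, len(merged))
    sampled = merged.sample(n=n, random_state=seed).reset_index(drop=True)

    # Keep only the text field as model input; subject and date are dropped.
    return sampled[["text", "label"]]


# Typical use with the project's files:
#   data = build_dataset(pd.read_csv("Fake.csv"), pd.read_csv("True.csv"))
#   print(data["label"].value_counts())
```

Sampling after the merge keeps the class ratio of the sample close to that of the full data, which matches the observation that the labels are relatively balanced.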
- Make sure Fake.csv and True.csv are in the same folder as main.py.
- (Optional) If using a local model, ensure bert-base-uncased/ is in the same directory and modify this line in the code:
bert_name = "./bert-base-uncased" # path to your local model
Otherwise, the model will be downloaded automatically from HuggingFace.
- Run training:
python main.py

After training, the best model is saved in the ./results directory and evaluation metrics will be printed.
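
The optional local-model step above could also be handled automatically instead of editing the line by hand. The sketch below is one possible approach, assuming main.py does not already do this; `resolve_bert_name` and `HUB_MODEL_ID` are hypothetical names.

```python
import os

# Model ID used when no local copy is present (downloaded from HuggingFace).
HUB_MODEL_ID = "bert-base-uncased"


def resolve_bert_name(local_dir="./bert-base-uncased"):
    """Return the local model directory if it exists, else the Hub model ID."""
    return local_dir if os.path.isdir(local_dir) else HUB_MODEL_ID


bert_name = resolve_bert_name()
# bert_name can then be passed to the usual loaders, for example
# AutoTokenizer.from_pretrained(bert_name).
```

With this fallback the same script runs unchanged whether or not the bert-base-uncased/ directory has been placed next to main.py.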
- Training progress and validation metrics printed during training
- Final test accuracy, precision, recall, and F1 score
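
For reference, the four reported metrics can be computed from labels and predictions as sketched below. This is a plain-Python illustration of the definitions only; main.py most likely relies on scikit-learn, and treating Real (1) as the positive class is an assumption, since the README does not state which class or averaging mode is used.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels.

    Hypothetical helper; `positive` marks the class treated as positive.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    # Guard every ratio against an empty denominator.
    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Example: five articles, three of them classified correctly.
metrics = binary_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
```

In this example there are two true positives, one false positive, and one false negative, so accuracy is 0.6 while precision, recall, and F1 are all 2/3.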