This project uses a Convolutional Neural Network (CNN) built with TensorFlow to identify specific audio events from audio recordings. The model is trained on labeled .wav audio clips (e.g., a target sound vs. background noise) and can then perform inference on long-format .mp3 recordings to count the occurrences of that sound.
The core of the project involves converting audio signals into spectrograms (visual representations of sound) and training a CNN to recognize the specific visual patterns of the target audio event.
- **Audio Processing:** Uses `librosa` to load and resample audio files to 16 kHz mono.
- **Data Pipeline:** Employs `tf.data` for a highly efficient input pipeline (caching, prefetching, and batching).
- **Spectrogram Generation:** Converts audio waveforms into spectrograms using `tf.signal.stft`.
- **CNN Model:** A lightweight `tensorflow.keras.Sequential` model for binary classification of images (spectrograms).
- **Inference on Long Audio:** A robust inference workflow that:
  - Loads long `.mp3` recordings.
  - Splits them into 3-second chunks.
  - Preprocesses each chunk into a spectrogram.
  - Runs batch predictions on all chunks.
- **Smart Post-processing:** Uses `itertools.groupby` to group consecutive positive predictions, allowing the script to count distinct "audio events" rather than every 3-second chunk that contains the sound.
- **CSV Reporting:** Saves the final counts of detected events for each recording to a `results.csv` file.
- **Load Data:** The script loads positive (target sound) and negative (background/other sounds) `.wav` files from their respective folders into a `tf.data.Dataset`.
- **Preprocessing:** A `preprocess` function is mapped across the dataset. It:
  - Loads the audio wave.
  - Pads or truncates it to 3 seconds (48,000 samples).
  - Generates a spectrogram using `tf.signal.stft`.
  - Resizes the spectrogram to `(128, 128)` to be a consistent input for the CNN.
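The pad-or-truncate step above can be sketched in plain NumPy (the script itself presumably uses TensorFlow ops, but the logic is the same):

```python
import numpy as np

def pad_or_truncate(wav, target_len=48000):
    # 3 seconds at 16 kHz = 48,000 samples: truncate longer clips,
    # zero-pad shorter ones so every example has the same shape.
    wav = wav[:target_len]
    if len(wav) < target_len:
        pad = np.zeros(target_len - len(wav), dtype=wav.dtype)
        wav = np.concatenate([wav, pad])
    return wav
```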
- **Training:** The `tf.data` pipeline shuffles, batches, and feeds the spectrograms to the Keras CNN model. The model learns to distinguish between spectrograms containing the target audio event and those that do not.
- **Inference:**
  - The script loads each `.mp3` from the inference folder.
  - `tf.keras.utils.timeseries_dataset_from_array` is used to slice the long audio file into 3-second (48,000-sample) windows.
  - Each slice is preprocessed into a 128x128 spectrogram.
  - The trained model predicts on all slices in batches.
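The windowing behaviour of `tf.keras.utils.timeseries_dataset_from_array` can be mimicked in NumPy to show what the slicing produces (a sketch; the non-overlapping 48,000-sample window and stride are assumptions based on the 3-second chunks described above):

```python
import numpy as np

def make_windows(wav, window=48000, stride=48000):
    # Slide a fixed-length window over the waveform with the given
    # stride, dropping trailing samples that don't fill a full window.
    n = max((len(wav) - window) // stride + 1, 0)
    if n == 0:
        return np.empty((0, window), dtype=wav.dtype)
    return np.stack([wav[i * stride : i * stride + window] for i in range(n)])
```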
- **Post-processing & Output:**
  - Raw probability predictions are converted to `0` or `1`.
  - The `groupby` function condenses sequences like `[0, 1, 1, 1, 0, 0, 1]` into `[0, 1, 0, 1]`.
  - The total number of `1`s (distinct audio events) is summed for each file.
  - The final results are written to `results.csv`.
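The event-counting step described above can be sketched with `itertools.groupby` (the 0.5 threshold is an assumption; the script's actual cutoff may differ):

```python
from itertools import groupby

def count_events(probs, threshold=0.5):
    # Binarize the per-chunk probabilities, then collapse consecutive
    # identical values so each unbroken run of 1s counts as one event.
    binary = [int(p > threshold) for p in probs]
    condensed = [key for key, _ in groupby(binary)]
    return sum(condensed)

# [0, 1, 1, 1, 0, 0, 1] condenses to [0, 1, 0, 1] -> 2 events
```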
- Python 3.9+
- `ffmpeg` (for `librosa` to load MP3 files)
  - macOS (via Homebrew): `brew install ffmpeg`
  - Linux (via apt): `sudo apt update && sudo apt install ffmpeg`
1. Clone the repository:

   ```bash
   git clone https://github.com/Dhy4n-117/Deep-Audio-Classifier.git
   cd Deep-Audio-Classifier
   ```

2. Create a virtual environment (recommended):

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies. Create a `requirements.txt` file with the following content:

   ```text
   tensorflow
   librosa
   numpy
   matplotlib
   ```

   Then, install them:

   ```bash
   pip install -r requirements.txt
   ```
This project requires three sets of data, which you must provide in the following folder structure:
```text
Deep-Audio-Classifier/
│
├── data/
│   │
│   ├── Parsed_Capuchinbird_Clips/      # 1. POSITIVE Training Data
│   │   ├── clip1.wav
│   │   └── ...
│   │
│   ├── Parsed_Not_Capuchinbird_Clips/  # 2. NEGATIVE Training Data
│   │   ├── not_clip1.wav
│   │   └── ...
│   │
│   └── Forest Recordings/              # 3. INFERENCE Data
│       ├── recording_00.mp3
│       └── ...
│
├── classifier.py                       # The main Python script
└── ...
```
**1. Positive Training Data (`Parsed_Capuchinbird_Clips/`)**

- **Purpose:** To teach the model what your target sound *is*.
- **Format:** Short `.wav` files (ideally 2-5 seconds) that clearly contain the audio event you want to detect.
- **Label:** The script automatically assigns these a label of `1`.

**2. Negative Training Data (`Parsed_Not_Capuchinbird_Clips/`)**

- **Purpose:** To teach the model what your target sound is *not*. This is just as important!
- **Format:** Short `.wav` files containing background noise, other similar-sounding (but incorrect) events, or silence.
- **Label:** The script automatically assigns these a label of `0`.

**3. Inference Data (`Forest Recordings/`)**

- **Purpose:** This is the data you want to analyze after the model is trained.
- **Format:** Long-format `.mp3` or `.wav` files that the script will scan for the target sound.
- **Output:** The script will analyze these files and generate a `results.csv` listing the detected event counts for each file.
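For reference, pairing each file with its label before wrapping everything in a `tf.data.Dataset` can be sketched like this (the helper name and signature are illustrative; the folder names match the structure above):

```python
from pathlib import Path

def labeled_files(pos_dir="data/Parsed_Capuchinbird_Clips",
                  neg_dir="data/Parsed_Not_Capuchinbird_Clips"):
    # Positive clips get label 1 and negative clips label 0, matching
    # the labels the script assigns automatically.
    pos = [(str(p), 1) for p in sorted(Path(pos_dir).glob("*.wav"))]
    neg = [(str(p), 0) for p in sorted(Path(neg_dir).glob("*.wav"))]
    return pos + neg
```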
**Note:** The folder names `Parsed_Capuchinbird_Clips` and `Parsed_Not_Capuchinbird_Clips` are hardcoded in the script. You must use these exact folder names for your positive and negative samples, even if your target sound isn't a bird.
Once your data is in place and dependencies are installed, simply run the script:

```bash
python classifier.py
```

The script will train the model, plot the training history, and then create a `results.csv` file in your root directory.