DGMO: Training-Free Audio Source Separation through Diffusion-Guided Mask Optimization (Interspeech 2025)
by Geonyoung Lee*, Geonhee Han*, Paul Hongsuck Seo
*Equal contribution
This is the official repository for our Interspeech 2025 paper:
DGMO: Training-Free Audio Source Separation through Diffusion-Guided Mask Optimization.
We propose a novel training-free framework that enables zero-shot language-queried audio source separation by repurposing pretrained text-to-audio diffusion models. DGMO refines magnitude spectrogram masks at test time, guided by diffusion-generated references.
DGMO consists of two key modules:
- Reference Generation: Uses DDIM inversion to generate query-conditioned audio references with pretrained diffusion models.
- Mask Optimization: Learns a spectrogram mask aligned to the reference, enabling faithful extraction of the target sound from the input mixture.
Unlike traditional LASS approaches, DGMO requires no training and generalizes across datasets with only test-time optimization.
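To make the mask-optimization idea concrete, here is a minimal NumPy sketch of the core objective: learn a mask `m` in [0, 1] so that `m * X` (masked mixture magnitude) matches a reference magnitude `R`. This is an illustration only; the actual DGMO implementation operates on diffusion-generated references with its own loss and optimizer, and the function name and shapes below are our assumptions, not the repository's API.

```python
import numpy as np

def optimize_mask(mixture_mag, reference_mag, steps=500, lr=0.05):
    """Projected gradient descent on ||m * X - R||^2 with m clipped to [0, 1].

    mixture_mag (X) and reference_mag (R) are magnitude spectrograms
    of shape (freq_bins, frames). Hypothetical helper for illustration.
    """
    mask = np.full_like(mixture_mag, 0.5)   # start from a neutral mask
    for _ in range(steps):
        residual = mask * mixture_mag - reference_mag
        grad = 2.0 * residual * mixture_mag  # d/dm of the squared error
        mask -= lr * grad
        np.clip(mask, 0.0, 1.0, out=mask)    # keep the mask a valid gain
    return mask

# Toy example: the "reference" keeps only the top two frequency bins.
rng = np.random.default_rng(0)
X = rng.uniform(0.5, 1.0, size=(4, 8))      # mixture magnitude
R = np.zeros_like(X)
R[:2] = X[:2]                               # target source occupies bins 0-1
m = optimize_mask(X, R)
separated = m * X                           # masked spectrogram of the target
```

In the real pipeline, the optimized mask is applied to the mixture's magnitude spectrogram and the result is inverted back to a waveform with the mixture phase.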
We recommend using conda to create a clean environment:
```bash
conda create -n dgmo python=3.10 -y
conda activate dgmo
pip install -r requirements.txt
```

Make sure you have `ffmpeg` installed if you work with audio files:

```bash
conda install -c conda-forge ffmpeg
```

You can perform source separation using DGMO with a simple shell script.
Modify the following variables in the script:
```bash
# inference.sh

# Input mixture path
MIX_PATH="./data/samples/dog_barking_and_cat_meowing.wav"

# Text queries (e.g., sources you want to extract)
TEXTS=("dog barking" "cat meowing")
```

Each text query corresponds to a target sound to be separated.
Run the script as follows:

```bash
bash inference.sh
```

This will:

- Run DGMO inference for each query
- Save the separated audio as `.wav` files
- Create a timestamped directory for organized output (e.g., `./results/run_20250607_170502/`)
Our implementation builds on several open-source projects including AudioLDM, Auffusion, and Peekaboo. We sincerely thank the authors for their contributions.
This project includes components licensed under CC BY-NC-SA 4.0.
See LICENSE for full terms.
Specifically, we incorporate ideas and/or pretrained models from the projects listed above.
🔒 Note: This project is for non-commercial research and educational use only, as required by the licenses of the incorporated models.
If you find our work useful in your research, please consider citing:
```bibtex
@inproceedings{lee25g_interspeech,
  title     = {{DGMO}: Training-Free Audio Source Separation through Diffusion-Guided Mask Optimization},
  author    = {Geonyoung Lee and Geonhee Han and Paul Hongsuck Seo},
  year      = {2025},
  booktitle = {Interspeech 2025},
  pages     = {4983--4987},
  doi       = {10.21437/Interspeech.2025-840},
  issn      = {2958-1796},
}
```


