We propose SemanticVocoder, which generates waveforms directly from semantic latents. Its core advantages are:
- Enables the audio generation framework to operate in the semantic latent space, while eliminating any reliance on VAE modules and mitigating their adverse effects;
- Empowers our text-to-audio system to achieve strong performance on AudioCaps, with a Fréchet Distance of 12.823 and a Fréchet Audio Distance of 1.709;
- Allows the two-stage pipeline (text-to-latent & latent-to-waveform) to be independently trained with semantic latents as the anchor, supporting plug-and-play deployment;
- Bridges semantic latents and generative tasks, enabling semantic latents to support unified modeling for both audio generation and audio understanding.
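The decoupling described above can be illustrated with a conceptual sketch: the two stages meet only at the semantic latent, so either side can be retrained or swapped independently. All class and method names below are illustrative stubs of our own, not the repo's actual API:

```python
# Conceptual sketch only: both stages agree on the semantic latent as their
# interface, so either side can be retrained or swapped independently.
class TextToLatent:
    """Stage 1 stub: text -> semantic latent (here, just token lengths)."""
    def generate(self, caption: str) -> list[int]:
        return [len(tok) for tok in caption.split()]

class LatentToWaveform:
    """Stage 2 stub: semantic latent -> waveform samples."""
    def decode(self, latent: list[int]) -> list[float]:
        return [x / 10.0 for x in latent]

stage1 = TextToLatent()
stage2 = LatentToWaveform()  # trained separately; plug-and-play replaceable
latent = stage1.generate("a dog barks twice")
audio = stage2.decode(latent)
print(len(latent), len(audio))  # one waveform value per latent frame in this stub
```

In the real system, either stage can be upgraded without retraining the other, as long as both keep the semantic latent space as their shared anchor.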
> [!IMPORTANT]
> Use `--single-branch --branch main` or `--depth=1` to avoid downloading oversized files.
```shell
git clone --single-branch --branch main https://github.com/zeyuxie29/SemanticVocoder
```

or

```shell
git clone --depth=1 https://github.com/zeyuxie29/SemanticVocoder
```

Note: you may need to adjust the PyTorch version in `build_env.sh` to match your hardware.
```shell
cd SemanticVocoder/src/inference
sh bash_scripts/build_env.sh
sh bash_scripts/infer.sh
```

Set up the environment as noted above, then execute encoding and decoding:
```python
import os
import json

import hydra
import torch
import torchaudio
import soundfile as sf
from huggingface_hub import snapshot_download

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Download the checkpoint and config from the Hugging Face Hub
PATH = snapshot_download(
    repo_id="ZeyuXie/SemanticVocoderSnapshot",
    allow_patterns=[
        "semanticVocoder_epoch-270.pt",
        "config.json",
    ],
    cache_dir=None,
)

with open(f"{PATH}/config.json", "r") as f:
    model_config = json.load(f)
autoencoder = hydra.utils.instantiate(model_config["model"]["autoencoder"])
sample_rate = autoencoder.sample_rate

model_path = f"{PATH}/semanticVocoder_epoch-270.pt"
autoencoder._load_checkpoint(model_path)
autoencoder.eval()
autoencoder.to(device)

test_audio = "data/audiocaps/test/Y_C2HinL8VlM.wav"
test_output = "test_output/Y_C2HinL8VlM.wav"
waveform, sr = torchaudio.load(test_audio)

# Resample if needed
if sr != sample_rate:
    print(f"Resampling from {sr} Hz to {sample_rate} Hz")
    waveform = torchaudio.functional.resample(
        waveform, orig_freq=sr, new_freq=sample_rate
    )
waveform = waveform.to(device)

# Encode
waveform_lengths = torch.tensor([waveform.shape[-1]], device=device)
z, z_mask = autoencoder.encode(waveform, waveform_lengths)
print(f"Latent shape: {z.shape}")

# Decode
recon = autoencoder.decode(z, vocoder_steps=200)
print(f"Reconstructed waveform shape: {recon.shape}")

# Ensure the output directory exists before writing
os.makedirs(os.path.dirname(test_output), exist_ok=True)
sf.write(test_output, recon.squeeze().cpu().numpy(), sample_rate)
print(f"Saved reconstructed audio to {test_output}")
```

- Add demo page
- Release text-to-audio generation inference code and usage instructions
- Release vocoder inference module (responsible for encoding latent representations and decoding)
- Release vocoder training code
- Release text-to-audio generation training code
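To sanity-check a reconstruction produced by the inference script above, you can compare it with the original clip; one common choice is scale-invariant SNR. A minimal NumPy sketch (the `si_snr` helper is our own, not part of this repo):

```python
import numpy as np

def si_snr(ref: np.ndarray, est: np.ndarray, eps: float = 1e-9) -> float:
    """Scale-invariant SNR (dB) of an estimated signal against a reference."""
    ref = ref - ref.mean()
    est = est - est.mean()
    # Project the estimate onto the reference to split it into signal and error
    s_target = (est @ ref) / (ref @ ref + eps) * ref
    e_noise = est - s_target
    return 10.0 * np.log10((s_target @ s_target) / (e_noise @ e_noise + eps))

# Example: a lightly corrupted 440 Hz tone still scores well above 20 dB
t = np.linspace(0.0, 1.0, 16000, endpoint=False)
clean = np.sin(2 * np.pi * 440.0 * t)
noisy = clean + 0.01 * np.random.default_rng(0).normal(size=clean.shape)
print(f"SI-SNR: {si_snr(clean, noisy):.1f} dB")
```

In practice you would load both waveforms (e.g. with `soundfile`), trim them to the same length, and pass them to `si_snr`; because the metric is scale-invariant, a globally rescaled reconstruction is not penalized.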