We propose SemanticVocoder, which generates waveforms directly from semantic latents. Its core advantages are:
- Enables the audio generation framework to operate in the semantic latent space, while eliminating any reliance on VAE modules and mitigating their adverse effects;
- Empowers our text-to-audio system to achieve strong performance on AudioCaps, with a Fréchet Distance of 12.823 and a Fréchet Audio Distance of 1.709;
- Allows the two-stage pipeline (text-to-latent & latent-to-waveform) to be independently trained with semantic latents as the anchor, supporting plug-and-play deployment;
- Bridges semantic latents and generative tasks, enabling semantic latents to support unified modeling for both audio generation and audio understanding.
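The decoupling described above can be illustrated with a conceptual sketch: the two stages meet only at the semantic latent, so either side can be retrained or swapped independently. All class and method names below are illustrative stubs of our own, not the repo's actual API:

```python
# Conceptual sketch only: both stages agree on the semantic latent as their
# interface, so either side can be retrained or swapped independently.
class TextToLatent:
    """Stage 1 stub: text -> semantic latent (here, just token lengths)."""
    def generate(self, caption: str) -> list[int]:
        return [len(tok) for tok in caption.split()]

class LatentToWaveform:
    """Stage 2 stub: semantic latent -> waveform samples."""
    def decode(self, latent: list[int]) -> list[float]:
        return [x / 10.0 for x in latent]

stage1 = TextToLatent()
stage2 = LatentToWaveform()  # trained separately; plug-and-play replaceable
latent = stage1.generate("a dog barks twice")
audio = stage2.decode(latent)
print(len(latent), len(audio))  # one waveform value per latent frame in this stub
```

In the real system, either stage can be upgraded without retraining the other, as long as both keep the semantic latent space as their shared anchor.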
> [!IMPORTANT]
> Use `--single-branch --branch main` or `--depth=1` to avoid downloading oversized files.
```shell
git clone --single-branch --branch main https://github.com/zeyuxie29/SemanticVocoder
```

or

```shell
git clone --depth=1 https://github.com/zeyuxie29/SemanticVocoder
```

Note: you may need to adjust the PyTorch version in `build_env.sh` to match your hardware.
```shell
cd SemanticVocoder/src/inference
sh bash_scripts/build_env.sh
sh bash_scripts/infer.sh
```

Set up the environment as noted above, then execute encoding and decoding:
```python
import os
import json

import hydra
import torch
import torchaudio
import soundfile as sf
from huggingface_hub import snapshot_download

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Download the checkpoint and config from the Hugging Face Hub
PATH = snapshot_download(
    repo_id="ZeyuXie/SemanticVocoderSnapshot",
    allow_patterns=[
        "semanticVocoder_epoch-270.pt",
        "config.json",
    ],
    cache_dir=None,
)

with open(f"{PATH}/config.json", "r") as f:
    model_config = json.load(f)
autoencoder = hydra.utils.instantiate(model_config["model"]["autoencoder"])
sample_rate = autoencoder.sample_rate

model_path = f"{PATH}/semanticVocoder_epoch-270.pt"
autoencoder._load_checkpoint(model_path)
autoencoder.eval()
autoencoder.to(device)

test_audio = "data/audiocaps/test/Y_C2HinL8VlM.wav"
test_output = "test_output/Y_C2HinL8VlM.wav"
waveform, sr = torchaudio.load(test_audio)

# Resample if needed
if sr != sample_rate:
    print(f"Resampling from {sr} Hz to {sample_rate} Hz")
    waveform = torchaudio.functional.resample(
        waveform, orig_freq=sr, new_freq=sample_rate
    )
waveform = waveform.to(device)

# Encode
waveform_lengths = torch.tensor([waveform.shape[-1]], device=device)
z, z_mask = autoencoder.encode(waveform, waveform_lengths)
print(f"Latent shape: {z.shape}")

# Decode
recon = autoencoder.decode(z, vocoder_steps=200)
print(f"Reconstructed waveform shape: {recon.shape}")

# Ensure the output directory exists before writing
os.makedirs(os.path.dirname(test_output), exist_ok=True)
sf.write(test_output, recon.squeeze().cpu().numpy(), sample_rate)
print(f"Saved reconstructed audio to {test_output}")
```

- Add demo page
- Release text-to-audio generation inference code and usage instructions
- Release vocoder inference module (responsible for encoding latent representations and decoding)
- Release vocoder training code
- Release text-to-audio generation training code
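To sanity-check a reconstruction produced by the inference script above, you can compare it with the original clip; one common choice is scale-invariant SNR. A minimal NumPy sketch (the `si_snr` helper is our own, not part of this repo):

```python
import numpy as np

def si_snr(ref: np.ndarray, est: np.ndarray, eps: float = 1e-9) -> float:
    """Scale-invariant SNR (dB) of an estimated signal against a reference."""
    ref = ref - ref.mean()
    est = est - est.mean()
    # Project the estimate onto the reference to split it into signal and error
    s_target = (est @ ref) / (ref @ ref + eps) * ref
    e_noise = est - s_target
    return 10.0 * np.log10((s_target @ s_target) / (e_noise @ e_noise + eps))

# Example: a lightly corrupted 440 Hz tone still scores well above 20 dB
t = np.linspace(0.0, 1.0, 16000, endpoint=False)
clean = np.sin(2 * np.pi * 440.0 * t)
noisy = clean + 0.01 * np.random.default_rng(0).normal(size=clean.shape)
print(f"SI-SNR: {si_snr(clean, noisy):.1f} dB")
```

In practice you would load both waveforms (e.g. with `soundfile`), trim them to the same length, and pass them to `si_snr`; because the metric is scale-invariant, a globally rescaled reconstruction is not penalized.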