
Google Speech Recognition (ASR) API Reference

The Google ASR API provides real-time speech-to-text transcription using Google Cloud Speech-to-Text. This WebSocket-based endpoint enables low-latency streaming recognition for live audio processing.

Base URL: wss://api.openmind.org

Authentication: Requires an OpenMind API key passed as a query parameter.

Endpoints Overview

| Protocol | Endpoint | Description |
| --- | --- | --- |
| WebSocket | /api/core/google/asr | Real-time speech recognition via WebSocket connection |

WebSocket Connection

Establish a persistent WebSocket connection for streaming audio data and receiving real-time transcription results.

Endpoint: wss://api.openmind.org/api/core/google/asr?api_key=YOUR_API_KEY

Connection Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| api_key | string | Yes | Your OpenMind API key for authentication |

Connection Example

# Using wscat (install with: npm install -g wscat)
wscat -c "wss://api.openmind.org/api/core/google/asr?api_key=om1_live_your_api_key"

Connection Response

Upon successful connection, you'll receive a confirmation message:

{
  "type": "connection",
  "message": "Connected to ASR service",
  "clientId": "1738713600000-a1b2c3d4e5f6g7h8"
}

Connection Errors

The connection is rejected with 401 Unauthorized when the `api_key` query parameter is missing or the key itself is invalid.

Sending Audio Data

Message Format

Send audio data as JSON messages over the WebSocket connection:
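An illustrative first message, using the fields documented in the table below (the base64 value is a placeholder):

{
  "audio": "<base64-encoded LINEAR16 chunk>",
  "rate": 16000,
  "language_code": "en-US"
}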

Message Fields

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| audio | string | Yes | - | Base64-encoded audio data (LINEAR16 format) |
| rate | integer | No | 16000 | Audio sample rate in Hz |
| language_code | string | No | "en-US" | Language code for recognition (e.g., "en-US", "es-ES", "fr-FR") |

Note the following when sending audio data:

  • The `rate` and `language_code` parameters only need to be sent with the first message; subsequent messages can contain only the `audio` field.

  • Audio must be LINEAR16 PCM encoded.

  • Maximum streaming duration is 4 minutes (240 seconds) per session.

Receiving Transcription Results

Response Format

Transcription Result:
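An illustrative result, built from the response fields documented below (values are placeholders):

{
  "asr_reply": "hello world",
  "clientId": "1738713600000-a1b2c3d4e5f6g7h8"
}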

Error Message:
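An illustrative error, using the same fields (the message text varies by cause; see Error Handling below):

{
  "type": "error",
  "message": "Invalid message format",
  "clientId": "1738713600000-a1b2c3d4e5f6g7h8"
}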

Response Fields

| Field | Type | Description |
| --- | --- | --- |
| asr_reply | string | Final transcription result for the audio segment |
| clientId | string | Unique identifier for the WebSocket session |
| type | string | Message type ("connection", "error") |
| message | string | Human-readable message for connection or error events |

Audio Specifications

Supported Audio Format

  • Encoding: LINEAR16 (16-bit PCM)

  • Sample Rate: 16000 Hz (recommended), or a custom rate specified in the first message

  • Channels: Mono (1 channel)

  • Sample Width: 2 bytes (16-bit)

Calculating Audio Length

Audio duration is calculated as:

duration_seconds = byte_count / (sample_rate × channels × bytes_per_sample)

For 16000 Hz mono LINEAR16, one second of audio is 16000 × 1 × 2 = 32,000 bytes, so a 64,000-byte buffer holds exactly 2 seconds of audio.
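The same calculation in Python (the function name is illustrative):

def audio_duration_seconds(byte_count, rate=16000, channels=1, sample_width=2):
    """Duration in seconds of raw LINEAR16 PCM audio."""
    return byte_count / (rate * channels * sample_width)

# 64,000 bytes of 16 kHz mono 16-bit audio -> 2.0 seconds
print(audio_duration_seconds(64_000))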

Usage Examples

Python Example
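A minimal streaming client sketch. It assumes the third-party `websockets` package and a headerless LINEAR16 file named `audio.raw` (see the FFmpeg example below); the endpoint, message fields, and chunk pacing follow the documentation above.

import asyncio
import base64
import json

import websockets  # pip install websockets

API_KEY = "om1_live_your_api_key"
URL = f"wss://api.openmind.org/api/core/google/asr?api_key={API_KEY}"
CHUNK_BYTES = 3200  # 100 ms of 16 kHz mono LINEAR16 (32,000 bytes per second)

async def stream_file(path):
    async with websockets.connect(URL) as ws:
        print(await ws.recv())  # connection confirmation message
        first = True
        with open(path, "rb") as f:
            while chunk := f.read(CHUNK_BYTES):
                msg = {"audio": base64.b64encode(chunk).decode("ascii")}
                if first:
                    # rate and language_code are only required on the first message
                    msg["rate"] = 16000
                    msg["language_code"] = "en-US"
                    first = False
                await ws.send(json.dumps(msg))
                await asyncio.sleep(0.1)  # pace chunks in real time
        # Print transcription results; interrupt with Ctrl+C when done
        async for raw in ws:
            result = json.loads(raw)
            if "asr_reply" in result:
                print("Transcript:", result["asr_reply"])

asyncio.run(stream_file("audio.raw"))

A production client would receive results concurrently with sending rather than afterwards; the complete integration example at the end of this page shows that pattern.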

JavaScript/Node.js Example
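The flow is identical from Node.js with any WebSocket client (for example, the `ws` package, a suggestion rather than a requirement): connect with the `api_key` query parameter, send base64-encoded LINEAR16 chunks as JSON, and handle incoming `asr_reply` messages.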

Using wscat (Command Line)
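Messages typed into wscat are sent verbatim, so after connecting you can paste a JSON message directly (the base64 payload here is a placeholder):

wscat -c "wss://api.openmind.org/api/core/google/asr?api_key=om1_live_your_api_key"
> {"audio": "<base64-encoded-chunk>", "rate": 16000, "language_code": "en-US"}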

Recording Audio for Testing

Using SoX (Sound eXchange):
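For example, the following records five seconds of 16 kHz, 16-bit mono audio from the default input device (standard SoX options; the device and filename are illustrative):

# Record 5 seconds of 16 kHz, 16-bit mono audio from the default device
sox -d -r 16000 -c 1 -b 16 recording.wav trim 0 5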

Using FFmpeg:
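FFmpeg can convert an existing file to headerless LINEAR16 suitable for streaming (the input filename is a placeholder):

# Convert any audio file to raw 16 kHz mono 16-bit PCM
ffmpeg -i input.mp3 -ar 16000 -ac 1 -f s16le -acodec pcm_s16le audio.raw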

Language Support

The ASR service supports multiple languages. Specify the language code in the first message:

| Language | Code |
| --- | --- |
| English (US) | en-US |
| English (UK) | en-GB |
| Spanish (Spain) | es-ES |
| Spanish (Latin America) | es-419 |
| French | fr-FR |
| German | de-DE |
| Italian | it-IT |
| Portuguese (Brazil) | pt-BR |
| Japanese | ja-JP |
| Korean | ko-KR |
| Chinese (Mandarin) | zh-CN |

For a complete list of supported languages, refer to the Google Cloud Speech-to-Text documentation.

Error Handling

Common Errors

Errors are returned as JSON messages with `type` set to "error". Common causes include:

  • Invalid message format - the payload could not be parsed as JSON

  • Missing audio field - the message contains no `audio` value

  • Audio decoding error - the `audio` value is not valid base64

  • Speech recognition error - the upstream recognition request failed

Handling Connection Loss

The WebSocket connection may close due to:

  • Network interruptions

  • 4-minute streaming limit reached

  • Client disconnect

  • Server errors

Implement reconnection logic in your client:
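A sketch of one approach, with exponential backoff (the `handler` callback is hypothetical; `websockets.ConnectionClosed` is the exception raised by the third-party `websockets` package):

import asyncio

import websockets

async def run_with_reconnect(url, handler, max_backoff=30):
    """Keep a WebSocket session alive, reconnecting with exponential backoff."""
    backoff = 1
    while True:
        try:
            async with websockets.connect(url) as ws:
                backoff = 1  # reset the delay after a successful connection
                await handler(ws)  # your send/receive loop
        except (websockets.ConnectionClosed, OSError) as exc:
            print(f"Connection lost ({exc}); reconnecting in {backoff}s")
            await asyncio.sleep(backoff)
            backoff = min(backoff * 2, max_backoff)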

Session Management

Streaming Limit

Each recognition session has a maximum duration of 4 minutes (240 seconds). After this time:

  • The current stream will automatically restart

  • A new recognition session will begin

  • Audio processing continues seamlessly

Session Cleanup

When the WebSocket connection closes:

  • All buffered audio is processed

  • Final transcriptions are sent

  • Usage tracking is recorded

  • Resources are cleaned up

Client Identification

Each connection receives a unique clientId combining a millisecond timestamp with a random suffix, for example:

1738713600000-a1b2c3d4e5f6g7h8

This ID is included in all server responses for tracking and debugging purposes.

Cost Calculation

Speech recognition costs are calculated based on the total audio duration processed.

Usage is tracked and billed to the API key provided in the connection URL.

Note the following about cost calculation:

  • Audio length is calculated automatically from the data sent

  • Only successfully processed audio is billed

  • Usage details are available in your OpenMind dashboard

Best Practices

Audio Quality

  • Use high-quality audio input (clear speech, minimal background noise)

  • Maintain consistent audio levels

  • Use the recommended 16000 Hz sample rate for optimal recognition

  • Send audio in consistent chunk sizes (1024-4096 bytes recommended)

Network Optimization

  • Implement exponential backoff for reconnection attempts

  • Buffer audio locally during temporary connection issues

  • Monitor WebSocket connection health

  • Handle network interruptions gracefully

Error Handling

  • Always validate the API key before establishing connections

  • Check for error messages in server responses

  • Implement retry logic for transient failures

  • Log client IDs for debugging and support requests

Performance Tips

  • Send audio chunks at regular intervals (every 50-100ms)

  • Avoid sending very large or very small chunks

  • Don't accumulate audio before sending - stream in real-time

  • Process transcription results asynchronously

Security

  • Never hardcode API keys in client-side code

  • Use environment variables for API key storage

  • Rotate API keys regularly

  • Monitor API key usage for suspicious activity

Troubleshooting

No Transcription Results

  • Verify audio format is LINEAR16 PCM

  • Check sample rate matches the rate parameter

  • Ensure audio contains clear speech

  • Verify language code matches the spoken language

Connection Issues

  • Confirm API key is valid and active

  • Check WebSocket support in your environment

  • Verify network allows WebSocket connections

  • Test connection with wscat first

Poor Recognition Quality

  • Increase audio quality/bitrate

  • Reduce background noise

  • Speak clearly and at normal pace

  • Try adjusting the language model if available

Buffer Full Warnings

If you see "Audio stream buffer full" in logs:

  • Reduce the rate of audio sending

  • Increase chunk send interval

  • Check for network congestion

  • Verify client is reading responses

Example: Complete Integration

Here's a complete example integrating microphone input, WebSocket streaming, and real-time display:
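This sketch assumes the third-party `pyaudio` and `websockets` packages and an `OM_API_KEY` environment variable (the package choices and variable name are assumptions); the endpoint and message format are as documented above.

import asyncio
import base64
import json
import os

import pyaudio     # pip install pyaudio
import websockets  # pip install websockets

API_KEY = os.environ["OM_API_KEY"]  # never hardcode keys (see Security above)
URL = f"wss://api.openmind.org/api/core/google/asr?api_key={API_KEY}"
RATE = 16000
CHUNK_FRAMES = 1024  # 2,048 bytes, about 64 ms of audio at 16 kHz

async def send_audio(ws, stream):
    first = True
    while True:
        # PyAudio's blocking read runs in a worker thread to keep the event loop free
        data = await asyncio.to_thread(
            stream.read, CHUNK_FRAMES, exception_on_overflow=False)
        msg = {"audio": base64.b64encode(data).decode("ascii")}
        if first:
            # rate and language_code are only required on the first message
            msg["rate"] = RATE
            msg["language_code"] = "en-US"
            first = False
        await ws.send(json.dumps(msg))

async def receive_results(ws):
    async for raw in ws:
        msg = json.loads(raw)
        if "asr_reply" in msg:
            print("Transcript:", msg["asr_reply"])
        elif msg.get("type") == "error":
            print("Error:", msg.get("message"), "clientId:", msg.get("clientId"))

async def main():
    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                     input=True, frames_per_buffer=CHUNK_FRAMES)
    try:
        async with websockets.connect(URL) as ws:
            print(await ws.recv())  # connection confirmation
            # Send microphone audio and display results concurrently
            await asyncio.gather(send_audio(ws, stream), receive_results(ws))
    finally:
        stream.stop_stream()
        stream.close()
        pa.terminate()

asyncio.run(main())  # run until interrupted (Ctrl+C)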

