Best Self-Hosted Alternatives to Amazon Transcribe

A curated collection of the 2 best self-hosted alternatives to Amazon Transcribe.

Amazon Transcribe is a managed automatic speech recognition (ASR) service that converts audio and video into text. It supports real-time and batch transcription, speaker diarization, timestamps, custom vocabularies, language detection, redaction, and call analytics.

Alternatives List

#1 Willow

Self-hosted voice assistant platform for ESP32 devices with on-device wake-word and command recognition, Home Assistant integration, and an optional inference server for STT/TTS/LLM.

Willow is an open-source, privacy-focused voice assistant platform designed for low-cost ESP32-S3 hardware. It provides fast on-device wake-word and command recognition and can optionally integrate with a self-hosted inference server for high-quality speech-to-text, TTS, and LLM tasks. (heywillow.io)

Key Features

  • On-device wake-word engine and voice-activity detection with configurable wake words and up to hundreds of on-device commands. (heywillow.io)
  • Integration with Home Assistant, openHAB and generic REST endpoints for home automation and custom workflows. (heywillow.io)
  • Willow Inference Server (WIS) option: a performance-optimized server that supports ASR/STT (Whisper models), TTS, and optional LLM inference over REST, WebRTC, and WebSocket transports. WIS targets CUDA GPUs for low-latency workloads and ships with deployment scripts and Docker Compose support (see the sketch after this list). (github.com)
  • Device management and OTA flashing via the Willow Application Server (WAS) with a provided Docker image to simplify onboarding. (heywillow.io)
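
As a rough illustration of the REST path mentioned above, the sketch below POSTs a short WAV file to a WIS speech-to-text endpoint and prints the returned transcript. The host, port, endpoint path, request fields, and response shape are all assumptions for illustration, not the documented WIS API; check your deployment's documentation for the actual routes and parameters.

    import requests

    # Hypothetical WIS deployment; the URL and path are assumptions for this
    # sketch, not the documented WIS API.
    WIS_ASR_URL = "https://wis.example.local:19000/api/asr"

    with open("command.wav", "rb") as audio:
        # Send raw audio for speech-to-text; the form field name is illustrative.
        response = requests.post(
            WIS_ASR_URL,
            files={"audio": ("command.wav", audio, "audio/wav")},
            verify=False,  # self-signed certificates are common on local setups
            timeout=30,
        )

    response.raise_for_status()
    # Assume a JSON body with a "text" field containing the transcript.
    print(response.json().get("text"))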

Use Cases

  • Privacy-first smart-home voice control: local wake-word and command recognition that triggers Home Assistant automations without cloud transcription.
  • On-premises speech processing: self-hosted WIS for low-latency ASR/STT and TTS for accessibility, transcription, or edge assistant applications.
  • Developer integrations: embed Willow devices into custom REST/WebRTC workflows or use WIS to add LLM-powered assistants to local networks. (github.com)

Limitations and Considerations

  • Advanced WIS features (LLM, high-quality TTS) expect CUDA-capable GPUs and NVIDIA drivers; CPU-only setups are supported but significantly slower and may disable some features. (github.com)
  • Primary device target is the ESP32-S3-BOX family; other hardware may require additional porting or tuning. (heywillow.io)

Willow combines a small-footprint device runtime with an optional, high-performance inference server to enable private, low-latency voice assistants and on-premises speech workflows. It is actively developed with documentation, Docker deployment options, and community discussion channels for support. (heywillow.io)

3k stars · 113 forks

#2 Speaches

Self-hosted, OpenAI API-compatible server for streaming transcription, translation, and speech generation using faster-whisper and TTS engines like Piper and Kokoro.

Speaches is an OpenAI API-compatible server for speech-to-text, translation, and text-to-speech, designed to be a local “model server” for voice workflows. It supports streaming and realtime interactions so applications can transcribe or generate audio with minimal integration changes.

Key Features

  • OpenAI API compatibility for integrating with existing OpenAI SDKs and tools (see the sketch after this list)
  • Streaming transcription via Server-Sent Events (SSE) for incremental results
  • Speech-to-text powered by faster-whisper, with support for transcription and translation
  • Text-to-speech using Piper and Kokoro models
  • Realtime API support for low-latency voice interactions
  • Dynamic model loading and offloading based on request parameters and inactivity
  • CPU and GPU execution support
  • Deployable with Docker and Docker Compose and designed to be highly configurable
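
Because Speaches exposes OpenAI-style endpoints, existing OpenAI SDKs can target it by changing only the base URL. The sketch below assumes a local instance at http://localhost:8000 and a faster-whisper model ID; both are illustrative defaults, so substitute the address and models your deployment actually serves.

    from openai import OpenAI

    # Point the standard OpenAI client at a local Speaches instance.
    # The base URL and API key placeholder are assumptions for this sketch.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

    with open("meeting.wav", "rb") as audio:
        # Transcribe with a faster-whisper model; the model ID is illustrative.
        transcript = client.audio.transcriptions.create(
            model="Systran/faster-whisper-small",
            file=audio,
        )

    print(transcript.text)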

Use Cases

  • Replace hosted speech APIs with a self-managed, OpenAI-compatible voice backend
  • Build realtime voice assistants that need streaming STT and fast TTS responses (a TTS sketch follows this list)
  • Batch transcription/translation pipelines for recordings with optional sentiment analysis
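
For the TTS side of those workflows, the same OpenAI-compatible client can call the speech endpoint. The model and voice IDs below (a Kokoro model and voice) are assumptions for illustration, as is the MP3 output format; query your server's model list to see what it actually exposes.

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

    # Generate speech with a TTS model; model and voice IDs are illustrative.
    speech = client.audio.speech.create(
        model="hexgrad/Kokoro-82M",
        voice="af_sky",
        input="Your meeting summary is ready.",
    )

    # Write the returned audio bytes to disk (MP3 output is assumed here).
    with open("summary.mp3", "wb") as f:
        f.write(speech.read())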

Speaches is a practical choice when you want OpenAI-style endpoints for voice features while retaining control over models and infrastructure. It fits well into existing OpenAI-oriented application stacks while focusing specifically on TTS/STT workloads.

2.8k stars · 356 forks

Why choose an open source alternative?

  • Data ownership: Keep your data on your own servers
  • No vendor lock-in: Freedom to switch or modify at any time
  • Cost savings: Reduce or eliminate subscription fees
  • Transparency: Audit the code and know exactly what's running