
Best Self-Hosted Alternatives to Play.ht

A curated collection of the 3 best self-hosted alternatives to Play.ht.

Play.ht is a cloud text-to-speech platform that converts text into realistic, multi-speaker audio. It offers voice cloning, speech styles, SSML and pronunciation controls, multi-language support, multi-voice dialogues, and a low-latency API for integration into apps, videos, podcasts, IVR systems, and localization workflows.

Alternatives List

#1 LocalAI

Run LLM, image, and audio models locally with an OpenAI-compatible API, optional GPU acceleration, and a built-in web UI for managing and testing models.


LocalAI is a self-hostable AI inference server that provides a drop-in, OpenAI-compatible REST API for running models locally or on-premises. It supports multiple model families and backends, enabling text, image, and audio workloads on consumer hardware, with optional GPU acceleration.
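
Because the API mirrors OpenAI's, existing OpenAI SDKs can usually be pointed at a LocalAI instance just by changing the base URL. A minimal sketch in Python, assuming LocalAI is listening on its default port 8080 and that the model name matches one you have installed via the model gallery or a config file:

  # Minimal sketch: reuse the official OpenAI Python SDK against LocalAI.
  # The port (8080) is LocalAI's default; the model name is a placeholder
  # and must match a model installed on your instance.
  from openai import OpenAI

  client = OpenAI(
      base_url="http://localhost:8080/v1",  # LocalAI endpoint instead of api.openai.com
      api_key="not-needed",                 # LocalAI does not require a key by default
  )

  response = client.chat.completions.create(
      model="gpt-4",  # placeholder: use the name your model config exposes
      messages=[{"role": "user", "content": "Summarize what LocalAI does in one sentence."}],
  )
  print(response.choices[0].message.content)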

Key Features

  • OpenAI-compatible REST API for integrating with existing apps and SDKs
  • Multi-backend local inference, including GGUF via llama.cpp and Transformers-based models
  • Image generation support (Diffusers/Stable Diffusion-class workflows)
  • Audio capabilities such as speech generation (TTS) and audio transcription
  • Web UI for basic testing and model management
  • Model management via gallery and configuration files, with automatic backend selection
  • Optional distributed and peer-to-peer inference capabilities

Use Cases

  • Replace cloud LLM APIs for private chat and internal tooling
  • Run local multimodal prototypes (text, image, audio) behind a unified API
  • Provide an on-prem inference endpoint for products needing OpenAI API compatibility

Limitations and Considerations

  • Capabilities and quality depend heavily on the selected model and backend
  • Some advanced features may require GPU-specific images or platform-specific setup

LocalAI is a practical foundation for building a local-first AI stack, especially when OpenAI API compatibility is a requirement. It offers flexible deployment options and broad model support to cover common generative AI workloads.

42.1k stars · 3.4k forks

#2 ebook2audiobook

Self-hostable tool to convert non-DRM eBooks into audiobooks with chapter support, metadata, multilingual TTS engines, and optional voice cloning via a web UI or CLI.

ebook2audiobook is a tool for generating audiobooks from non-DRM, legally acquired eBooks using multiple text-to-speech (TTS) engines. It can run with a Gradio web interface or in headless/CLI mode, and supports multilingual narration with optional voice cloning.
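
For scripted or batch conversions, the Gradio interface can be skipped and the tool run headless from the command line. The exact flags depend on the version you install, so the names below (--headless, --ebook, --voice, --language) are assumptions to verify against the project's README:

  python app.py --headless --ebook /books/my-novel.epub --voice /voices/reference.wav --language eng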

Key Features

  • Converts many input formats including EPUB, MOBI/AZW3, FB2, PDF, DOC/DOCX, HTML, RTF, TXT, and image-based documents
  • OCR support for scanned pages and image-based eBooks
  • Multiple TTS engine options (including XTTSv2 and others) with broad language coverage
  • Optional voice cloning using a provided reference voice file
  • Supports custom XTTSv2 model uploads (e.g., zipped model artifacts)
  • Outputs common audiobook/audio formats including MP3, M4B, M4A, AAC, FLAC, OGG, WAV, and WebM
  • Runs on CPU or accelerators (CUDA and other backends depending on environment)

Use Cases

  • Converting personal eBook libraries into listenable audiobooks with chapters and metadata
  • Producing multilingual narration for accessibility, language learning, or travel
  • Creating custom-voice narration for personal use using voice cloning

Limitations and Considerations

  • Intended for non-DRM, legally acquired eBooks; DRM-protected sources require separate lawful handling
  • OCR quality and document structure (especially EPUB chapter boundaries) can affect chapter splitting and narration results

ebook2audiobook is well-suited for users who want a local web UI and a batch-capable CLI for audiobook generation, while keeping flexibility in TTS engines, languages, and output formats. With GPU acceleration and suitable TTS models, throughput and audio quality improve significantly for longer books.

17k stars · 1.4k forks

#3 Speaches

Self-hosted, OpenAI API-compatible server for streaming transcription, translation, and speech generation using faster-whisper and TTS engines like Piper and Kokoro.


Speaches is an OpenAI API-compatible server for speech-to-text, translation, and text-to-speech, designed to be a local “model server” for voice workflows. It supports streaming and realtime interactions so applications can transcribe or generate audio with minimal integration changes.
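
Because the endpoints follow the OpenAI audio API, the OpenAI Python SDK can typically be reused with only the base URL changed. A minimal sketch, assuming a Speaches instance on localhost port 8000; the model and voice identifiers are placeholders that must match models actually downloaded into the server:

  # Minimal sketch: speech-to-text and text-to-speech against a local Speaches
  # server through the OpenAI Python SDK. Host, port, model ids, and the voice
  # name are assumptions; substitute whatever your instance serves.
  from openai import OpenAI

  client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

  # Transcribe a local recording via the OpenAI-style transcription endpoint.
  with open("meeting.wav", "rb") as audio_file:
      transcript = client.audio.transcriptions.create(
          model="Systran/faster-whisper-small",  # placeholder faster-whisper model
          file=audio_file,
      )
  print(transcript.text)

  # Synthesize speech with a Kokoro-style TTS model and save it to a file.
  speech = client.audio.speech.create(
      model="speaches-ai/Kokoro-82M-v1.0-ONNX",  # placeholder TTS model id
      voice="af_heart",                          # placeholder voice id
      input="Your meeting summary is ready.",
  )
  speech.write_to_file("reply.mp3")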

Key Features

  • OpenAI API compatibility for integrating with existing OpenAI SDKs and tools
  • Streaming transcription via Server-Sent Events (SSE) for incremental results
  • Speech-to-text powered by faster-whisper, with support for transcription and translation
  • Text-to-speech using Piper and Kokoro models
  • Realtime API support for low-latency voice interactions
  • Dynamic model loading and offloading based on request parameters and inactivity
  • CPU and GPU execution support
  • Deployable with Docker and Docker Compose, and designed to be highly configurable

Use Cases

  • Replace hosted speech APIs with a self-managed, OpenAI-compatible voice backend
  • Build realtime voice assistants that need streaming STT and fast TTS responses
  • Batch transcription/translation pipelines for recordings with optional sentiment analysis

Speaches is a practical choice when you want OpenAI-style endpoints for voice features while retaining control over models and infrastructure. It fits well into existing OpenAI-oriented application stacks while focusing specifically on TTS/STT workloads.

2.8k stars · 356 forks

Why choose an open source alternative?

  • Data ownership: Keep your data on your own servers
  • No vendor lock-in: Freedom to switch or modify at any time
  • Cost savings: Reduce or eliminate subscription fees
  • Transparency: Audit the code and know exactly what's running