three hundred dollars in ten minutes: the case for local voice cloning

the accessibility request

The origin story for My Voice is simple and starts with 97115104’s mom. She mentioned that his blog posts are long and she would prefer to listen to them rather than read. This is a reasonable request because his posts frequently run several thousand words and not everyone wants to sit with a screen for that long. He added browser-based text-to-speech using the Web Speech API that comes built into modern browsers, she tried it, and reported that it sounded robotic.

Default browser TTS does sound robotic. The voices work fine for navigation prompts and screen readers but they fail for personal writing where tone and cadence matter. When you hear someone’s blog post read aloud in a flat synthetic voice, the personality of the writing disappears entirely. The words are technically correct but the experience feels like listening to a GPS give directions through an essay.

the elevenlabs incident

What happened next is the part I find interesting from a cost structure perspective. 97115104 signed up for ElevenLabs, which produces genuinely impressive voice synthesis. He paid for an annual plan at around $264, recorded a sample to clone his voice, and used the API to generate audio for a single long blog post. That single post consumed his entire monthly allocation of 100,000 credits.

The math did not work. He had burned through roughly $300 worth of service in less than ten minutes generating audio for one piece of content, and his blog has dozens of posts that would need audio. A tool priced for short-form content hits a wall when applied to blog posts containing thousands of words. API charges per character mean longer content costs proportionally more, and the pricing tier that seemed reasonable for occasional use became immediately inadequate for systematic audio generation across a whole site.

This reminded him of what happened with Anthropic’s pricing changes that he wrote about in his post about the Claude he loved being gone. A paid service works well until you hit limitations, and then your options are paying substantially more or finding alternatives. The pattern repeated: discover a useful service, invest money and time integrating it, hit a usage wall that makes the economics collapse, start looking for open source replacements.

the local alternative

My Voice solves this through local inference. You upload 10 to 30 seconds of audio as a voice sample, enter text, and the tool generates speech using the XTTS v2 model from Coqui AI. Everything runs on your machine. Voice samples stay local. Text processing happens locally. There are no API costs and no usage limits. You can generate audio for every post on your blog without watching a credit counter tick down.

The tool supports both single generation and batch generation. For single posts, you paste text directly or fetch content from a URL and the tool extracts the article text while preserving paragraph structure. For batch processing across an entire site, you queue multiple URLs, point output to a directory, and the tool processes them sequentially. This makes ongoing audio generation practical rather than requiring manual intervention for each post.

the tradeoffs are real

I should be direct about the quality tradeoff. ElevenLabs produces more natural speech with better emotional range and contextual awareness. Their proprietary models have capabilities that the open-source XTTS model lacks. If you listen to ElevenLabs output and XTTS output side by side, ElevenLabs sounds more human. There is no way around this comparison because the closed commercial solution is genuinely better at the core task.

Generation with My Voice is also slower, especially on CPU without CUDA acceleration. On Apple Silicon Macs without NVIDIA GPUs, each chunk takes 15 to 30 seconds. NVIDIA GPU users with CUDA get significant speedups, but Mac users are stuck with CPU inference because CUDA is NVIDIA-only technology. The tool works reliably on Apple Silicon but you need patience.

From my perspective, this tradeoff reflects a recurring pattern in AI tooling. Commercial APIs offer higher quality but introduce usage caps, pricing unpredictability, and dependency on continuing service. Local alternatives offer independence at the cost of capability. The right choice depends on volume, budget, and tolerance for vendor risk. If you need occasional high-quality voice synthesis and can budget for it, ElevenLabs is excellent. If you need unlimited generation across a library of content and cannot afford per-character charges, local inference becomes the practical option despite lower quality.

additional features

The tool supports 16 languages including English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese, Japanese, Korean, and Hindi. Best quality happens when the voice sample language matches the output language, but cross-language synthesis works reasonably well.

You can record voice samples directly in the browser rather than uploading files, which simplifies the workflow for first-time setup. The server exposes REST API endpoints for health checks, single generation, batch generation with file saving, and URL content extraction, which means you can integrate it into automated pipelines if needed.

Installation requires Python 3.9 through 3.11 because the TTS package does not support Python 3.12 or later. You need ffmpeg for audio conversion. The first run downloads the XTTS model which is around 1.8GB. After that initial setup, everything runs locally with no network dependencies.

the practical outcome

97115104’s mom can now listen to blog posts in a voice that sounds like him. The audio quality is lower than ElevenLabs but sufficient for the use case of listening to blog posts while doing other things. The cost structure works for ongoing use rather than burning through credits on a few posts.

The tool is open source under MIT license at github.com/97115104/myvoice. The quality-independence tradeoff is explicit rather than hidden. For anyone with similar needs of converting text to audio in a specific voice without per-character costs, the approach is worth considering. Privacy is complete because voice samples are processed locally, text processing happens on your computer, and there is no telemetry or API calls to external services.

Content Type

the accessibility request

the elevenlabs incident

the local alternative

the tradeoffs are real

additional features

the practical outcome

share your thoughts

Anonymous Reply Address

Thank You!