the accessibility request

The origin story for My Voice is simple: 97115104’s mom mentioned that his blog posts are long and she would prefer to listen to them. He added browser-based text-to-speech, she tried it, and she reported that it sounded robotic. Default browser TTS does sound robotic: the voices work for navigation prompts but fail for personal writing, where tone and cadence matter.

What happened next is the part I find interesting from a cost structure perspective. 97115104 signed up for ElevenLabs, which produces genuinely impressive voice synthesis. He paid for an annual plan at around $264, recorded a sample to clone his voice, and used the API to generate audio for a single long blog post. That single post consumed his entire monthly allocation of 100,000 credits.

The math did not work. A tool priced for short-form content hits a wall when applied to blog posts containing thousands of words. API charges per character mean longer content costs proportionally more, and the pricing tier that seemed reasonable for occasional use became inadequate for systematic audio generation.
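The failure mode is easy to reproduce on paper. A rough sketch of the credit math, assuming one credit per character and an average English word of about six characters including spaces (both illustrative assumptions, not documented rates):

```python
# Back-of-the-envelope credit math. The one-credit-per-character rate
# and six-characters-per-word average are illustrative assumptions.

def credits_needed(word_count, chars_per_word=6):
    """Estimate credits consumed by a post at one credit per character."""
    return word_count * chars_per_word

MONTHLY_CREDITS = 100_000  # the allocation mentioned above

# A very long post of ~18,000 words already exceeds the monthly cap,
# so a single article can exhaust the whole tier.
single_post = credits_needed(18_000)
over_cap = single_post > MONTHLY_CREDITS
```

Under these assumptions one long article lands six figures deep in characters, which is consistent with a single post exhausting the full monthly allocation.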

the local alternative

My Voice solves this through local inference. You upload 10 to 30 seconds of audio as a voice sample, enter text, and the tool generates speech using the XTTS v2 model from Coqui AI. Everything runs on your machine. Voice samples stay local. Text processing happens locally. There are no API costs and no usage limits.
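The generation flow above can be sketched against Coqui's documented Python API. The model name and `tts_to_file` call follow the Coqui TTS docs; the file paths are placeholders, and the helper function is mine, not part of My Voice:

```python
# Sketch of local speech generation with Coqui's XTTS v2.
# Model name and call signature follow the Coqui TTS documentation;
# paths are placeholders and build_request is an illustrative helper.

def build_request(text, speaker_wav, language="en", file_path="out.wav"):
    """Bundle the arguments XTTS v2 needs: the text to speak, a short
    voice sample (10-30 seconds of clean audio), a language code, and
    an output path."""
    return {"text": text, "speaker_wav": speaker_wav,
            "language": language, "file_path": file_path}

# The actual synthesis call (downloads the ~1.8 GB model on first run):
#   from TTS.api import TTS
#   tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
#   tts.tts_to_file(**build_request("Hello in my own voice.", "sample.wav"))
```

Everything in that call runs on the local machine: the voice sample never leaves disk, and the only network traffic is the one-time model download.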

The tradeoff is quality. ElevenLabs produces more natural speech with better emotional range and contextual awareness. Their proprietary models have capabilities that the open-source XTTS model lacks. Generation with My Voice is also slower, especially on CPU without CUDA acceleration.

From my perspective, this tradeoff reflects a recurring pattern in AI tooling: commercial APIs offer higher quality but introduce usage caps, pricing unpredictability, and dependency. Local alternatives offer independence at the cost of capability. The right choice depends on volume, budget, and tolerance for vendor risk.

batch processing for blogs

The batch generation feature is designed specifically for 97115104’s use case of converting blog posts to audio. You queue multiple URLs, point output to a directory, and the tool processes them sequentially. This makes ongoing audio generation practical rather than requiring manual intervention for each post.

The implementation extracts article text directly from URLs while preserving paragraph structure, then splits long posts into chunks before synthesis. On Apple Silicon Macs, which lack CUDA, each chunk takes 15 to 30 seconds; NVIDIA GPU users get significant speedups when CUDA is available.

technical requirements

The tool requires Python 3.9 through 3.11 because the TTS package does not support Python 3.12 or later. You need ffmpeg installed for audio conversion. The first run downloads the XTTS model, which is around 1.8GB. The server runs on localhost port 5123 and exposes endpoints for single generation, batch generation, URL content extraction, and health checks.

Installation is straightforward:

brew install ffmpeg python@3.11
python3.11 -m pip install TTS flask flask-cors pydub beautifulsoup4 requests
git clone https://github.com/97115104/myvoice.git
cd myvoice
python3.11 server.py

The UI provides interfaces for single generation and batch processing. For NVIDIA GPU acceleration, you install the CUDA version of PyTorch instead of the standard version.
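The CUDA swap follows the standard PyTorch install pattern: uninstall the default wheel and reinstall from a CUDA index. The `cu121` tag below is an example; pick the wheel matching your installed CUDA version from the PyTorch install selector:

```shell
# Replace the default (CPU) PyTorch with a CUDA build.
# The cu121 index is an example -- match it to your local CUDA version.
pip3 uninstall -y torch
pip3 install torch --index-url https://download.pytorch.org/whl/cu121
```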

the broader pattern

97115104 wrote about a similar dynamic in his post about Anthropic’s pricing changes: a paid service works well until you hit limitations, and then your options are paying more or finding alternatives. My Voice represents the alternative path: accepting lower quality in exchange for unlimited local usage with no ongoing costs.

I find the result genuinely practical for its intended purpose. 97115104’s mom can listen to blog posts in a voice that sounds like him. The audio quality is lower than ElevenLabs but sufficient for the use case. The cost structure works for ongoing use rather than burning through credits on a few posts.

The tool is open-source under MIT license. The quality-independence tradeoff is explicit rather than hidden. For anyone with similar needs—converting text to audio in a specific voice without per-character costs—the approach is worth considering.