Voice and Speech -- I Get AI With a Little Help From My Friends

Voice generation is one of the AI categories that "just works" now. Type text, hear it spoken in a natural voice. Or upload thirty seconds of someone speaking and have the model speak any text in that voice. Both are practical, both are cheap, and both have implications worth thinking about.

The main tool: ElevenLabs

ElevenLabs is the dominant tool in this category. It does two things: text-to-speech in dozens of pre-built voices, and voice cloning from a short audio sample. The voices sound natural enough that most listeners cannot tell whether a real person is reading the script. The free tier is generous enough to evaluate properly. The starter plan is US$5 a month (~A$8), which is unusually cheap for an AI tool.

What you would actually use it for: narrating an explainer video, creating an audio version of an article you wrote, generating a voiceover for a presentation, accessibility versions of written content, audiobook drafts, podcast intros. For creators making audio or video content, ElevenLabs has become a standard part of the toolkit.

Voice mode in the chatbots

This is voice as conversation, not voice as narration. Claude, ChatGPT, and Gemini all have voice modes where you talk to the model and it talks back in real time. ChatGPT's is the most natural-sounding of the three, with genuine conversational pacing: it copes with interruptions and with changes of mind mid-sentence. Voice mode is useful for hands-free use (cooking, driving, walking), language practice, or just talking through a problem out loud while doing something else.

The audio-overview alternative

If what you want is "turn this document into something I can listen to while I do other things", NotebookLM's audio overview feature is the simpler answer. It generates a two-person podcast-style discussion of any document you upload, free, in about three minutes. Not voice generation in the cloning sense, but it is the killer feature for converting reading into listening. The Chat With Documents page covers it in detail.

Editing voice and video by editing the transcript

Descript fits in the gap between "voice generation" and "voice cloning". You record yourself talking (a podcast, a video, a screen recording), Descript transcribes the audio, and then you edit the audio by editing the transcript text. Delete a sentence from the transcript and the recording follows. It is the closest thing AI has produced to "Word for spoken audio".
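To make "delete a sentence and the recording follows" concrete, here is a minimal sketch of the general idea behind transcript-based editing. It is not Descript's actual implementation. It assumes the transcription step gives each word a start and end time, and the names Word and keep_segments are made up for this illustration. Deleting words from the transcript then comes down to working out which stretches of audio to keep.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One transcribed word with its position in the recording (hypothetical shape)."""
    text: str
    start: float  # seconds from the start of the recording
    end: float    # seconds from the start of the recording


def keep_segments(words, deleted):
    """Return the (start, end) spans of audio that survive after deleting words.

    `deleted` is a set of indexes into `words`. Neighbouring kept words are
    merged into one span, so each deletion produces a single cut.
    """
    segments = []
    run_start = None
    run_end = None
    for index, word in enumerate(words):
        if index in deleted:
            # A deleted word closes the current run of kept audio, if any.
            if run_start is not None:
                segments.append((run_start, run_end))
                run_start = None
            continue
        if run_start is None:
            run_start = word.start
        run_end = word.end
    if run_start is not None:
        segments.append((run_start, run_end))
    return segments


# Illustrative timings only: remove the "um" from "So um welcome back".
words = [
    Word("So", 0.0, 0.3),
    Word("um", 0.3, 0.7),
    Word("welcome", 0.7, 1.2),
    Word("back", 1.2, 1.6),
]
print(keep_segments(words, deleted={1}))  # [(0.0, 0.3), (0.7, 1.6)]
```

A real editor would then stitch the kept spans together and smooth the joins, which is where most of the hard work lies. The sketch only shows why editing text can drive the audio cut.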

	Descript
What it is	Edit video and podcasts by editing the transcript. Change the words and the audio and video follow. Includes Studio Sound (makes any microphone sound professional), Overdub (your own cloned voice for tweaks) and screen recording, all in one tool.
Best at	Talking-head video and podcast workflows. The default tool for solo podcasters and creators who want the editing speed of a word processor without learning a video timeline
Free tier	Yes, with limited monthly minutes
First paid tier	Hobbyist - US$24/month (~A$38/month)
Ken's take	If you make any kind of talking-head video or podcast, Descript is genuinely transformative. Editing audio by deleting words from a transcript is faster than any timeline editor, especially for the "ums" and false starts you would normally cut manually. Studio Sound rescues recordings made on a laptop microphone in a noisy room. The AI features consume credits quickly, so heavy users hit limits fast. For pure video editing without the talking-head focus, CapCut is cheaper and its timeline is more flexible.
Sign up	https://descript.com

Voice cloning, and the obvious problem

Voice cloning has lowered the bar for impersonation fraud. A scammer with thirty seconds of your voice (a podcast, a video posted on social media, a voicemail greeting) can have the AI speak in your voice indefinitely. The "Hi Mum, I'm in trouble, I need money" call now sounds like your son. The Scams and Deepfakes page covers this in detail and recommends a "callback rule" for any urgent voice request involving money. Read that page if you have not already.

The flip side: legitimate voice-cloning tools require you to confirm consent before cloning. You upload an audio sample of yourself, or of someone who has authorised you. ElevenLabs and most reputable platforms enforce this. Scammers, of course, do not ask. They use cloned voices generated from public audio without anyone's permission. The technology itself cannot tell the difference.

Try this right now (free)

Go to ElevenLabs and try the text-to-speech demo on the homepage. Paste a paragraph of text. Switch through three or four voices. Notice how different the same words feel in different voices. Then open Claude or ChatGPT on your phone and use voice mode. Have a five-minute conversation about something you have been thinking about. The two experiences are completely different in tone and use case, even though the underlying technology belongs to the same family.
