speak - Talk to your Claude!

Give your agent the ability to speak to you in real time. Local text-to-speech, voice cloning, and audio generation on Apple Silicon.

Prerequisites

Requirement        Check             Install
Apple Silicon Mac  uname -m → arm64  Intel not supported
macOS 12.0+        sw_vers
sox                which sox         brew install sox
ffmpeg             which ffmpeg      brew install ffmpeg
poppler (PDF)      which pdftotext   brew install poppler
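
The checks above can be run in one pass. The following is a minimal sketch, not part of speak itself: the function names check_tool and check_prereqs are made up for illustration, and the macOS version check is left out.

```shell
# Report whether a single command is available on PATH.
check_tool() {
  # command -v is the portable way to look up a command
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# Check the architecture and the three Homebrew tools from the table.
check_prereqs() {
  local missing=0 tool
  if [ "$(uname -m)" != "arm64" ]; then
    echo "warning: not arm64 (Apple Silicon required)"
    missing=1
  fi
  for tool in sox ffmpeg pdftotext; do
    check_tool "$tool" || missing=1
  done
  return "$missing"
}
```

Run check_prereqs after sourcing the snippet, then install anything reported as missing with the matching brew install command.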

Input Sources

Source         Example
Text file      speak article.txt
Markdown       speak doc.md
Direct string  speak "Hello"
Clipboard      pbpaste | speak
Stdin          cat file.txt | speak

Web Articles

lynx -dump -nolist "https://example.com/article" | speak --output article.wav

Converting Formats

Format  Convert Command
PDF     pdftotext doc.pdf doc.txt
DOCX    textutil -convert txt doc.docx
HTML    pandoc -f html -t plain doc.html > doc.txt

Output Modes

Goal                    Command
Save for later          speak text.txt --output file.wav
Listen now (streaming)  speak text.txt --stream
Listen now (complete)   speak text.txt --play
Both                    speak text.txt --stream --output file.wav

Default Behavior

speak article.txt          # → ~/Audio/speak/article.wav (no playback)
speak "Hello"              # → ~/Audio/speak/speak_<timestamp>.wav

Directory Auto-Creation

Voice Cloning

Voice cloning generates speech that matches your vocal characteristics (pitch, tone, cadence) from a short recording.

Quality Expectations

Output captures general voice characteristics but is not a perfect replica
Quality depends heavily on sample quality
15-25 seconds is optimal (10s minimum, 30s maximum)

Recording Your Voice

Using QuickTime:

Open QuickTime Player → File → New Audio Recording
Record 20 seconds of clear speech
File → Export As → Audio Only (.m4a)
Convert to WAV (see below)

Using sox (command line):

# Create the target directory first; sox does not create it
mkdir -p ~/.chatter/voices/
# -d = use default microphone
# Recording starts immediately and stops after 25 seconds
sox -d -r 24000 -c 1 ~/.chatter/voices/my_voice.wav trim 0 25

Converting to Required Format

Voice samples MUST be: WAV, 24000 Hz, mono, 10-30 seconds.

# From MP3
ffmpeg -i voice.mp3 -ar 24000 -ac 1 voice.wav

# From M4A (QuickTime)
ffmpeg -i voice.m4a -ar 24000 -ac 1 voice.wav

# Trim to 25 seconds
ffmpeg -i long.wav -t 25 -ar 24000 -ac 1 trimmed.wav

# Check sample properties
ffprobe -i voice.wav 2>&1 | grep -E "Duration|Stream"
# Should show: Duration ~15-25s, 24000 Hz, mono
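
The format requirements above can also be checked by a script instead of by eye. This is a sketch under assumptions: voice_spec_ok and check_voice_sample are hypothetical helper names, the limits are the ones stated in this section (24000 Hz, mono, 10-30 seconds), and the WAV container itself is not verified.

```shell
# Pure check on three values: sample rate (Hz), channel count,
# and duration in seconds (may be fractional).
voice_spec_ok() {
  local rate=$1 channels=$2 duration=$3
  [ "$rate" = "24000" ] || { echo "bad sample rate: $rate"; return 1; }
  [ "$channels" = "1" ] || { echo "not mono: $channels channels"; return 1; }
  # awk handles the fractional seconds that ffprobe reports
  awk -v d="$duration" 'BEGIN { exit !(d >= 10 && d <= 30) }' \
    || { echo "bad duration: ${duration}s"; return 1; }
  echo "ok"
}

# Wrapper: read the three values from a file with ffprobe, then validate.
check_voice_sample() {
  local file=$1 rate channels duration
  rate=$(ffprobe -v error -select_streams a:0 \
    -show_entries stream=sample_rate -of csv=p=0 "$file")
  channels=$(ffprobe -v error -select_streams a:0 \
    -show_entries stream=channels -of csv=p=0 "$file")
  duration=$(ffprobe -v error -show_entries format=duration -of csv=p=0 "$file")
  voice_spec_ok "$rate" "$channels" "$duration"
}
```

For example, check_voice_sample voice.wav prints ok for a compliant sample and a one-line reason otherwise.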

Using Your Voice

# Create directory
mkdir -p ~/.chatter/voices/

# Move sample
mv voice.wav ~/.chatter/voices/my_voice.wav

# Test
speak "Testing my voice" --voice ~/.chatter/voices/my_voice.wav --stream

# Use for content
speak notes.txt --voice ~/.chatter/voices/my_voice.wav --output presentation.wav

Path requirements:

✓ Works: ~/.chatter/voices/my_voice.wav (tilde expanded by shell)
✓ Works: /Users/name/.chatter/voices/my_voice.wav
✗ Fails: my_voice.wav (relative path)
✗ Fails: ./voices/my_voice.wav (relative path)
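
Since relative paths fail, one workaround is to resolve the path before passing it to --voice. The helper below is a sketch; abs_path is a hypothetical name, and it assumes the directory containing the file already exists.

```shell
# Turn a possibly relative file path into an absolute one.
abs_path() {
  local dir base
  # Resolve the containing directory in a subshell so the caller's cwd is untouched
  dir=$(cd "$(dirname "$1")" && pwd) || return 1
  base=$(basename "$1")
  [ "$dir" = "/" ] && dir=""   # avoid a double slash for files in the root
  printf '%s/%s\n' "$dir" "$base"
}
```

Usage: speak "Testing my voice" --voice "$(abs_path ./voices/my_voice.wav)" --stream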

Voice Sample Tips

Good Sample     Bad Sample
Quiet room      Background noise
Natural pace    Rushed or monotone
Clear diction   Mumbling
Varied content  Repetitive phrases

Default Voice

When --voice is omitted, a built-in default voice is used:

speak "Hello world" --stream # Uses default voice

Emotion Tags

Tags produce audible effects (actual sounds), not spoken words:

speak "[sigh] Monday again." --stream
# Output: (sigh sound) "Monday again."

Tag             Effect
[laugh]         Laughter
[chuckle]       Light chuckle
[sigh]          Sighing
[gasp]          Gasping
[groan]         Groaning
[clear throat]  Throat clearing
[cough]         Coughing
[crying]        Crying
[singing]       Sung speech

Not supported: [pause] and [whisper] (both are ignored).

For pauses, use punctuation instead: "Wait... let me think."
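
If existing scripts already contain the unsupported tags, they can be rewritten before synthesis. A small sketch, assuming a hypothetical filter named prepare_tags: it turns [pause] into an ellipsis, following the punctuation advice above, and drops [whisper]; supported tags pass through unchanged.

```shell
# Read text on stdin, write cleaned text on stdout.
prepare_tags() {
  # 1st expression: "[pause]" (and any spaces before it) becomes "..."
  # 2nd expression: "[whisper]" (and any spaces after it) is removed
  sed -e 's/ *\[pause\]/.../g' -e 's/\[whisper\] *//g'
}
```

Usage: prepare_tags < script.txt | speak --stream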

Batch Processing

mkdir -p ~/Audio/book/
speak ch01.txt ch02.txt ch03.txt --output-dir ~/Audio/book/
# Creates: ch01.wav, ch02.wav, ch03.wav

# With auto-chunking (for long files)
speak chapters/*.txt --output-dir ~/Audio/book/ --auto-chunk

# Skip completed files
speak chapters/*.txt --output-dir ~/Audio/book/ --skip-existing

Auto-Chunk Behavior

When using --auto-chunk with batch processing:

Each input file is chunked independently
Chunks are generated and automatically concatenated per file
Final output: one .wav per input file (e.g., ch01.wav)
Intermediate chunks deleted (unless --keep-chunks)

You don't need to concatenate chunks manually; only concatenate the final chapter files.
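
To get a feel for how many chunks a file will produce, the character count can be divided by the chunk size. This is only an approximation under assumptions: estimate_chunks is a hypothetical helper, the 6000-character default is taken from the --chunk-size entry in the Options Reference, and the real chunker may split at sentence or paragraph boundaries and so produce a different count.

```shell
# Approximate chunk count: ceil(characters / chunk size).
estimate_chunks() {
  local file=$1 size=${2:-6000} chars
  chars=$(wc -m < "$file")                  # character count (locale dependent)
  echo $(( (chars + size - 1) / size ))     # integer ceiling division
}
```

Usage: estimate_chunks ch01.txt, or estimate_chunks ch01.txt 3000 for a smaller chunk size.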

Concatenating Audio

# Explicit order (recommended)
speak concat ch01.wav ch02.wav ch03.wav --output book.wav

# Glob pattern (REQUIRES zero-padded filenames)
speak concat audiobook/*.wav --output book.wav

Zero-Padding Rules

Critical for correct concatenation order:

Files  Correct             Wrong
1-9    01, 02, ..., 09     1, 2, ..., 9
10-99  01, 02, ..., 99     1, 10, 2, ...
100+   001, 002, ..., 999  1, 100, 2, ...

Why: Shell glob expansion sorts names lexicographically, character by character, so unpadded names expand as 1, 10, 2, while zero-padded names expand as 01, 02, 10.
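
Existing unpadded files can be renamed rather than regenerated. The sketch below assumes bash and the ch<N>.txt naming used in this document; zero_pad and pad_chapter_files are hypothetical helper names.

```shell
# Zero-pad a number to a given width (default 2).
zero_pad() {
  # 10# forces base 10 so inputs like 08 are not parsed as octal (bash)
  printf "%0${2:-2}d\n" "$((10#$1))"
}

# Rename ch1.txt, ch2.txt, ... in the current directory to ch01.txt, ch02.txt, ...
pad_chapter_files() {
  local width=${1:-2} f n target
  for f in ch[0-9]*.txt; do
    [ -e "$f" ] || continue                  # glob matched nothing
    n=${f#ch}; n=${n%.txt}                   # ch10.txt -> 10
    case $n in *[!0-9]*) continue ;; esac    # skip names like ch1a.txt
    target="ch$(zero_pad "$n" "$width").txt"
    [ "$f" = "$target" ] || mv -n "$f" "$target"   # -n: never overwrite
  done
}
```

After pad_chapter_files, a glob such as ch*.txt expands in chapter order. Pass 3 as the argument for books with 100 or more chapters.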

PDF to Audiobook (Complete Workflow)

Step 1: Find Chapter Boundaries

# Preview table of contents
pdftotext -f 1 -l 5 textbook.pdf toc.txt
cat toc.txt  # Note chapter page numbers

# Or search for "Chapter" markers
pdftotext textbook.pdf - | grep -n "Chapter"

Step 2: Extract Chapters (Zero-Padded!)

# For 100-page book with ~10 chapters
pdftotext -f 1 -l 12 -layout textbook.pdf ch01.txt
pdftotext -f 13 -l 25 -layout textbook.pdf ch02.txt
pdftotext -f 26 -l 38 -layout textbook.pdf ch03.txt
# ... continue for all chapters
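
Typing one pdftotext line per chapter gets tedious for longer books. As a sketch, the commands can be generated from the chapter start pages found in Step 1. chapter_commands is a hypothetical helper; it only prints the commands so they can be reviewed first, and it does not quote file names, so it assumes a PDF name without spaces.

```shell
# Print one pdftotext command per chapter.
# Usage: chapter_commands <pdf> <last_page> <start_page>...
chapter_commands() {
  local pdf=$1 last=$2 i=1 first end
  shift 2
  while [ $# -gt 0 ]; do
    first=$1
    shift
    # A chapter ends one page before the next one starts;
    # the final chapter ends on the last page.
    if [ $# -gt 0 ]; then end=$(( $1 - 1 )); else end=$last; fi
    printf 'pdftotext -f %d -l %d -layout %s ch%02d.txt\n' \
      "$first" "$end" "$pdf" "$i"
    i=$(( i + 1 ))
  done
}
```

chapter_commands textbook.pdf 38 1 13 26 prints the three commands shown above. Once the output looks right, pipe it to sh to run it.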

Step 3: Estimate Time

speak --estimate ch*.txt
# Shows: total audio duration, generation time, storage needed

# Quick estimates:
# 1 page ≈ 2 min audio ≈ 1 min generation
# 100 pages ≈ 200 min audio ≈ 100 min generation ≈ 500 MB
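
The quick estimates are simple multiplication, worked through below for any page count. This is a sketch of the rule of thumb only, not of what speak --estimate computes; estimate_pages is a hypothetical name, and the 2.5 MB per audio minute figure is the one implied by 500 MB for 200 minutes.

```shell
# Rule-of-thumb estimate from a page count:
# 2 min audio per page, 1 min generation per page, 2.5 MB per audio minute.
estimate_pages() {
  local pages=$1
  local audio=$(( pages * 2 ))
  local gen=$pages
  local mb=$(( audio * 5 / 2 ))   # 2.5 MB/min in integer arithmetic
  echo "${audio} min audio, ${gen} min generation, ${mb} MB"
}
```

For example, estimate_pages 100 prints: 200 min audio, 100 min generation, 500 MB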

Step 4: Generate Audio

mkdir -p audiobook/
speak ch01.txt ch02.txt ch03.txt --output-dir audiobook/ --auto-chunk
# Creates: audiobook/ch01.wav, audiobook/ch02.wav, audiobook/ch03.wav

Step 5: Concatenate

speak concat audiobook/ch01.wav audiobook/ch02.wav audiobook/ch03.wav --output complete_audiobook.wav
# Or with glob (only if zero-padded):
speak concat audiobook/ch*.wav --output complete_audiobook.wav

PDF Troubleshooting

Issue               Solution
Empty/garbled text  Scanned PDF; use OCR: brew install tesseract
Wrong encoding      Try: pdftotext -enc UTF-8 doc.pdf
Check word count    pdftotext doc.pdf - | wc -w (should be >100)

Multi-Voice Content

mkdir -p podcast/scripts podcast/wav

echo "Welcome to the show." > podcast/scripts/01_host.txt
echo "Thanks for having me." > podcast/scripts/02_guest.txt

speak podcast/scripts/01_host.txt --voice ~/.chatter/voices/host.wav --output podcast/wav/01.wav
speak podcast/scripts/02_guest.txt --voice ~/.chatter/voices/guest.wav --output podcast/wav/02.wav

speak concat podcast/wav/01.wav podcast/wav/02.wav --output podcast.wav
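
With more than a handful of lines, the per-file commands above can be derived from the script file names. The sketch below assumes the NN_speaker.txt naming from this example and one voice sample per speaker under ~/.chatter/voices/; voice_for and podcast_commands are hypothetical helpers, and the commands are only printed, not run.

```shell
# Map "NN_speaker.txt" to the voice sample for that speaker.
voice_for() {
  local name speaker
  name=$(basename "$1" .txt)    # 01_host
  speaker=${name#*_}            # host
  printf '%s/.chatter/voices/%s.wav\n' "$HOME" "$speaker"
}

# Print one speak command per script file, in the order given.
podcast_commands() {
  local script n
  for script in "$@"; do
    n=$(basename "$script" .txt)
    n=${n%%_*}                  # 01
    printf 'speak %s --voice %s --output podcast/wav/%s.wav\n' \
      "$script" "$(voice_for "$script")" "$n"
  done
}
```

Usage: podcast_commands podcast/scripts/*.txt prints the commands in zero-padded order; review them, then pipe to sh and finish with speak concat as shown above.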

Options Reference

Option              Description               Default
--stream            Stream as it generates    false
--play              Play after complete       false
--output <path>     Output file               ~/Audio/speak/
--output-dir <dir>  Batch output directory
--voice <path>      Voice sample (full path)  default
--timeout <sec>     Timeout per file          300
--auto-chunk        Split long documents      false
--chunk-size <n>    Chars per chunk           6000
--resume <file>     Resume from manifest
--keep-chunks       Keep intermediate files   false
--skip-existing     Skip if output exists     false
--estimate          Show duration estimate    false
--dry-run           Preview only              false
--quiet             Suppress output           false

Commands

Command            Description
speak setup        Set up environment
speak health       Check system status
speak models       List TTS models
speak concat       Concatenate audio
speak daemon kill  Stop TTS server
speak config       Show configuration

Performance

Metric      Value
Cold start  ~4-8s
Warm start  ~3-8s
Speed       0.3-0.5x RTF (faster than real-time)
Storage     ~2.5 MB/min, ~150 MB/hour

Resume Capability

For interrupted long generations:

# Single file with auto-chunk — use --resume
speak long.txt --auto-chunk --output book.wav
# If interrupted, manifest saved at ~/Audio/speak/manifest.json
speak --resume ~/Audio/speak/manifest.json

# Batch processing — use --skip-existing
speak ch*.txt --output-dir audiobook/ --auto-chunk
# If interrupted, re-run same command:
speak ch*.txt --output-dir audiobook/ --auto-chunk --skip-existing
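
Before re-running an interrupted batch, it can help to see which inputs still lack an output. The sketch below assumes the naming shown earlier (ch01.txt produces ch01.wav in the output directory); pending_inputs is a hypothetical helper. It only tests whether the file exists, so a partially written .wav from the interrupted run would count as done and may need to be deleted by hand.

```shell
# List inputs whose .wav output is missing from the output directory.
# Usage: pending_inputs <output_dir> <input.txt>...
pending_inputs() {
  local outdir=${1%/} f
  shift
  for f in "$@"; do
    [ -e "$outdir/$(basename "$f" .txt).wav" ] || printf '%s\n' "$f"
  done
}
```

Usage: pending_inputs audiobook/ ch*.txt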

Common Errors

Error                             Cause              Solution
"Voice file not found"            Relative path      Use full path: ~/.chatter/voices/x.wav
"Invalid WAV format"              Wrong specs        Convert: ffmpeg -i in.wav -ar 24000 -ac 1 out.wav
"Voice sample too short"          <10 seconds        Record 15-25 seconds
"Output directory doesn't exist"  Not created        mkdir -p dirname/
"sox not found"                   Not installed      brew install sox
Scrambled concat order            Non-zero-padded    Use 01, 02, not 1, 2
Timeout                           >5 min generation  Use --auto-chunk or --timeout 600
"Server not running"              Stale daemon       speak daemon kill && speak health

Setup

speak "test"     # Auto-setup on first run (downloads model ~500MB)
speak setup      # Or manual setup
speak health     # Verify everything works

Server Management

The server starts automatically and shuts down after 1 hour of idle time.

speak health        # Check status
speak daemon kill   # Stop manually
