Audio & TTS

Read-along sync, selection-to-tutor

Word-karaoke TTS and selection-to-tutor.

ArthurAI™ ships rich audio and selection-driven UX as a first-class capability, not an accessibility afterthought. Each lesson block gets its own neural audio, synthesised via Azure Speech and cached in tenant blob storage, with word-level timing metadata for synchronised read-along highlighting. A floating audio control supports playback speeds from 0.75× to 2×. Highlighting any text in a lesson surfaces a contextual toolbar with "Read aloud" and "Ask Arthur", bridging passive reading and active tutoring.

The audio layer

Five things the audio pipeline does that most don’t.

Per-block neural audio
Each lesson block synthesises its own audio. Granularity matters: learners pause, skip, and repeat at the unit of pedagogy.
Word-level timing
The TTS pipeline returns time-aligned word positions alongside the audio. The reader sees each word highlight as it is spoken.
Hash-based caching
Tenant blob storage caches audio keyed by content hash, so the same paragraph is never synthesised twice. Cache hits return instantly, with no re-synthesis.
Speed control
Cycles through 0.75× · 1× · 1.25× · 1.5× · 2×. The chosen speed is persisted per student.
Voice catalogue
Filter by locale, gender, and style, so the voice itself adapts to the institution and the learner.
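
To make the hash-based caching concrete, here is a minimal sketch of how a content-keyed audio cache could behave. It is an illustration, not ArthurAI's implementation: the names `audio_cache_key`, `AudioCache`, and `fake_synthesise` are hypothetical, the in-memory dictionary stands in for tenant blob storage, and including the voice name in the key and normalising whitespace are assumptions.

```python
import hashlib
from typing import Callable, Dict


def audio_cache_key(text: str, voice: str) -> str:
    """Derive a storage key from the block text and voice (both assumed inputs)."""
    # Assumption: collapse whitespace so purely cosmetic edits still hit the cache.
    normalised = " ".join(text.split())
    digest = hashlib.sha256(f"{voice}\n{normalised}".encode("utf-8")).hexdigest()
    return f"tts/{digest}.mp3"


class AudioCache:
    """Content-addressed cache; a dict stands in for tenant blob storage."""

    def __init__(self, synthesise: Callable[[str, str], bytes]) -> None:
        self._synthesise = synthesise
        self._store: Dict[str, bytes] = {}
        self.synth_calls = 0  # counts cache misses, for illustration only

    def get_audio(self, text: str, voice: str) -> bytes:
        key = audio_cache_key(text, voice)
        if key not in self._store:
            # Cache miss: synthesise once and store under the content hash.
            self.synth_calls += 1
            self._store[key] = self._synthesise(text, voice)
        return self._store[key]


def fake_synthesise(text: str, voice: str) -> bytes:
    """Placeholder for a real TTS call; returns dummy bytes."""
    return f"{voice}:{text}".encode("utf-8")


cache = AudioCache(fake_synthesise)
cache.get_audio("Photosynthesis converts light into energy.", "voice-a")
cache.get_audio("Photosynthesis converts light into energy.", "voice-a")
print(cache.synth_calls)  # 1: the second request is a cache hit
```

Because the key is derived from the content, editing a paragraph changes its hash and the edited text is synthesised afresh, which is one way to read "hash-based invalidation".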

Highlight any text

Two actions surface when the learner selects text inside a lesson.

"Read aloud"
The learner highlights a passage and it plays through the same neural TTS pipeline with word-level sync. Useful for emerging readers, learners with dyslexia, and L2 learners.
"Ask Arthur"
The learner highlights a passage that confused them, and the tutor opens with that exact passage as starting context. The lesson and the tutor meet at the moment of confusion, not at a separate help surface.
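
The word-level sync used by read-along playback comes down to mapping the current playback position to a word. The sketch below shows one way to do that lookup with a binary search. The `WordTiming` shape, the millisecond units, and the function name `active_word_index` are assumptions for illustration; the actual metadata format is not described here.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class WordTiming:
    """Hypothetical timing record: one word and its span in the audio."""
    text: str
    start_ms: int
    end_ms: int


def active_word_index(timings: List[WordTiming], position_ms: int) -> Optional[int]:
    """Return the index of the word being spoken at position_ms, or None.

    Assumes timings are sorted by start_ms and do not overlap.
    """
    starts = [t.start_ms for t in timings]
    # Last word whose start is at or before the playback position.
    i = bisect_right(starts, position_ms) - 1
    if i < 0:
        return None  # before the first word
    if position_ms >= timings[i].end_ms:
        return None  # in a pause between words, or past the end
    return i


timings = [
    WordTiming("The", 0, 180),
    WordTiming("cell", 200, 520),
    WordTiming("divides", 540, 1100),
]
idx = active_word_index(timings, 600)
print(timings[idx].text if idx is not None else None)  # divides
```

If the player reports its position in media time, as the HTML audio element does, the same lookup works unchanged at any playback speed, since the timings are relative to the audio and not to the wall clock.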

Code is truth

01
Neural TTS via Azure Speech adapter; voice catalogue filterable by locale, gender, and style.
02
Per-block audio cached in tenant blob storage with hash-based invalidation; cache hits return instantly without re-synthesis.
03
Word-level timing metadata returned alongside audio; client renders synchronised word highlights during playback.
04
Floating audio control with 0.75× / 1× / 1.25× / 1.5× / 2× speed cycling.
05
TextSelectionTooltip: floating pill on any text selection within a lesson, with "Read aloud" and "Ask Arthur" actions.
06
Per-student TTS preferences persisted (voice, speed, auto-play); preference changes audited.
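
As an illustration of speed cycling with persisted, audited preferences, a sketch along these lines is one possibility. The speed steps come from the list above; everything else (`TtsPreferences`, `next_speed`, `cycle_speed`, the audit entry shape, and falling back to 1× for an unrecognised speed) is a hypothetical design, with a plain list standing in for the audit store.

```python
from dataclasses import dataclass, replace
from typing import Dict, List

# Speed steps as listed for the floating audio control.
SPEEDS = (0.75, 1.0, 1.25, 1.5, 2.0)


@dataclass(frozen=True)
class TtsPreferences:
    """Hypothetical per-student preference record."""
    voice: str
    speed: float = 1.0
    auto_play: bool = False


def next_speed(current: float) -> float:
    """Return the next step in the cycle, wrapping from 2x back to 0.75x."""
    try:
        i = SPEEDS.index(current)
    except ValueError:
        return 1.0  # assumption: unknown values reset to normal speed
    return SPEEDS[(i + 1) % len(SPEEDS)]


def cycle_speed(prefs: TtsPreferences, audit_log: List[Dict]) -> TtsPreferences:
    """Advance the speed and record the change in the audit log."""
    updated = replace(prefs, speed=next_speed(prefs.speed))
    audit_log.append({"field": "speed", "old": prefs.speed, "new": updated.speed})
    return updated


audit: List[Dict] = []
prefs = TtsPreferences(voice="voice-a", speed=1.5)
prefs = cycle_speed(prefs, audit)  # 1.5 -> 2.0
prefs = cycle_speed(prefs, audit)  # 2.0 -> 0.75 (wraps around)
print(prefs.speed, len(audit))  # 0.75 2
```

Keeping the preference record immutable means every change produces a new value, which makes it straightforward to log the old and new state side by side.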

