Audio & TTS
Read-along sync, selection-to-tutor

Word-karaoke TTS and selection-to-tutor.

ArthurAI™ treats rich audio and selection-driven UX as a first-class capability, not an accessibility afterthought. Each lesson block gets its own neural audio, synthesised via Azure Speech and cached in tenant blob storage, with word-level timing metadata for synchronised read-along highlighting. A floating audio control supports playback speeds from 0.75× to 2×. Highlighting any text in a lesson surfaces a contextual toolbar with "Read aloud" and "Ask Arthur", bridging passive reading and active tutoring.

The audio layer

Five things the audio pipeline does that most don’t.

  • Per-block neural audio

    Each lesson block synthesises its own audio. Granularity matters — pause, skip, repeat at the unit of pedagogy.

  • Word-level timing

    The TTS pipeline returns time-aligned word positions alongside the audio. The reader sees each word highlight as it is spoken.

  • Hash-based caching

    Tenant blob storage caches audio keyed by content hash. The same paragraph never synthesises twice: a cache hit returns the stored audio without calling the synthesiser.

  • Speed control

    The control cycles through 0.75×, 1×, 1.25×, 1.5×, and 2×. The chosen speed is persisted per student.

  • Voice catalogue

    Filter by locale, gender, and style — the voice itself adapts to the institution and the learner.
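The hash-based caching described above can be sketched roughly as follows. This is a minimal illustration, not the product's implementation: `audio_cache_key`, `BlockAudioCache`, the in-memory dictionary standing in for tenant blob storage, and the choice to fold the voice name and whitespace normalisation into the key are all assumptions made for the example.

```python
import hashlib
from typing import Callable, Dict


def audio_cache_key(text: str, voice: str) -> str:
    """Derive a deterministic cache key from block text and voice name."""
    # Assumption: collapse whitespace so cosmetic edits do not force re-synthesis.
    normalised = " ".join(text.split())
    # Assumption: the voice is part of the key, since a different voice
    # produces different audio for the same text.
    payload = f"{voice}\n{normalised}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class BlockAudioCache:
    """Hypothetical cache wrapper; a dict stands in for blob storage."""

    def __init__(self, synthesise: Callable[[str, str], bytes]) -> None:
        self._synthesise = synthesise
        self._store: Dict[str, bytes] = {}
        self.synth_calls = 0  # counts cache misses, for illustration only

    def get_audio(self, text: str, voice: str) -> bytes:
        key = audio_cache_key(text, voice)
        if key not in self._store:
            # Cache miss: synthesise once and store under the content hash.
            self.synth_calls += 1
            self._store[key] = self._synthesise(text, voice)
        # Cache hit (or freshly stored audio): no further synthesis needed.
        return self._store[key]
```

Because the key is derived from the content itself, editing a block changes its key and the stale audio is simply never requested again, which is one plausible reading of "hash-based invalidation".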

Highlight any text

Two actions surface when the learner selects text inside a lesson.

  • "Read aloud"

    The learner highlights a passage, and it plays through the same neural TTS pipeline with word-level sync. Useful for emerging readers, learners with dyslexia, and L2 learners.

  • "Ask Arthur"

    The learner highlights a passage that confused them, and the tutor opens with that exact passage as starting context. The lesson and the tutor meet at the moment of confusion, not at a separate help surface.
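As a rough illustration of how a selection might become tutor context, the sketch below turns a character range inside a lesson block into a small payload. Everything named here (`TutorContext`, `build_tutor_context`, the `block_id` field, and the `window` of surrounding characters) is hypothetical; the source only states that the tutor opens with the exact selected passage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TutorContext:
    block_id: str     # hypothetical identifier of the lesson block
    passage: str      # the exact text the learner selected
    surrounding: str  # assumption: nearby text to help the tutor disambiguate


def build_tutor_context(
    block_id: str, block_text: str, start: int, end: int, window: int = 40
) -> TutorContext:
    """Build the starting context for the tutor from a text selection."""
    if not (0 <= start < end <= len(block_text)):
        raise ValueError("selection range is outside the block")
    passage = block_text[start:end].strip()
    if not passage:
        raise ValueError("selection contains only whitespace")
    # Include up to `window` characters on each side of the selection.
    lo = max(0, start - window)
    hi = min(len(block_text), end + window)
    return TutorContext(block_id, passage, block_text[lo:hi])
```

The same passage could equally be handed to the "Read aloud" action; only the destination of the payload would differ.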

Code is truth
  1. Neural TTS via Azure Speech adapter; voice catalogue filterable by locale, gender, and style.

  2. Per-block audio cached in tenant blob storage with hash-based invalidation; cache hits return instantly without re-synthesis.

  3. Word-level timing metadata returned alongside audio; client renders synchronised word highlights during playback.

  4. Floating audio control with 0.75× / 1× / 1.25× / 1.5× / 2× speed cycling.

  5. TextSelectionTooltip: floating pill on any text selection within a lesson, with "Read aloud" and "Ask Arthur" actions.

  6. Per-student TTS preferences persisted (voice, speed, auto-play); preference changes audited.
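Two of the items above, word highlighting and speed cycling, reduce to small pieces of logic that can be sketched as follows. This is an illustration under assumptions: the `WordTiming` shape, millisecond units, and the function names are invented for the example and are not taken from the product's code.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class WordTiming:
    word: str
    start_ms: int  # assumption: timings are expressed in milliseconds
    end_ms: int


def active_word_index(timings: List[WordTiming], position_ms: int) -> Optional[int]:
    """Return the index of the word being spoken at position_ms, if any."""
    # Timings are assumed sorted by start time, so a binary search applies.
    starts = [t.start_ms for t in timings]
    i = bisect_right(starts, position_ms) - 1
    if i < 0:
        return None  # playback has not reached the first word
    if position_ms >= timings[i].end_ms:
        return None  # in a pause between words, or past the last word
    return i


SPEEDS = (0.75, 1.0, 1.25, 1.5, 2.0)


def next_speed(current: float) -> float:
    """Cycle to the next playback speed, wrapping from 2x back to 0.75x."""
    if current not in SPEEDS:
        return 1.0  # assumption: unknown values reset to normal speed
    return SPEEDS[(SPEEDS.index(current) + 1) % len(SPEEDS)]
```

If the playback position is read in media time rather than wall-clock time, the lookup should not need adjusting when the speed changes, since the timings refer to positions within the audio itself.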