Man Vs. Machine For Voice Over In eLearning - Part 2: How TTS Technology Can Enhance eLearning

Summary: Authoring tools like Captivate and Storyline include Text to Speech (TTS) in their environments, but usually only for English and not always with “tune-able” results. This article explores TTS technology to see what the future can bring and how it can impact the eLearning experience.

Enhancing eLearning With TTS Technology

In Man Vs. Machine For Voice Over In eLearning, Part 1, we explored the need for voice actors and why they do best when presented with directions and scripts that have real guidance. In part 2 of this series, we'll explore how TTS technology can enhance eLearning initiatives.

Let's step back and look at the history of voice technology.

The Evolution Of Voice Technology

TTS is a relatively ubiquitous technology in the world of telecom. It was first introduced in 1939 at the World’s Fair by Bell Labs. The New York Times wrote, describing the machine's operation, "My God, it talks." Talking machines have been evolving ever since. In 1962, John L. Kelly created a “vocoder” speech synthesizer and recreated the song “Bicycle Built for Two”. Arthur C. Clarke happened to be visiting a friend at the lab and caught the demonstration; it made it into his novel and the subsequent screenplay for “2001: A Space Odyssey”, where the iconic supercomputer, the HAL 9000, sings it as he is deactivated. Machine voices have at times fascinated us and, as they got better at imitating our voices, sometimes terrified us.

Speaking machines are no longer science fiction. Some of us interact daily with intelligent agents like Siri and Alexa, and Google’s driving directions aren’t just for getting around Silicon Valley. They are a part of our lives. Interactive Voice Response [IVR] systems have really been the foundation for Machine Voice. They replaced operators in call centers, and they can now listen, talk, repeat bank statements, take payments over the phone, and do just about anything a human employee can. For eLearning, we really need to ask: “Are we ready to replace voice actors with machines?”

These systems are not perfect. They have at times been deeply flawed and, in the past, sounded primitive. We also tend to forget how rapidly technology advances. We still treat Machine Translation and Text to Speech [TTS] as if we had just landed on the moon, forgetting that this technology is almost 80 years old. A public pay phone is a rare sight these days; telephones are in our pockets. In short, it’s a good time to re-assess the state of technology around voice systems. Talking machines were improved by way of Artificial Intelligence programs in telecom. TTS had a “normal” development cycle until 2015. Then it converged with Machine Learning and Big Data, and the old problem of generating speech was revisited by AI practitioners. In 2016, Natural Language Processing and lots of data made TTS smarter. More has changed in TTS in the last 3 years than in the previous 75.

Focusing on the phone for a moment, both Android and iOS have entire languages set up to understand you and talk back to you. Unfortunately, you have probably also received unsolicited calls where the entire operation was machine-run, including the amazing new offer that you stopped listening to the second you realized it was a recording. Some of these calls stop and say “Can you hear me?” or wait for your reply like a human would. That type of automated, scripted interaction is a mix of AI and TTS. But is that good enough for eLearning?

Why Voice Matters

Let’s set aside the AI-logic [which makes an interesting article on its own] and focus on the delivery vehicle: Voice.

If we go back to the main premise of having communication on at least two fronts, voice and text, then yes, it checks the box where you have spoken words. But there are many components to voice:

  • Should it be male or female? Should it be recorded in both, or in a voice that is indistinguishable as either?
  • What kind of tone? Should it be excited, relaxed, or flat?
  • What breathing pattern and pace? Fast, slow, or rhythmic?
  • What type of pronunciation or accent? Southern, Canadian, etc.?

Think back to the early days of driving with Google as your navigator. Remember when the voice would mangle the names of streets or totally mispronounce cities? Or what about when the navigator says "Recalculating" and you feel as though the app is mad at you for not taking that left turn? It is often perceived as personal because the TTS system is overly impersonal.

Speech Synthesis Markup Language [SSML]

TTS has a solution for that. It’s called Speech Synthesis Markup Language [SSML], and it allows for emphasis, substitutions, the use of phonemes, and other tricks.

With modern TTS systems, telling the machine how to pronounce "tomato" is easy. You simply tell it to say "Toe-may-toe" or "Toh-ma-tah." In the southern US, the pronunciation of "pecan" [as in pecan tree] may be taught this way:


You say, <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>.

I say, <phoneme alphabet="ipa" ph="ˈpi.kæn">pecan</phoneme>.


The tool is the phoneme, which is best defined as a building block for the sounds made by people. The funny alphabet is the International Phonetic Alphabet; it captures the sounds that human voices [mouths, lips, etc.] make. You can encode just about any human-made sound and play it back.

If the word is the name of a company, a brand, or a person, it’s important to have a pronunciation guide for what it “should” be called. Sometimes the TTS system will guess at the pronunciation of a word based on its training, and that can be bad if it’s a well-known name. Also, some words are pronounced differently depending on how you use them: “bass” is a fish or a type of musical instrument. You can now specify, to a very precise degree, how things should sound.

These systems are customizable in several ways: the models of language, the voices and sounds generated, and modelling around other speakers. Speech Synthesis Markup Language adds several further customizations around:

  • Pauses in reading
  • Rate of speech
  • Pitch of voice
  • Length of vocal tract [deeper voice]
  • Language used [useful for when reading English or foreign names]
  • Pronunciation “fixes” with phonemes using phonetics
  • Visual sync with lip movements [visemes]
  • Parts of speech [“will you read the book” vs. “I already read the book”]
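
As a rough illustration of a few items in the list above, here is a minimal Python sketch that assembles an SSML document with a pause, a rate and pitch change, and a phoneme-based pronunciation fix. The helper names [pause, prosody, phoneme, speak] and the sample sentences are hypothetical and not part of any authoring tool; the tags themselves [speak, break, prosody, phoneme] come from the W3C SSML specification. Individual TTS engines support different subsets of SSML, so treat this as a sketch to adapt rather than something any given engine is guaranteed to accept.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def pause(ms: int) -> str:
    """A pause in reading, in milliseconds (hypothetical helper)."""
    return f'<break time="{ms}ms"/>'


def prosody(text: str, rate: str = "medium", pitch: str = "medium") -> str:
    """Wrap text with a rate-of-speech and pitch setting (hypothetical helper)."""
    return (
        f"<prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)}>"
        f"{escape(text)}</prosody>"
    )


def phoneme(text: str, ipa: str) -> str:
    """Pronunciation 'fix' using an IPA transcription (hypothetical helper)."""
    return f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(text)}</phoneme>'


def speak(*parts: str, lang: str = "en-US") -> str:
    """Join the fragments inside a single <speak> root element."""
    body = " ".join(parts)
    return (
        '<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(lang)}>{body}</speak>"
    )


# Sample narration; the sentences are made up for illustration.
ssml = speak(
    escape("Welcome to the course."),
    pause(500),
    prosody("Please read each question carefully.", rate="slow", pitch="low"),
    escape("I say"),
    phoneme("pecan", "ˈpi.kæn"),
)

# Sanity check: raises xml.etree.ElementTree.ParseError if the markup is not well-formed.
ET.fromstring(ssml)
print(ssml)
```

Building the markup through small helpers, then parsing it once before sending it anywhere, catches unbalanced tags early, which matters when a script is edited by several people.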

Clearly, there are options. But how do you choose between a human voice actor and TTS?

The factor that typically pushes one toward TTS is simple: demand is greater than capacity. That is, the amount of voice over work is greater than the ability to hire human actors. This doesn't mean that all jobs are split up this way, only that some jobs can only be served by TTS.

TTS systems tend to be customized for the lexicon [dictionary], and a few hours of engineering time are spent fixing “bugs” for each 30 minutes of audio. Still, this cost is substantially less than the traditional booking of talent.

The flipside is that humans have prosody. That is the term for the naturally varying speech patterns and differences in intonation, pitch, speed, loudness, etc. that give richness to the voice. Prosody is 100% available with a studio session. However, it’s not so available in TTS unless you put in hours of work on minutes of audio.

The recommendation is to ask an expert in eLearning and also to validate the cost/benefit of being available in more languages. Most learners will probably forgive the TTS if it means they can listen to the lesson instead of reading a transcript.

In other cases, a professional voice over gives the lesson a certain level of polish that’s hard to replace, but this comes at a cost. One important observation is that voice over costs less per minute at scale, up to a point.

Book ten minutes of studio time and the talent will be there for an hour. So, why not record 25 minutes? Or 30? These additional minutes get “bundled” into the basic “show up” fee, and the cost per minute goes down the more you do. It’s like when you buy an extra-large pizza and share it with everyone: you end up with bulk savings. For individual Instructional Designers, this could mean bundling up 2-3 courses at a time. For organizations that have learned how to coordinate language launches, this is common practice. If you record all the Japanese courses at once, you pay less overall than if you had done them one by one.

Unfortunately, getting the stars to align on multiple projects doesn’t always happen, but it’s still a valid cost optimization strategy. As for TTS? That's not really the same. It's almost a flat rate: the more minutes, the more engineering. Some optimization may happen, but there’s never a booking fee to deal with, and adding bits and parts doesn’t incur the same initial costs.
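
To make the difference between the two cost curves concrete, here is a small Python sketch. Every number in it [booking fee, per-minute studio rate, engineering hours per 30 minutes of audio, hourly engineering rate] is a hypothetical placeholder chosen only to show the shape of the curves, not a real price. The studio model assumes a fixed “show up” fee plus a per-minute rate; the TTS model assumes engineering time that scales linearly with audio length and no booking fee.

```python
def studio_cost_per_minute(minutes: float, booking_fee: float, rate_per_minute: float) -> float:
    """Fixed booking fee spread over the minutes recorded, plus a per-minute rate."""
    return (booking_fee + rate_per_minute * minutes) / minutes


def tts_cost_per_minute(minutes: float, eng_hours_per_30_min: float, hourly_rate: float) -> float:
    """Engineering effort grows linearly with audio length; there is no booking fee."""
    total = (minutes / 30) * eng_hours_per_30_min * hourly_rate
    return total / minutes


# Hypothetical figures, for illustration only.
BOOKING_FEE = 300.0       # flat "show up" fee for the talent and studio
STUDIO_RATE = 10.0        # per finished minute of audio
ENG_HOURS_PER_30 = 3.0    # "a few hours" of engineering per 30 minutes of audio
ENG_HOURLY_RATE = 60.0

for minutes in (10, 30, 60):
    studio = studio_cost_per_minute(minutes, BOOKING_FEE, STUDIO_RATE)
    tts = tts_cost_per_minute(minutes, ENG_HOURS_PER_30, ENG_HOURLY_RATE)
    print(f"{minutes:>3} min  studio: {studio:6.2f}/min  TTS: {tts:6.2f}/min")
```

Under these assumptions the studio cost per minute falls as more minutes are bundled into one booking, while the TTS cost per minute stays flat. Plugging in real quotes is the only way to see where the two actually cross for a given project.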

The Future Of TTS Is Now

For the last few years, Google and Microsoft have been experimenting with custom language variants, where you can provide a voice model and it’s grafted onto a TTS. Imagine a way to re-take and redo scenes in movies after the actor has left the production, or to correct flubs in takes that would otherwise be perfect. In November of 2016, Adobe unveiled a technology called “VoCo” at an event with a guest actor. At this event, they took the voice of the presenter, actor Jordan Peele, and showed him “photoshop for voice”. The technology could imitate the actor saying anything. It faced a large backlash from people concerned about its potential for misuse. Mark Randall, VP of Creativity at Adobe, replied saying:

“That’s because, at its core, technology is an extension of human ability and intent. Technology is no more idealistic than our vision for the possible nor more destructive than our misplaced actions. It’s always been that way.”

There hasn’t been anything else published on the project since then.

Also, in September of 2016, Google's DeepMind released WaveNet. Unlike the traditional “ransom letter” style of TTS voice output, in which snippets of audio are jumbled into words, it was actually modelled after real speech and sounded like it. This Neural Network speech generation technology is what most modern TTS systems use today. But cloning voices and altering recorded speech by typing in different words are yet to come. There is also work on the lip-sync and dubbing side, where adding computer vision [reading lips] to transcription, or using the “faked” clone voice to drive cloned “lip movements”, further erodes the ability of humans to be the gold standard for voice over.

Recently, we have been able to “patch” audio by using TTS to fix small errors in a voice over with an edit. Patching is nothing new for audio editing, but what is new is that we no longer have to bring in the talent to re-record a line in the eLearning course. Stand-alone words like “Next” or “Question 2” are also safe enough in an eLearning test environment that TTS is perfectly suited to deliver in 1 hour what would take a studio 2 hours, plus the time to find a talent [days]. These patches are limited, since for a long utterance a voice actor still outperforms TTS.

TTS is also changing the overall landscape for voice artists. A startup in Montreal has been developing a “voice banking” tool. Imagine if your entire eLearning catalog was voiced by your charismatic training director. How could you keep making more voice overs than her schedule allows? How about after she’s long gone and you still want to use her voice? It’s now possible to create a model of a real person’s voice and then use that in TTS. Like the Adobe example, it’s open to ethical questions which we are barely starting to ask. Does the compensation model become a royalty model? Does the voice of the artist become intellectual property for the business that created it, with total rights?

Currently, the solutions for voice banks involve preserving people’s voices when they are facing an illness such as cancer that would cost them their ability to speak. Famously, the movie critic Roger Ebert lost his voice, but through an early version of this technology it was rebuilt from hours of audio that he had produced. These projects used to be monumental efforts requiring months of recordings and engineering. With the advances of the last 2 years, they now take only 2 hours of voice recordings and a few hours of processing.


For eLearning voice overs, the status quo will hold for the next few years, until TTS technology becomes ubiquitous and “voice repair” options become mainstream for “retakes”. Then, much as has happened with other automatable tasks, calling the voice actor back for a “redo” will become less likely. In addition, premium voice banks will be sold or made for niche markets, and they will sound like real people. Those actors will still have a profession and the ability to license their voices.

Some TTS systems today work on a license model [think automated systems like elevators] where the same recording will be used a million times. For eLearning, these external elements won’t make a huge difference, except to reduce the cost of entry to certain markets and to make maintenance of annual mandatory training less expensive, since the same voice can be edited and new details added in minutes for all languages.

Many courses today are perfectly happy to include TTS not just as an assistive feature [think screen reader for the blind], but as more of a standard voice. Eventually, the quality will be good enough that narration will become as ubiquitous as “color graphics” or “air-conditioning” or anything else that was once future high-tech from the World’s Fair.