Man Vs. Machine For Voice Over In eLearning, Part 1

Summary: The complexities of communication add to the dynamics of man vs. machine when it comes to learning, translation and speech.

Man Vs. Machine For Voice Over In eLearning: How To Decide Between The Two Solutions

Before we examine the advantages of humans vs. machines for speech, we should first ask why we are adding speech to our eLearning content at all. This is a two-part article covering the well-established human Voice Over (VO) and the relatively new Text To Speech (TTS) systems: this first part focuses on human voice over, and the second on TTS. We cover the pros and cons of each method and advances in the field.

Audio And Writing In eLearning

Let's first consider why we add audio to eLearning content. The premise is simple: adding another mode of communication can enrich the learner's experience. Color, graphics and sound can give cues and reinforce concepts that would otherwise be ambiguous or too densely packed into words. Everyday phrases such as "sounds good" and "I hear you" even use hearing as a stand-in for understanding.

It is commonly taught that you should “write how you speak”, but good writing is distinct from good speaking: the rules are different. You don’t write a script the way you write a passage in a book, at least not if you want it to sound the least bit engaging. The main culprit is punctuation. A comma, a period and a hyphen each sound distinct in the spoken word, yet most of them go unnoticed in on-screen eLearning text. In writing, we tend to be unaware of how these small changes affect meaning and speech. Speaking, however, has a clear intent and outcome, which makes our eLearning more engaging; for most audiences, that is non-negotiable. With text and voice, you have two modes of communication working together to produce an even better result: more engaging and richer in meaning. It’s also easier to get away with a “bad” script if you happen to be both the author and the narrator.

Scripts for eLearning are not exceptionally difficult to author or to produce as a voice over. You must pace the delivery of the content and make sure the person doing the voice over sounds like they know what they are talking about, and ideally that they have taken the time to learn how to pronounce the terms. That, a good cardioid microphone, and a quiet room are all you need.

The Cost Of Professional Touch

Whenever something has so few, simple requirements, it usually means that it is nearly impossible to do well. The difference between someone casually recording at their desk and a booked studio artist is night and day; even the casual learner notices it. Voice artists don’t just magically appear. They train, go to coaches, learn accents, unlearn accents, and take breathing and acting lessons, and they practice for years. They are rare, they are expensive, and many deserve recognition for their talent and hard work. They are also unlikely to be used on most training courses, because the project can't afford them or they aren’t available. So people typically go with a slightly less qualified (or less prestigious) talent, or with whoever can record in Japanese next Tuesday at 10 AM: we respect the art, but deadlines are real too.

Voice talent ranges from home recorders available online (cheapest) up to professional studios (most expensive), whose top-end equipment requires a physical trip to the studio. Costs rise gradually with the individual talent and their recording infrastructure, then spike suddenly into “broadcast” territory. This is because a very talented artist can make a living from a series or from commercial broadcasting rights; for some talents, eLearning courses simply take up too much valuable time. This wide range of resources is part of the reason most companies engage voice over services with a very limited scope: it quickly reveals itself to be a complex and costly environment.

The Factor Of Availability

Adding to this mix of variables is simple availability: people have limited schedules, and they get sick or go on vacation. So it's always best to have at least one alternate voice in mind, just in case. On the positive side, courses with some variation in speakers can use it to keep learners engaged. It's not the same person as last time, it’s slightly different than expected, so what else is new? Again, this is best planned rather than stumbled upon. A typical voice artist can record for about 3 to 4 hours before needing a break; in longer engagements (multiple courses, for example), they might need rest days or risk real damage to their voice and their livelihood.

The challenge usually ends up being finding the right talent for the job, where cost, availability, accent, gender, and language all intersect. It is not easy, but it is perfectly doable. You can also avoid extra costs if you source through a studio and negotiate flat-rate pricing. This works particularly well when you are not “picking the voice of the brand” or supporting marketing efforts, which in eLearning you aren’t. Choosing talent shouldn’t be a big Hollywood casting call with dozens of rounds and qualifying conditions: most, if not all, voice over artists are professional and capable enough to do an eLearning course, and with the right directorial guidance, it’s easy to get it right.

The Instructional Designer Is The Voice Talent: Pros And Cons

Sometimes the voice talent is the Instructional Designer: they know the subject and what to convey, and sometimes they even practice! Many companies and departments are completely fine without a professional studio; there’s no time, the budget is tight, and the record button is right in front of you.

Where this approach falls short is when you must expand the audience to other countries. Who’s going to learn Japanese in a weekend? How will you know whether the French talent’s accent is highly regional and the Paris office will laugh at your presentation because it sounds “funny”? Maybe you are lucky and have offices where someone speaks the language and owes you a favor, but there is a better way.

The Machine Solution

Before we leave the human world, let's recognize that, with the proper directorial input, an actor can portray anything: doubt, fear, sarcasm, and any other emotion that is properly described. It is an infinite palette of technique, perhaps mechanically repeatable, but not yet something that can be generated spontaneously.

What Will The Use Of A Human Voice Actor Bring?

Voice acting is a learned skill, practiced since the first radio broadcast and, like acting in general, since the first time someone got on a stage. It’s an art form, a talent, and a skill, and animation is generally where most of the lucrative work is. Animation, like comic book heroes, is no longer a niche backwater: it's a large industry, with many traditional actors also doing voice over work. The skills transfer, and they don’t have to wear fancy makeup unless they want to. The quality of output from these actors is as variable as their experience and background, and the director can make a huge difference as well. The director can steer the recording session to make things sound “like a radio commercial”, solemn like “a serious message”, or anything in between.

Keep in mind that the actor and director both need guidance: how is this supposed to sound? What is it for? These requirements are sometimes not captured in the script or even in a general request; sometimes it is just “record these slides”. Consider saying instead “record these slides so that people know the rules that keep everyone safe at our company” or “record this training about how to use the new system that saves everyone hours of work”. A brief summary, position statement, tone, or anything else that the end listener (your learner) will appreciate is enough. Using skilled talent without direction underutilizes a valuable resource, and it is very simple to address with a few short sentences.

We’ve talked about how human voice talent is expressive and can be molded into almost anything across the range of emotion and delivery. But what about those robot voices, the TTS systems?

In part two, we will look at how TTS systems can not only take on accents but also build sounds and copy lip movements (lip reading, dubbing), and at how to address the shortcomings of the technique.