Creating a Natural Voice Using Text to Speech


If you’ve used one of the newer text to speech (TTS) services, you’ve heard how much this industry has improved in the past decade. Today’s voices are far more lifelike than the ones most people associate with “text to speech.” When you’re working with TTS, you can produce even better-quality files by following a few simple steps.

Work sentence by sentence

Most high-quality TTS editors can generate several sentences at once, but if you’re determined to get the best sound, try generating one sentence at a time. You’ll often hear a clear improvement in both intonation and pausing when you work through each sentence individually. Plus, with one clip per sentence, it’s easier to add silences between sentences in post-production (more on this below).
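
If your script lives in one long block of text, a few lines of Python can break it into sentences before you enter each one into your TTS editor. This is only a rough sketch: `split_sentences` is a hypothetical helper, not part of any TTS product, and its simple rule (split after a period, question mark, or exclamation point that is followed by whitespace) will wrongly split on abbreviations such as “Dr.”, so check the result by eye.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split text after '.', '!' or '?' when followed by whitespace.

    Naive on purpose: abbreviations like "Dr." will be split too.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Drop empty strings (for example, when the input is empty).
    return [part for part in parts if part]


script = (
    "Text to speech has improved a lot. Have you tried it lately? "
    "Work through your script one sentence at a time!"
)

# Print each sentence on its own numbered line, ready to generate one by one.
for number, sentence in enumerate(split_sentences(script), start=1):
    print(number, sentence)
```

Each printed line can then be generated as its own clip.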

Add silences

Silences between words and sentences create rhythmic, natural-sounding speech. As living, breathing beings, voice actors take natural pauses to inhale. In your TTS editor, you can cue the artificial intelligence (AI) to replicate these pauses by adding commas, periods, dashes, and ellipses. Think of these punctuation marks as percussive notes, not as grammatical tools, and you’ll be well on your way to generating natural AI voice recordings.

Let me give a brief example. For this first clip, I entered the following text into WellSaid Studio, using punctuation in a grammatically minded way:

Text to speech is a scalable alternative to traditional voice acting. 

Created using WellSaid Studio

Now, listen to the same sentence with percussive punctuation marks added to create an appealing rhythm. Notice how the sentence, while grammatically incorrect, has a natural-sounding cadence to it:

Text to speech, is a scalable alternative, to traditional voice acting. 

Created using WellSaid Studio

Use inventive spelling

Modern TTS services are built on neural networks. As a result, they work predictively, and this means they sometimes mispronounce words. Often this happens with homographs, words that are spelled the same but pronounced differently. Think about “read” as in “I can read!” and “read” as in “I haven’t read this book yet.” Abbreviations like “CEO” or “USC” are also frequently mispronounced. A neural AI voice may read these as funny short words rather than pronouncing the letters.

To get the right results, spell phonetically. You’ll sometimes need to be explicit with the text to speech editor about how you want a word pronounced, just as you would do with a voice actor. “Read” might need to be entered as “reed,” and “CEO” as “see eeh oh.” 
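
If the same abbreviations keep coming up in your scripts, you could keep your phonetic spellings in one place and apply them before generating audio. The Python sketch below is one possible way to do that. The `respell` helper and the `RESPELLINGS` table are hypothetical names, and the “USC” spelling is only a guess for illustration; tune every entry by ear. Words like “read” are left out deliberately, because the right spelling depends on the sentence.

```python
import re

# Hypothetical respellings. "see eeh oh" comes from the article;
# the "USC" entry is an illustrative guess. Adjust by ear.
RESPELLINGS = {
    "CEO": "see eeh oh",
    "USC": "you ess see",
}


def respell(text: str, respellings: dict[str, str]) -> str:
    """Replace whole words listed in `respellings` (case-sensitive)."""
    if not respellings:
        return text
    # Longest keys first, so a longer entry is tried before a shorter one.
    keys = sorted(respellings, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda match: respellings[match.group(1)], text)


print(respell("Our CEO spoke at USC today.", RESPELLINGS))
```

Because the match is on whole words only, a word such as “CEOs” is left untouched; add it as its own entry if you need it.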

Play with intonation

Punctuation marks don’t just add pauses; they also change intonation. If you want a specific word emphasized, try putting it in quotation marks. If you want a different intonation than the one you’re hearing, try seLECTive caps or ALL caps. You can also insert commas and periods before or after the word you want emphasized, as long as the resulting pause is acceptable.

Using the same example sentence I showed you above, I added some intonation marks to achieve a more lively rendering. “Scalable” is unusual enough that the editor needs a little help, so I entered “scaelable” to prompt the right phonemes.

Here’s the sentence and the audio result:

Text to speech, is a scaelable alternative, to “traditional” VOIce acting. 

Created using WellSaid Studio

Edit in post-production

You don’t need to be an expert to give your WAV files a final polish in a sound editor. Many basic, inexpensive audio editing apps let you add pauses in post-production. Add some silence at the start of your clips to mimic a voice actor’s inhale. Add a small amount of silence between your clips as well, and you’ve got a quality, human-sounding audio production on your hands.
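
If you’d rather script this step than do it by hand, Python’s built-in `wave` module can stitch clips together with silence. The sketch below is a minimal example under a few assumptions: the clips are uncompressed PCM WAV files that share the same channel count, sample width, and sample rate, and the function name `join_with_silence` and its default pause lengths are my own placeholders, not recommended values.

```python
import wave


def join_with_silence(clip_paths, out_path, lead_in_s=0.3, gap_s=0.4):
    """Join WAV clips, adding silence before the first clip and between clips."""
    if not clip_paths:
        raise ValueError("clip_paths must contain at least one file")

    audio_format = None
    clips = []
    for path in clip_paths:
        with wave.open(path, "rb") as clip:
            params = clip.getparams()
            fmt = (params.nchannels, params.sampwidth, params.framerate)
            if audio_format is None:
                audio_format = fmt
            elif fmt != audio_format:
                raise ValueError(f"{path} has a different audio format")
            clips.append(clip.readframes(params.nframes))

    nchannels, sampwidth, framerate = audio_format
    # 8-bit WAV samples are unsigned, so silence is 0x80; wider samples use 0x00.
    silent_byte = b"\x80" if sampwidth == 1 else b"\x00"

    def silence(seconds):
        frame_count = round(seconds * framerate)
        return silent_byte * (frame_count * nchannels * sampwidth)

    with wave.open(out_path, "wb") as out:
        out.setnchannels(nchannels)
        out.setsampwidth(sampwidth)
        out.setframerate(framerate)
        out.writeframes(silence(lead_in_s))  # the "inhale" before speaking
        for index, data in enumerate(clips):
            if index > 0:
                out.writeframes(silence(gap_s))  # pause between sentences
            out.writeframes(data)
```

For example, `join_with_silence(["sentence1.wav", "sentence2.wav"], "final.wav")` would write one file with a short pause before and between the two clips (the file names here are placeholders). Adjust the pause lengths by ear until the pacing sounds natural.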

Credits

Photo by palesa on Unsplash
Music by purple-planet
