Synthesis of Emotional Speech using RP-PSOLA

Authors: Vine, D.S.G. and Sahandi, R.

Start date: 13 April 2000

Publisher: IEE

Whilst TD-PSOLA remains an adequate solution for neutral speech synthesis, it is less suitable for emotional speech styles, which require more extreme pitch manipulation. By reducing the extent of the necessary pitch manipulation, distortions and artefacts introduced by TD-PSOLA could potentially be lessened. To accomplish this, a method for recording concatenative units with f0 values similar to the target intonation has been devised. This technique, termed reference pitch prompted recording, involves a speaker recording concatenative units at a set pitch. The speaker is guided by a 'reference pitch prompt' (RPP), which is a monotonic, hummed note. In RP-PSOLA (reference pitch-PSOLA) synthesis, RPP-recorded units such as syllables are concatenated and an intonation contour is applied using TD-PSOLA. RP-PSOLA can be extended so that several versions of each syllable are recorded, each at a different pitch. In this synthesis technique, termed multiple pitch RP-PSOLA, syllables are selected from an inventory to approximate the target f0 contour and are then concatenated. This paper compares the RP-PSOLA and multiple pitch RP-PSOLA synthesis methods in terms of the perceived distortion in emotional synthetic sentences, via a listening experiment. The results showed that multiple pitch RP-PSOLA is perceived to produce marginally less distorted synthetic speech than RP-PSOLA overall.
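
The selection step in multiple pitch RP-PSOLA can be illustrated with a minimal sketch. The abstract does not state the selection criterion, so the code below assumes a simple nearest-pitch rule measured in semitones. The function names, the inventory layout, and the example frequencies are all hypothetical. The reported scaling factor only indicates how much pitch modification would remain if the selected unit were adjusted to the exact target; whether the method applies such an adjustment is not stated here.

```python
import math


def semitone_distance(f_a, f_b):
    """Absolute pitch distance between two frequencies, in semitones."""
    return abs(12.0 * math.log2(f_a / f_b))


def select_units(targets, inventory):
    """Pick, for each (syllable, target_f0_hz) pair, the recorded version
    whose pitch is closest to the target (assumed nearest-pitch rule).

    Returns a list of (syllable, recorded_f0_hz, residual_scale_factor).
    """
    selection = []
    for syllable, target_f0 in targets:
        recorded_pitches = inventory[syllable]
        best = min(recorded_pitches,
                   key=lambda f: semitone_distance(f, target_f0))
        # Residual factor: how far the chosen unit still is from the target.
        selection.append((syllable, best, target_f0 / best))
    return selection


# Hypothetical inventory: each syllable recorded at three reference pitches.
inventory = {"hel": [100.0, 150.0, 200.0], "lo": [100.0, 150.0, 200.0]}
targets = [("hel", 190.0), ("lo", 110.0)]

for syllable, recorded, factor in select_units(targets, inventory):
    print(f"{syllable}: recorded at {recorded:.0f} Hz, residual scale {factor:.2f}")
```

With a single recorded pitch per syllable (plain RP-PSOLA), the same residual factor would have to cover the whole distance to the target, which is the manipulation the multiple pitch variant aims to reduce.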

