Unit selection for Malay text-to-speech system using segmental context and simulated annealing

1 Center for Biomedical Engineering, Transport Research Alliance, Universiti Teknologi Malaysia, 81310 UTM Skudai, Johor DT, Malaysia. 2 Center for Biomedical Engineering, Faculty of Biomedical Engineering and Health Science, Universiti Teknologi Malaysia, 81310 UTM Skudai, Johor DT, Malaysia. 3 Mathematics Department, Faculty of Science, Universiti Teknologi Malaysia, 81310 UTM Skudai, Johor DT, Malaysia.


INTRODUCTION
Since the launch of the Multimedia Super Corridor (MSC) project in Malaysia, information and communication technology (ICT) has grown rapidly. As a result, the computer system has become increasingly important as an information and communication medium. In addition, human-computer interaction systems involving speech recognition, speech synthesis and related technologies have also grown tremendously, resulting in many applications being developed and commercialized. For instance, Microsoft recently launched Office XP, which is able to pronounce (read aloud) text input using a speech synthesis engine. Indeed, speech synthesis has been very useful in various areas such as telephone speech, in-car applications, public information systems, education assistance tools, robotics and email reading (Mangold, 2001; Salleh et al., 2002). The text-to-speech (TTS) system is also useful for people with physical disabilities. For example, speech synthesis has been used as a reading and communication tool for the visually impaired. The first commercial TTS system was the Kurzweil Reading Machine for the blind, introduced by Raymond Kurzweil in the late 1970s (Klatt, 1987). For the hearing impaired and vocally handicapped, the TTS system has been used as a tool for communicating with people who do not know sign language (Tan et al., 2007a; Tan et al., 2007b; Gold and Morgan, 2000). Another application of the TTS system is the helpful automatic machine for language and emotional talk (HAMLET), which was developed to help users express their feelings (Tan et al., 2008d; Lemmetty, 2001).
Speech synthesis, or text-to-speech, is the process of transforming plain text into computer-generated synthetic speech (Hasim et al., 2006). There are two types of speech synthesis methods: parameter synthesis and concatenative synthesis (Hirai and Tenpaku, 2004). The concatenative approach is based on the idea of recombining natural prosodic contours and phoneme sequences using a superpositional framework (Jan et al., 2005).
Corpus-based concatenative synthesis has become the major trend recently because of its highly natural speech quality (Tan and Sheikh, 2008a, 2008b; Sakai et al., 2008). Unit selection is the main component of a text-to-speech synthesis system. It produces highly intelligible, near-natural synthetic speech (Tan and Sheikh, 2008c; Tsiakoulis et al., 2008). This method creates speech by re-sequencing pre-recorded speech units selected from a very large speech database (Cepko et al., 2008). Synthetic speech is produced by searching through the large speech database (corpus) and concatenating the selected units to form the output signal. The selection of speech units from the database is based on minimization of the target cost and the concatenation cost (Clark et al., 2007; Díaz and Banga, 2006; Hunt and Black, 1996). This approach is superior to formant and articulatory synthesis because it concatenates natural acoustic units without modification, thus offering better speech quality (Janicki et al., 2008). However, a large database is also costly in terms of database collection, search requirements, and segment memory storage and organization (Chappell and Hansen, 2002). Thus, a robust unit selection method is needed to handle the huge volume of data in the database (Blouin et al., 2002). The Viterbi algorithm (Blouin et al., 2002; Clark et al., 2007; Cepko et al., 2008; Sakai et al., 2008) is commonly used to solve the unit selection problem. However, Viterbi search requires high computational time (Sakai et al., 2008), which slows the generation of synthetic speech. The simulated annealing (SA) algorithm can address the issue of high computational time through adjustment of its control parameters.
The limitation of SA compared with the commonly used Viterbi search is that speech quality degrades when SA is given less search time. The main objective of this research is to investigate the quality of synthetic speech produced using SA compared with the previous synthesis system proposed by Tan and Sheikh (2008a).

MATERIALS AND METHODS
The simulated annealing (SA) algorithm is based on Monte Carlo methods and may be considered a special form of iterative improvement (Manuel, 1997). Kirkpatrick et al. (1983) first proposed SA as a method for solving combinatorial optimization problems. In general, the SA algorithm solves combinatorial optimization problems by generating a sequence of moves at descending values of a control parameter (Jeong and Kim, 1990). The aim of SA is to choose a good solution to an optimization problem according to some cost function on the state space of possible solutions (Rose et al., 1990). SA is a generalization of the local search algorithm. In its iterative process, SA accepts non-improving neighboring solutions with a certain probability (Turgut et al., 2003) to avoid being trapped at a poor local optimum, whereas other iterative improvement algorithms accept only cost-decreasing solutions.

Procedure of unit selection
Unit selection starts with the segmental context matching process. The selection of an appropriate speech phoneme unit takes into consideration the match of the left and right segmental context (textual content) for each required phoneme. For example, synthesizing the word "nasi" requires the phonemes /n/, /a/, /s/ and /i/. The required left and right segmental contexts for phoneme /s/ are /a/ and /i/, respectively. When unit selection searches for candidate units for phoneme /s/, it retains only the units with the desired surrounding segmental context. This is called segmental context target cost minimization and is the first step of unit selection. Next, the retained candidate units go through join cost minimization.
First, the initial configuration, including the initial temperature and the annealing schedule, needs to be determined. The initial temperature is set at a high level so that almost all moves will be accepted initially. After the initial temperature is chosen, an initial solution is generated and its cost function value is computed. This initial solution is defined as the current solution. Next, a neighboring solution of the current solution is obtained using a local search technique. The cost of the neighboring solution is computed and compared with the cost of the current solution. If the cost is better, the neighboring solution is accepted as the current solution. Otherwise, the new solution may be accepted as the current solution only when Metropolis's criterion, which is based on Boltzmann's probability, is met. The process then continues from the new current solution. The temperature is lowered according to the annealing schedule throughout the algorithm until almost no moves are accepted, and the SA algorithm stops when the stopping criterion is met. Figure 1 shows the SA flow diagram for finding the best speech unit sequence. According to Metropolis's criterion (Metropolis et al., 1953), shown in Figure 2, if the new cost is lower than the cost of the current solution (in the case of minimization), the new solution becomes the current solution. Otherwise, it is accepted with probability P, where P is given by Boltzmann's equation.
The probability of accepting a non-improving solution as the current solution is given by Boltzmann's equation (Equation 1), where T is the temperature and ∆E is the difference between the new cost and the cost of the current solution:

$$P = \exp\left(-\frac{\Delta E}{T}\right) \qquad (1)$$

Next, a random number λ in (0, 1) is generated from a uniform distribution and compared with P. If $P > \lambda$, the new solution is accepted as the current solution. Otherwise, it is rejected.
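The two steps described above, segmental context matching and the Metropolis acceptance test, can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the dictionary layout of a unit (`"left"`, `"right"`) and the function names `filter_by_context` and `metropolis_accept` are hypothetical and do not come from the original system.

```python
import math
import random


def filter_by_context(candidates, left, right):
    """Keep only candidate units whose recorded left and right
    neighbouring phonemes match the required segmental context.
    (The unit layout with "left"/"right" keys is an assumption.)"""
    return [u for u in candidates if u["left"] == left and u["right"] == right]


def metropolis_accept(new_cost, current_cost, temperature, rng=random):
    """Metropolis's criterion for a minimization problem.

    A lower-cost solution is always accepted. Otherwise the solution
    is accepted when P = exp(-dE / T) exceeds a uniform random
    number lambda drawn from [0, 1). temperature must be positive."""
    delta_e = new_cost - current_cost
    if delta_e < 0:
        return True
    p = math.exp(-delta_e / temperature)
    return p > rng.random()


# Hypothetical candidates for phoneme /s/ in the word "nasi":
# the required context is /a/ on the left and /i/ on the right.
units_s = [
    {"id": 1, "left": "a", "right": "i"},
    {"id": 2, "left": "u", "right": "i"},
    {"id": 3, "left": "a", "right": "i"},
]
retained = filter_by_context(units_s, "a", "i")
```

At a high temperature P is close to 1 even for a large ∆E, so almost every move is accepted; as T falls, non-improving moves are accepted less and less often.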

Computation of concatenation cost
To measure the spectral distance between two joined segments, the mel frequency cepstral coefficients (12 coefficients) of the final frame of the current unit are paired with the mel frequency cepstral coefficients (12 coefficients) of the initial frame of the next unit. The Euclidean distance between these two sets of 12 coefficients is then calculated as the spectral distance between the two joined segments. We call this the local concatenation cost, or join cost.
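Let $u = u_1, u_2, \ldots, u_N$ denote the units in the inventory U that minimize the concatenation cost in Equation 2. Equation 2 is used as the spectral distance measure and also as a quantitative measure of the quality of synthetic speech. Obtaining the local concatenation cost requires a parameterization of the units and a distance measure. The concatenation cost between units $u_{i-1}$ and $u_i$ can be written as:

$$C^{c}(u_{i-1}, u_i) = \sqrt{\sum_{j=1}^{12} \left( c_j^{\mathrm{final}}(u_{i-1}) - c_j^{\mathrm{initial}}(u_i) \right)^2} \qquad (2)$$

where $c_j^{\mathrm{final}}(u_{i-1})$ is the j-th mel frequency cepstral coefficient of the final frame of unit $u_{i-1}$ and $c_j^{\mathrm{initial}}(u_i)$ is the j-th coefficient of the initial frame of unit $u_i$.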
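As a small illustration of the local concatenation cost described in this section, the sketch below computes the Euclidean distance between two 12-coefficient MFCC vectors. The function name `local_concatenation_cost` and the constant `N_MFCC` are hypothetical, and the MFCC vectors are assumed to have been extracted beforehand.

```python
import math

N_MFCC = 12  # number of mel frequency cepstral coefficients per frame


def local_concatenation_cost(final_frame_mfcc, initial_frame_mfcc):
    """Euclidean distance between the MFCCs of the final frame of the
    current unit and the MFCCs of the initial frame of the next unit."""
    if len(final_frame_mfcc) != N_MFCC or len(initial_frame_mfcc) != N_MFCC:
        raise ValueError("each frame must provide exactly 12 coefficients")
    squared = sum((a - b) ** 2
                  for a, b in zip(final_frame_mfcc, initial_frame_mfcc))
    return math.sqrt(squared)
```

Two frames with identical coefficients give a cost of zero, which corresponds to no spectral discontinuity at the join.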
The choice of a cooling schedule has an important effect on the performance of the SA algorithm (Ali et al., 2002). For this reason, modifications and improvements have been attempted by tuning the parameters (cooling rate) for a better tradeoff between quality and time. The temperature values are controlled by a cooling schedule that specifies the initial temperature and the decreasing temperature values at each stage of the algorithm. The following geometric function has been taken as the temperature reduction function:

$$T_{k+1} = \alpha T_k$$

where $T_k$ is the temperature at stage k and α is the temperature reduction rate. In this research, four temperature reduction rates were tested: 0.80, 0.85, 0.90 and 0.95. Figure 3 shows the temperature reduction pattern for these rates with a Markov chain length of one. The initial temperature $T_0$ is set relatively high so that most moves are accepted in the early stages and there is little chance of the algorithm being trapped in the region of a local minimum. The initial temperature and final temperature follow Chen and Su (2002). The length of the Markov chain determines how many trials are used at each value of T. Here, the temperature was reduced according to the annealing schedule after every two successive iterations; that is, the Markov chain length is simply fixed at 2, since it is not the main focus of this research.
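The geometric cooling schedule can be sketched as below. This is an illustrative sketch only: the function name `cooling_schedule` is hypothetical, and the initial and final temperatures used in the example (100.0 and 1.0) are placeholder values, not the values used in this research.

```python
def cooling_schedule(t_initial, t_final, alpha, chain_length=2):
    """Return the temperature used at every SA iteration.

    The temperature is held for chain_length iterations (the Markov
    chain length) and then multiplied by the reduction rate alpha,
    until it is no longer above t_final."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if t_final <= 0.0:
        raise ValueError("t_final must be positive")
    temperatures = []
    t = t_initial
    while t > t_final:
        temperatures.extend([t] * chain_length)
        t *= alpha
    return temperatures


# Compare the number of iterations for the four reduction rates,
# using placeholder temperatures and a Markov chain length of 1.
for alpha in (0.80, 0.85, 0.90, 0.95):
    n_iterations = len(cooling_schedule(100.0, 1.0, alpha, chain_length=1))
    print(f"alpha = {alpha:.2f}: {n_iterations} iterations")
```

A reduction rate closer to 1 cools more slowly and therefore spends more iterations searching, which is the quality versus time tradeoff mentioned above.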

Neighborhood generation mechanism
The neighborhood generation mechanism applied here randomly swaps the candidate unit of one phoneme per iteration. Example:

Initial solution
The number in brackets, "[ ]", represents the candidate number of the unit after the matching process (Figure 4). The join cost is computed as:

$$\text{Join cost} = \sum_{i=1}^{n} \text{Local cost}_i$$

where n is the total number of local costs. Local cost here refers only to the concatenation cost.
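The total join cost of a candidate sequence, taken as the sum of the local costs at each join, can be sketched as follows. The function name `join_cost` is hypothetical, and the local cost is passed in as a function so that any distance measure can be used.

```python
def join_cost(sequence, local_cost):
    """Sum the local (concatenation) cost over every pair of adjacent
    units. A sequence of m units has m - 1 joins, so a single unit
    has a join cost of zero."""
    return sum(local_cost(prev_unit, next_unit)
               for prev_unit, next_unit in zip(sequence, sequence[1:]))
```

For example, with a toy local cost `lambda a, b: abs(a - b)`, the call `join_cost([1, 4, 6], lambda a, b: abs(a - b))` adds the two joins 3 and 2.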

Iteration 1
Apply move 1 to obtain a neighborhood solution.

Neighborhood solution
Phoneme 3 is chosen randomly for the swap. Its candidate number is changed randomly, for example from 1 to 7. When phoneme 3 is changed, local cost (i + 1) and local cost (i + 2) change, while local cost (i) and local cost (i + 3) remain unchanged (Figure 5). If this neighborhood solution is accepted as the current solution, another neighborhood solution is generated based on this newly accepted solution.

Neighborhood solution
Phoneme 1 is chosen randomly for the swap. Its candidate number is changed randomly, for example from 1 to 9. When phoneme 1 is changed, only local cost (i) changes, while the other local costs remain unchanged (Figure 6). If this neighborhood solution is accepted as the current solution, another neighborhood solution is generated based on this newly accepted current solution.

Iteration 3 solution
Phoneme 4 is chosen randomly for the swap. Its candidate number is changed randomly from 1 to 8. When phoneme 4 is changed, local cost (i + 2) and local cost (i + 3) change, while local cost (i) and local cost (i + 1) remain unchanged (Figure 7). If this neighborhood solution is rejected, the algorithm goes back to the solution of the previous iteration (the current solution) and generates another neighborhood solution based on it.
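The neighborhood move illustrated in the iterations above can be sketched as follows. This is a hedged sketch: the function names `propose_swap` and `affected_joins`, the 0-based indexing, and the assumption that the new candidate number always differs from the old one are illustrative choices, not details confirmed by the original system.

```python
import random


def propose_swap(solution, candidate_counts, rng=random):
    """Change the candidate number of one randomly chosen phoneme.

    solution[k] is the candidate number (1-based) currently selected
    for phoneme k; candidate_counts[k] is the number of candidates
    retained for phoneme k after context matching.
    Returns (neighbor, position); position is None if no phoneme
    has more than one candidate."""
    swappable = [k for k, n in enumerate(candidate_counts) if n > 1]
    if not swappable:
        return list(solution), None
    pos = rng.choice(swappable)
    alternatives = [c for c in range(1, candidate_counts[pos] + 1)
                    if c != solution[pos]]
    neighbor = list(solution)
    neighbor[pos] = rng.choice(alternatives)
    return neighbor, pos


def affected_joins(pos, n_phonemes):
    """Indices (0-based) of the local costs that change when the unit
    at position pos is swapped; join j links phoneme j and j + 1."""
    return [j for j in (pos - 1, pos) if 0 <= j < n_phonemes - 1]
```

Because only the joins next to the swapped phoneme change, the new total join cost can be obtained by recomputing at most two local costs instead of the whole sum. With five phonemes, `affected_joins(2, 5)` corresponds to the case of phoneme 3, where local costs (i + 1) and (i + 2) change.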

RESULTS AND DISCUSSION
To compare the performance of the proposed system with that of the previous version of the corpus-based Malay text-to-speech system, the join cost in unit selection, which corresponds to perceptual scores, was compared. The proposed system is identical to the previous version except for the unit selection, specifically the computation of join cost minimization. Therefore, a join cost comparison can distinguish the performance of the two systems. Table 1 shows the 40 Malay words selected for join cost comparison, which cover almost the entire Malay phoneme set; the join costs were computed using Equation 2. Join cost I represents the join cost obtained by the previous version of the synthesis system, while Join cost II represents the mean join cost obtained by the proposed system using the SA algorithm. A smaller join cost indicates a smaller spectral discontinuity at the join segments and better quality of synthetic speech. The join cost obtained by the proposed system is better than that of the previous version for 38 words, with an average improvement of 15.48% over the 40 words. This indicates that the words synthesized by the proposed system have smoother join boundaries.
A formal listening test involving 40 Universiti Teknologi Malaysia students with no hearing loss was conducted to evaluate the output sound. Of the listeners, 55% were female and the rest were male. The listeners' ages ranged between 21 and 27, with a mean age of 24 years. Of the listeners, 35% were native speakers of Malay, while the rest were not. The listening test was conducted individually, using headphones and Pentium IV 3 GHz PCs in a quiet room. The testing methods that can be used for a corpus-based Malay text-to-speech system are the modified rhythm test (MRT), the mean opinion score (MOS) and the perceptual test, which is also known as the intelligibility test (Rilliard and Aubergé, 2001).

Conclusion
This paper presents a first version of unit selection using simulated annealing for a corpus-based Malay text-to-speech system. The system has achieved its aim of improving speech quality compared with the previous version of the corpus-based Malay text-to-speech system. The unit selection is based on two cost functions: the target cost and the concatenation cost. In the proposed method, the segmental context is used as the target cost in the matching process. The retained candidate units are used as the input to SA for concatenation cost minimization. The listening test and the join cost values obtained confirm the improvement in speech quality of the proposed system. Therefore, SA is a suitable method for unit selection, since it improves speech quality by selecting the best speech unit sequence within a reasonable computational time. For future research, HMM-based speech synthesis (Zen et al., 2009; Pucher et al., 2010; Tokuda et al., 1995; Chomphan and Kobayashi, 2008) can also be developed for the Malay language, since it has recently gained the attention of many researchers due to its flexibility in generating speech through a parameter generation algorithm. Since the performance of simulated annealing depends highly on its parameter settings, parameter tuning is one possible approach for improving its performance in future research. Other heuristic methods, such as the genetic algorithm and Tabu search, can also be applied to unit selection.

The initial temperature and final temperature in unit selection are set according to Equations 3 and 4, respectively, where $P_0$ is the desired initial probability and $P_f$ is the desired final probability. The parameter values in unit selection are given as follows:

Figure 3. Temperature reduction pattern for various reduction rates with Markov chain length 1.

The 'modified rhythm test' and 'mean opinion score' can be grouped under 'auditory test'. The modified rhythm test, word spotting test and mean opinion score listening test were conducted. The aims of the modified rhythm test are to evaluate the accuracy and verify the intelligibility of the synthetic speech sound, especially in pronunciation. A set of 50 questions with

Table 1. Words selected for join cost comparison.

Table 2. Distribution of 40 words in terms of different magnitude of improvement in join cost.

Table 3. Result of the listening test.