Wednesday, July 4, 2018

Word position context dependency of Sphinxtrain and WFST

An interesting thing about Sphinxtrain models is that they use word position as part of the context when looking up the senone for a particular phone sequence. That means that, in theory, the senones for word-initial phones can be different from the senones for word-internal phones and for word-final phones. It's actually sometimes the case:

ZH UW ER b n/a 48 4141 4143 4146 N
ZH UW ER e n/a 48 4141 4143 4146 N
ZH UW ER i n/a 48 4141 4143 4146 N

but

AA AE F b n/a 9 156 184 221 N
AA AE F s n/a 9 149 184 221 N

Here, in the WSJ model definition from sphinx4, the symbol in the fourth column means "beginning" (b), "end" (e), "internal" (i) or "single" (s), and the remaining numbers are the transition matrix id and the senone ids.
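To make the layout concrete, here is a minimal sketch (not Sphinxtrain code) that parses triphone lines like the ones above and groups them by (base, left, right) context, showing which contexts share senones across word positions and which do not:

```python
def parse_triphone(line):
    """Parse one mdef triphone line.

    Columns: base phone, left context, right context, word position,
    attribute, transition matrix id, senone ids..., trailing 'N'.
    """
    fields = line.split()
    return {
        "base": fields[0],
        "left": fields[1],
        "right": fields[2],
        "wpos": fields[3],
        "attrib": fields[4],
        "tmat": int(fields[5]),
        "senones": tuple(int(s) for s in fields[6:-1]),
    }

lines = [
    "ZH UW ER b n/a 48 4141 4143 4146 N",
    "ZH UW ER e n/a 48 4141 4143 4146 N",
    "ZH UW ER i n/a 48 4141 4143 4146 N",
    "AA AE F b n/a 9 156 184 221 N",
    "AA AE F s n/a 9 149 184 221 N",
]

# Collect the distinct senone assignments per triphone context.
by_context = {}
for line in lines:
    t = parse_triphone(line)
    key = (t["base"], t["left"], t["right"])
    by_context.setdefault(key, set()).add(t["senones"])

for key, senone_sets in sorted(by_context.items()):
    status = "shared" if len(senone_sets) == 1 else "position-dependent"
    print(" ".join(key), "->", status)
```

Running this on the example lines reports ZH UW ER as shared across positions and AA AE F as position-dependent, which is exactly the situation described above.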

However, if you want to build a WFST cascade from the model, it's an issue how to embed the word position into the context-dependent part of the cascade. My solution was to ignore position. You can ignore position in an already-trained model, since the differences caused by word position are small, but to do it consistently it's better to retrain a word-position-independent model.

As of today you can do this easily: the mk_mdef_gen tool supports the -ignorewpos option, which you can set in the scripts. Basically, everything is counted as an internal triphone. My tests show that this model is no worse than the original one, at least for conversational speech. Enjoy.
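For an already-trained model, "ignoring position" amounts to collapsing the b/e/i/s variants of each triphone into one. The sketch below is only a rough illustration of that idea (it keeps the first senone assignment seen for each context); it is not how mk_mdef_gen works, which does the merging properly at training time:

```python
def strip_wpos(mdef_lines):
    """Rewrite every triphone's word position to 'i' and drop duplicates."""
    merged = {}
    for line in mdef_lines:
        fields = line.split()
        # fields: base, left, right, wpos, attrib, tmat, senone ids..., 'N'
        fields[3] = "i"
        key = tuple(fields[:4])
        # Keep the first senone assignment for each collapsed context.
        if key not in merged:
            merged[key] = " ".join(fields)
    return list(merged.values())

triphones = [
    "ZH UW ER b n/a 48 4141 4143 4146 N",
    "ZH UW ER e n/a 48 4141 4143 4146 N",
    "ZH UW ER i n/a 48 4141 4143 4146 N",
]

for line in strip_wpos(triphones):
    print(line)
```

The three ZH UW ER variants collapse into a single internal triphone, which is why the resulting context-dependency transducer no longer needs word-boundary information.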

P.S. If you want to learn more about WFSTs, read Paul Dixon's blog http://edobashira.com and Josef Novak's blog http://probablekettle.wordpress.com
