speaker-independent ( SI ) training
</term>
of
<term>
hidden Markov models ( HMM )
</term>
, which uses a large amount of
<term>
#21062First, we present a new paradigm for speaker-independent (SI) training of hidden Markov models (HMM), which uses a large amount of speech from a few speakers instead of the traditional practice of using a little speech from many speakers.
other,6-3-H90-1060,ak
. In addition , combination of the
<term>
training speakers
</term>
is done by averaging the statistics
#21100In addition, combination of the training speakers is done by averaging the statistics of independently trained models rather than the usual pooling of all the speech data from many speakers prior to training.
lr,16-6-H90-1060,ak
adaptation ( SA )
</term>
using the new
<term>
SI corpus
</term>
and a small amount of
<term>
speech
#21194Second, we show a significant improvement for speaker adaptation (SA) using the new SI corpus and a small amount of speech from the new (target) speaker.
other,31-2-H90-1060,ak
amount of
<term>
speech
</term>
from a few
<term>
speakers
</term>
instead of the traditional practice
#21079First, we present a new paradigm for speaker-independent (SI) training of hidden Markov models (HMM), which uses a large amount of speech from a few speakers instead of the traditional practice of using a little speech from many speakers.
other,22-4-H90-1060,ak
a standard
<term>
grammar
</term>
and
<term>
test set
</term>
from the
<term>
DARPA Resource Management
#21151With only 12 training speakers for SI recognition, we achieved a 7.5% word error rate on a standard grammar and test set from the DARPA Resource Management corpus.
other,44-2-H90-1060,ak
of using a little speech from many
<term>
speakers
</term>
. In addition , combination of the
#21092First, we present a new paradigm for speaker-independent (SI) training of hidden Markov models (HMM), which uses a large amount of speech from a few speakers instead of the traditional practice of using a little speech from many speakers.
other,30-3-H90-1060,ak
the
<term>
speech data
</term>
from many
<term>
speakers
</term>
prior to
<term>
training
</term>
. With
#21124In addition, combination of the training speakers is done by averaging the statistics of independently trained models rather than the usual pooling of all the speech data from many speakers prior to training.
other,3-4-H90-1060,ak
<term>
training
</term>
. With only 12
<term>
training speakers
</term>
for
<term>
SI recognition
</term>
, we
#21132With only 12 training speakers for SI recognition, we achieved a 7.5% word error rate on a standard grammar and test set from the DARPA Resource Management corpus.
tech,33-3-H90-1060,ak
many
<term>
speakers
</term>
prior to
<term>
training
</term>
. With only 12
<term>
training speakers
#21127In addition, combination of the training speakers is done by averaging the statistics of independently trained models rather than the usual pooling of all the speech data from many speakers prior to training.
other,3-9-H90-1060,ak
combined by averaging . Using only 40
<term>
utterances
</term>
from the
<term>
target speaker
</term>
#21249Using only 40 utterances from the target speaker for adaptation, the error rate dropped to 4.1% --- a 45% reduction in error compared to the SI result.
lr,20-4-H90-1060,ak
word error rate
</term>
on a standard
<term>
grammar
</term>
and
<term>
test set
</term>
from the
<term>
#21149With only 12 training speakers for SI recognition, we achieved a 7.5% word error rate on a standard grammar and test set from the DARPA Resource Management corpus.
other,10-8-H90-1060,ak
is transformed to the space of the
<term>
target speaker
</term>
and combined by averaging . Using
#21239Each reference model is transformed to the space of the target speaker and combined by averaging.
lr,23-6-H90-1060,ak
corpus
</term>
and a small amount of
<term>
speech
</term>
from the
<term>
new ( target ) speaker
#21201Second, we show a significant improvement for speaker adaptation (SA) using the new SI corpus and a small amount of speech from the new (target) speaker.
measure(ment),12-9-H90-1060,ak
</term>
for
<term>
adaptation
</term>
, the
<term>
error rate
</term>
dropped to 4.1 % --- a 45 % reduction
#21258Using only 40 utterances from the target speaker for adaptation, the error rate dropped to 4.1% --- a 45% reduction in error compared to the SI result.
tech,9-9-H90-1060,ak
the
<term>
target speaker
</term>
for
<term>
adaptation
</term>
, the
<term>
error rate
</term>
dropped
#21255Using only 40 utterances from the target speaker for adaptation, the error rate dropped to 4.1% --- a 45% reduction in error compared to the SI result.
other,6-9-H90-1060,ak
40
<term>
utterances
</term>
from the
<term>
target speaker
</term>
for
<term>
adaptation
</term>
, the
<term>
#21252Using only 40 utterances from the target speaker for adaptation, the error rate dropped to 4.1% --- a 45% reduction in error compared to the SI result.
other,10-5-H90-1060,ak
comparable to our best condition for this
<term>
test suite
</term>
, using 109
<term>
training speakers
#21170This performance is comparable to our best condition for this test suite, using 109 training speakers.
lr,27-2-H90-1060,ak
</term>
, which uses a large amount of
<term>
speech
</term>
from a few
<term>
speakers
</term>
instead
#21075First, we present a new paradigm for speaker-independent (SI) training of hidden Markov models (HMM), which uses a large amount of speech from a few speakers instead of the traditional practice of using a little speech from many speakers.
measure(ment),14-4-H90-1060,ak
recognition
</term>
, we achieved a 7.5 %
<term>
word error rate
</term>
on a standard
<term>
grammar
</term>
#21143With only 12 training speakers for SI recognition, we achieved a 7.5% word error rate on a standard grammar and test set from the DARPA Resource Management corpus.
tech,22-3-H90-1060,ak
models
</term>
rather than the usual
<term>
pooling
</term>
of all the
<term>
speech data
</term>
#21116In addition, combination of the training speakers is done by averaging the statistics of independently trained models rather than the usual pooling of all the speech data from many speakers prior to training.