The Kaldi keyword search system is based mostly on a conventional LVCSR pipeline. We have several main sub-systems, which separately decode the data; we then combine their outputs using conventional system combination techniques. The sub-systems are:
Our keyword search pipeline is based on lattice indexing as described in [5]; the lattices are generated using the "exact" lattice generation method described in [6]. To handle out-of-vocabulary (OOV) keywords, we use the method of [4], which, for an OOV keyword, constructs proxy keyword sequences: in-vocabulary word sequences that are phonetically similar to the keyword. This year we added a "lexicon expansion" method, in which we generate plausible new words using a syllable-level language model and add them to the lexicon and language model used for decoding (see Section 4.4). (This even slightly improves the WER.) We actually include both the original and expanded-lexicon versions of each system in the final system combination, but including the non-expanded decodings in the combination is not really necessary.
The code and scripts used for the main Kaldi system are available as part of Kaldi; see svn://svn.code.sf.net/p/kaldi/code/trunk/. The scripts we used this year are located in the directory egs/babel/s5b.
Something else that is new is the p-norm neural network [8]. This is a new nonlinearity type related to maxout (in that it is a dimension-reducing nonlinearity). It gave us around 1% absolute improvement compared with our old, tanh-based networks. On top of this, for LimitedLP we introduce an ensemble training method. Imagine training four networks from different random seeds. We can average the scores from all of them to get an improvement (around 2% absolute), but we prefer not to have to run multiple networks at test time. Our ensemble method introduces a term in the objective function that trains the networks' outputs towards each other, making them more similar, so that at test time we can pick just one of the networks. This gives us three quarters of the improvement of simply averaging the scores, but does not slow us down at test time. We only do this for LimitedLP because it slows down training too much to be practical for FullLP.
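As a rough illustration of the idea (not the exact objective used in our training code), the extra term can be viewed as smoothing each network's training targets towards the ensemble-average posteriors; the interpolation weight `alpha` below is an assumed parameter of the sketch, not a value from our recipe.

```python
import numpy as np

def ensemble_targets(hard_labels, net_posteriors, alpha=0.25):
    """Illustrative sketch only: smooth the one-hot targets towards the average
    posterior of the ensemble, so that the networks are trained towards each
    other.  alpha is an assumed interpolation weight; the real objective and
    its schedule may differ."""
    avg = np.mean(net_posteriors, axis=0)   # ensemble-average posteriors, (frames, states)
    smoothed = (1.0 - alpha) * hard_labels + alpha * avg
    # Each network is then trained with cross-entropy against `smoothed`
    # instead of the raw one-hot labels.
    return smoothed
```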
Our bottleneck feature system has been heavily modified since last year, and has improved. Firstly, we implemented it entirely in Kaldi, as opposed to last year's system, which was a hybrid between Kaldi and Theano. This makes the training faster, since Kaldi supports parallelized neural network training using multiple GPUs. The recipe is basically the same as last year's: a DNN with a 42-dimensional bottleneck; we append the bottleneck features to the baseline fMLLR features, splice across 3 frames, do LDA dimension reduction to 60 dimensions, and then train an SGMM system on these features. However, results seemed a little better with the Kaldi implementation, perhaps 0.5% absolute. It is hard to say why, as there are too many differences. What is new is that we implemented semi-supervised training in the LimitedLP case. We use the 1-best output from decoding as supervision for the untranscribed data, but only train on a frame if the state-level posterior is above a threshold (we use a low threshold of 0.35 for this case).
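A minimal sketch of the frame-selection rule for the semi-supervised training described above, assuming the per-frame state posteriors and the 1-best state sequence are already available as arrays:

```python
def select_frames(best_path_states, state_posteriors, threshold=0.35):
    """Keep a frame of untranscribed data only if the posterior of its 1-best
    state is at least `threshold` (0.35 in the LimitedLP setup above).
    Returns (frame index, supervision state) pairs."""
    keep = []
    for t, state in enumerate(best_path_states):
        if state_posteriors[t, state] >= threshold:
            keep.append((t, state))
    return keep
```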
Our point process model system (Section 4.16), while it gets only around half the ATWV of our conventional system by itself, gives us large improvements in combination with our conventional system, of around 3 to 4% ATWV. This is an unconventional "exemplar-based" approach.
Our expanded lexicon (Section 4.4) is also new. This method takes the provided lexicon as input and uses it to hypothesize likely new words and their pronunciations, along with their probabilities. We generate 2 million extra words, with associated probabilities, and we allocate the "unknown-word" probability mass of our language model to these words. Our method is "backwards", in that we first generate the phonetic sequences and then work out the spellings. The improvement this gives is extremely variable. For Bengali and Assamese it makes essentially no difference, but for Zulu LimitedLP, using the development keywords on the development data, it improved the Kaldi-only ATWV from 0.20 to 0.28.
The output of the decoding above is used as the input to the following algorithm. First we map the frames of the decoder best path to one of three classes: speech, noise or silence. The segmentation algorithm is as follows:
k a t _1
would become the phone sequence k_1 a_1 t_1
Formally, each tone version of a phone is a separate phone, but see
our explanation of context dependency below.
We noticed that in some languages, the original lexicon seemed to have been expanded
with some kind of script where some original phone was mapped to two alternative
phones. That was the case for Vietnamese last year and Zulu this year, and it
was helpful to reverse this mapping. Our mapping for Zulu is as follows:
Phone | Mapped to
---|---
k_> | g_<
3 | e
R | l
o | O
b_< | b
t_> | th
We have four phones in our inventory apart from those that appear in words; they are all modeled in a context independent way and using a different topology (5 states, where the middle 3 states all have transitions to each other). These are for silence, noise, vocalized-noise and unknown-words. The difference between vocalized noise and unknown-words is that vocalized noise models things like coughs and laughs, whereas the unknown-word phone models words whose pronunciation is not known (mainly so we can align them during training).
The lexicon expansion procedure works as follows (but note that lexicon expansion is not the only thing we do to handle OOVs; see also Section 4.15). We first take all the entries in our original lexicon and view them as sentences, where each syllable corresponds to one symbol (we ignore the spelling). We train an ARPA language model on this with SRILM; a 3-gram "modified Kneser-Ney with interpolation" model seemed to work the best. We then generate a large number of "sentences" from this language model: 20 million or so. For each unique sentence among those generated, we compute its language model probability; we then exclude the sentences that correspond to words in the original lexicon, take the 2 million best ones, and these become the pronunciations of our new lexicon entries.
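A minimal sketch of the selection step, assuming the generated syllable sequences are already available and `lm_logprob` is a hypothetical function returning the syllable-LM log-probability of a sequence:

```python
def select_new_pronunciations(generated, lm_logprob, original_prons, n_keep=2000000):
    """Keep the n_keep most probable unique generated syllable sequences that do
    not already correspond to lexicon entries.  `generated` is a list of syllable
    tuples, `original_prons` a set of syllable tuples from the original lexicon."""
    unique = {tuple(s) for s in generated}
    candidates = [s for s in unique if s not in original_prons]
    candidates.sort(key=lm_logprob, reverse=True)
    return candidates[:n_keep]
```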
A lexicon entry needs a spelling as well as a pronunciation, and to do this we
use the g2p tool from Sequitur in reverse to produce the most likely
spellings for each pronunciation. We reverse it by taking each lexicon entry,
e.g.
hi h iy
and reversing it to produce something like
hiy h i
Actually we don't do it exactly this way because we want iy
to appear as a single
symbol on the left, rather than as a sequence of two symbols. So we map the phones
to ASCII symbols first. When doing so we treat tags (e.g. tones) separately, so each tag
has its own ASCII symbol, and a phone with a tag would be rendered as two ASCII symbols.
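A minimal sketch of how the "reversed" training data for Sequitur could be prepared, assuming `phone_to_ascii` maps every phone to a single ASCII symbol and a tagged phone to two symbols (one for the base phone, one for the tag); the mapping itself is ours to choose:

```python
def build_reverse_g2p_lexicon(lexicon, phone_to_ascii):
    """For each (word, phones) lexicon entry, emit a reversed entry whose "word"
    is the pronunciation rendered as ASCII symbols and whose "pronunciation" is
    the spelling, letter by letter."""
    entries = []
    for word, phones in lexicon:
        pseudo_word = "".join(phone_to_ascii[p] for p in phones)
        pseudo_pron = " ".join(word)
        entries.append((pseudo_word, pseudo_pron))
    return entries

# e.g. with phone_to_ascii = {"h": "h", "iy": "I"}, the entry ("hi", ["h", "iy"])
# becomes ("hI", "h i"), keeping "iy" as a single symbol on the input side.
```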
We use g2p to generate a list of the top few likely spellings for each of the generated pronunciations. We take the pronunciations we generated and the probabilities of their spellings, and convert them into a list of words with word probabilities, and a list of pronunciations for each word with associated pronunciation probabilities. This is the output of the lexicon expansion, and it is used to create the lexicon and language model that we decode with.
We ran two versions of each system, one with and one without the lexicon expansion, because we wanted to see how much effect it was having. Because we had them both available, we decided to combine both versions for the final system combination, but this combination made very little difference to the results and we could equally well have submitted just the expanded-lexicon systems.
The main tunable parameters of the SGMM training are given below:
 | Num-gauss-UBM | Num-leaves | Num-substates
---|---|---|---
LimitedLP | 750 | 5000 | 18000
FullLP | 800 | 10000 | 80000
During decoding, we pick just one of the networks arbitrarily. Since it has been trained towards the other networks, it acts a little like an ensemble of networks, even though it is just one network. This gives us about 1.5% WER improvement.
For feature extraction we remove the part of the network after the 42-dimensional bottleneck, including the tanh nonlinearity, and append the bottleneck output to the baseline 40-dimensional features, giving an 82-dimensional feature vector. This is spliced across ±1 frame and the dimension is reduced with LDA to 60 dimensions. (Note: we do not start training on these features from scratch, but start with alignments from our SAT-trained GMM-based system.)
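A minimal numpy sketch of this feature construction, assuming per-utterance matrices of bottleneck and fMLLR features and a precomputed 60x246 LDA projection (the real pipeline does this with Kaldi feature transforms):

```python
import numpy as np

def make_features(bn_feats, fmllr_feats, lda_matrix):
    """bn_feats: (T, 42) bottleneck outputs; fmllr_feats: (T, 40) baseline
    features; lda_matrix: (60, 246) projection estimated beforehand.
    Returns (T, 60) features for SGMM training."""
    feats = np.hstack([bn_feats, fmllr_feats])                    # (T, 82)
    padded = np.pad(feats, ((1, 1), (0, 0)), mode="edge")         # repeat edge frames
    spliced = np.hstack([padded[:-2], padded[1:-1], padded[2:]])  # splice +/-1 frame -> (T, 246)
    return spliced @ lda_matrix.T                                 # LDA to 60 dims
```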
From this point we train an SGMM+BMMI system. Because the feature dimension is higher, the number of parameters would increase if we left the rest of the configuration of the system the same, so we use the following reduced configuration values:
 | Num-gauss-UBM | Num-leaves | Num-substates
---|---|---|---
LimitedLP | 500 | 5000 | 10000
FullLP | 5500 | 10000 | 50000
Stage | Num-leaves/gauss (LimitedLP) | Num-leaves/gauss (FullLP) | Feature type
---|---|---|---
mono | n/a | n/a | delta+delta-delta
tri1 | 1000/10k | 1000/10k | delta+delta-delta
tri2 | 2500/36k | 1000/20k | delta+delta-delta
tri3 | 2500/36k | 6000/75k | delta+delta-delta
tri4 | 2500/36k | 6000/75k | LDA+STC
tri5 | 2500/36k | 6000/75k | LDA+STC+fMLLR
The lattices from the speaker-independent decoding pass are used with the speaker-adapted "tri5" model to compute initial fMLLR transforms; these transforms are then used, again with the speaker-adapted model, to rescore the lattices, obtaining better posteriors with which the fMLLR transforms are estimated a second time. Then another lattice generation pass is done with the speaker-adapted model and adapted features, the fMLLR transforms are estimated a third time, and the lattices are rescored with those features.
Note: we do not include silence frames in the fMLLR computation. Since the lattice gives soft counts, this is accomplished via per-frame weights rather than a hard cutoff.
The decoding of the later models-- the SGMMs, DNNs and bottleneck feature based SGMMs-- all depend on the "tri5" decoding because they use the fMLLR transforms generated there.
Once we have these transforms, the DNN decoding is single-pass, but for the discriminatively trained DNNs we first decode with the basic DNN and then rescore the lattices with four different versions of the final DNN system, one for each epoch. This is so that we can choose the best epoch to use.
The SGMM decoding naturally has two passes: one using a speaker-independent version of the SGMM system (speaker-independent because it doesn't have speaker vectors, although we do have fMLLR features), and then another pass of decoding after estimating the speaker vectors. However, we only generate the lattice once. In order to ensure an accurate final lattice, we dump the state-level lattice from the first pass of decoding and don't do the final lattice-determinization until after estimating the speaker vectors. See [6] if the term "state-level lattice" is confusing.
We build a separate index for each language model scale in a predetermined range (e.g. 10, 12, 13, 14, 15), so that we can separately run the keyword search pipeline for each scale, and pick the scale with the best ATWV on the dev data. (Note: since there is only one dev set, all our numbers reported on the dev set have these scales optimized on that set, and the same applies for WER numbers).
The most non-obvious aspect of this is the per-keyword normalization of the scores. The Term Weighted Value (TWV) metric, after ignoring constant terms and doing a few manipulations, may be expressed as follows:
TWV = const + sum-over-keywords ( 1/K ( Ntrue-hit / Ntrue - beta/duration NFA ) )
Here, sum-over-keywords is taken over all keywords that were actually seen in the test set being considered. The values in the equation may be defined as follows:
Name | Definition |
---|---|
K | Number of keywords that appear in this test set |
Ntrue-hit | Number of occurrences of this keyword that we correctly spotted. |
Ntrue | Number of times this keyword actually occurred in this test set. |
NFA | Number of incorrect hits of this keyword that we produced. |
beta | A constant equal to exactly 999.9 (don't ask) |
duration | The total number of seconds of audio in the test set: a constant we know exactly. |
TWV-contribution = 1/K ( Ntrue-hit / Ntrue-estimate - beta/duration NFA )
Here, Ntrue-estimate and beta/duration are both known quantities. Consider one putative hit, i.e. one location in time where we have a nonzero posterior and we might want to produce a hit. Let the posterior of the keyword in the lattice be p. Let's assume that p is a reasonable estimate of the probability that the keyword actually exists there, which is reasonable assuming model correctness. As an aside, note that we scale down the acoustics in our lattices while computing the posteriors, so the probabilities are quite well calibrated; also, we have plotted the (posterior in our lattice) versus (probability that the word was actually there) and it's within spitting distance of a straight line. Anyway, back to the task at hand. We can write, for this putative hit,
expected-TWV-contribution = 1/K ( p / Ntrue-estimate - beta/duration (1-p) ) .
Here, all but one of the quantities in the equation are known. K is not known, because we don't know how many keywords were actually seen in the test set, but because we only care about the sign of this quantity we don't actually need to know K. For a putative hit, the expression above therefore tells us whether to say "yes" or "no": we keep the hit if it is positive, i.e. if
p / Ntrue-estimate - beta/duration (1-p) > 0
p (1/Ntrue-estimate + beta/duration) - beta/duration > 0
p > (beta/duration) / (1/Ntrue-estimate + beta/duration)
p > Ntrue-estimate / (duration/beta + Ntrue-estimate)
Let's call the value above the "threshold", i.e.
threshold = Ntrue-estimate / (duration/beta + Ntrue-estimate)
(there is a different threshold for each keyword). In order to make it easier to choose
the cutoff point for when to stop producing hits, we would like to produce the output
as normalized scores that are all somehow comparable to each other. That way we can tune a global threshold.
We would like to normalize our scores in such a way that they are still all between zero and one.
We do this by converting p to a log-ratio, i.e. q = log(p / (1-p)), computing a similar log-ratio for the
threshold, i.e. t = log(threshold / (1-threshold)), and then subtracting t from q,
i.e. q' = q - t, to produce a normalized log-ratio q' (so if q' > 0, then p > threshold).
Then we convert back from a log-ratio to an actual
probability, call this p'. When we work out the equations for this, it comes out to
p' = (1-threshold) * p / ((1-threshold)*p + (1-p)*threshold)
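Putting the threshold and the normalization together, the per-hit computation can be sketched as follows (beta and the audio duration are the known constants discussed above):

```python
def normalize_score(p, ntrue_estimate, duration, beta=999.9):
    """Per-keyword score normalization: map the lattice posterior p of a
    putative hit to a normalized score p' that is > 0.5 exactly when p is
    above the keyword-specific threshold."""
    threshold = ntrue_estimate / (duration / beta + ntrue_estimate)
    # Equivalent to converting p and threshold to log-odds, subtracting,
    # and converting back to a probability:
    return (1 - threshold) * p / ((1 - threshold) * p + (1 - p) * threshold)
```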
We have described our basic proxy search procedure in [4] so we will not repeat it at length here. The basic idea is to use a learned phone confusion matrix to find a list of in-vocabulary word sequences that are phonetically close to the sequence we want, with associated penalties for being too distant. As a special case, we don't penalize the proxy sequences for having extra phones at their beginning and end (so, for instance, if the pronunciation of a searched-for word appeared as part of a longer word, we would allow that without penalty).
As background, our index lookup is actually done by FST composition, where one of the things to be composed is the "query FST" (normally with a linear structure) and one is the huge index. In our proxy search method, we represent the set of proxy keywords, and their associated weights, as an FST, and to the keyword search pipeline it looks no different from a linear sequence (since the input is just an FST).
There is something new about our proxy keyword search pipeline this year. After implementing the "expanded lexicon", we noticed that the process of generating proxy keywords was very slow. This procedure involves various operations of composition and determinization, where the inputs are a linear sequence consisting of the OOV keyword (as phones), a phone-edit-distance FST, and a lexicon. When we made the lexicon much bigger, it became slow. In order to make it fast again, we had to rearrange the order of composition and determinization, and implement an "on-demand" FST pruning procedure for OpenFST (as part of the Kaldi extensions to OpenFST).
The point process model (PPM) for keyword search [9] is a whole-word, event-based acoustic modeling and phonetic search technique. It operates on sparse phonetic event streams extracted from the speech signal using a frame-level subword acoustic model. In our Babel system, we use the Kaldi deep neural network acoustic models described above to generate posteriorgrams over context-dependent states. We subsequently sum the posterior dimensions sharing the same center phone to produce monophone posteriorgrams for each utterance. After applying the matched filter smoothing of [10], local maxima of each posterior trajectory define the phonetic event times. The set of phonetic events for the search collection defines the index for subsequent keyword search; this construction, which is performed entirely independently of the keyword set, is our only use of the test audio.
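As a rough sketch of this construction (the smoothing kernel here is just an assumed 1-D matched filter, not necessarily the exact filter of [10]):

```python
import numpy as np

def phonetic_events(cd_posteriors, state_to_phone, kernel):
    """cd_posteriors: (T, num_states) posteriorgram over context-dependent
    states; state_to_phone: list mapping each state to its center phone.
    Sum states sharing a center phone, smooth each monophone trajectory with
    the matched filter, and take local maxima as phonetic event times."""
    num_phones = max(state_to_phone) + 1
    T = cd_posteriors.shape[0]
    mono = np.zeros((T, num_phones))
    for state, phone in enumerate(state_to_phone):
        mono[:, phone] += cd_posteriors[:, state]
    events = []
    for phone in range(num_phones):
        traj = np.convolve(mono[:, phone], kernel, mode="same")
        peaks = [t for t in range(1, T - 1)
                 if traj[t] > traj[t - 1] and traj[t] >= traj[t + 1]]
        events.append(peaks)
    return events
```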
The next stage is point process model construction. For
in-vocabulary words, we perform MAP estimation of the Poisson rate
parameters for each word in the lexicon [11]. This takes advantage of
any exemplars present in the training data, but falls back on
dictionary-based model priors (the simple variant, see [11] for details)
if no exemplars are available. For OOV keywords, we use Sequitur G2P
pronunciations to construct the dictionary models. Multi-word keyword
models are constructed by concatenating MAP estimated unigram PPMs, with
the overall duration distributions derived using the Monte Carlo
techniques from [12]. Search for each keyword is performed using an
optimized detection function calculation scheme that is 500,000 times
faster than realtime. We consider the PPM system performance both in
isolation and in combination (at the kwslist level) with the Kaldi LVCSR
search engine outputs.
Our systems are not cross-adapted, unless you count the fact that they all use the fMLLR transforms from the shared "tri5" stage. For transcription purposes, the only form of combination we use in the Kaldi sub-system is a combination procedure based on Minimum Bayes Risk decoding, as described in [1]. We view this as a more principled way to do confusion network combination (CNC) [18], without the various heuristics that are used to produce confusion networks. There is one aspect of this that we should explain, which relates to the language-model weight. Normally when decoding, we do a linear sweep over the language model weights over some range (e.g. 10, 11, 12, ... 18), and select the best one. We do the same when combining systems, except that sometimes the different systems will require substantially different language model weights and there is no one weight that is good for all of them; it's not practical to try all possible combinations of weights. When combining systems, we apply a different offset to the language-model weights for each system. This offset is determined by the beginning of the language-model-weight range that we sweep for each system, which in turn was determined by us when setting up the configuration files for our system. So for instance, if we start the regular SGMM system at offset 10, and the bottleneck+SGMM system at 15, then there would be an offset of 5 between the two systems when we do the combination.
We don't bother applying weights to the different systems when combining, but on occasion we do leave out some of the worse systems from the combination. This is decided by a human operator, based on trying different combinations on the dev set. The identities of the systems that were combined will be noted in the individual submissions.
The Kaldi-internal combination for keyword search is based on averaging the unnormalized putative hits (i.e. the lattice posteriors extracted from the index) across systems, before normalizing the averaged posteriors using the normalization method described in Section 4.14. Note that in order to do this averaging, we need some notion of when hits from different systems are "at the same time"; we simply treat hits as the same if they overlap in time. If one system does not have a hit at a particular time, it is treated as having a posterior of zero there.
We do not do a conventional average (i.e. a mean).
We wanted to implement something that was in between a mean and a geometric mean. We
used the notion that a geometric mean is a mean of logs, and that a log behaves like a
power of x, (1/p)(x^p - 1), as p approaches zero. So if we take the mean of x^p
for some power p between zero and one, and take the result to the power 1/p,
we get something between an arithmetic mean and a geometric mean. So this is what we do.
Suppose we have three scores: a, b and c. We choose a power p (say, p=0.5, but it's tuned
per language). Then we let
average = ((a^p + b^p + c^p) / 3)^(1/p) .
Actually we extend this to a weighted average, i.e.
average = (w_a a^p + w_b b^p + w_c c^p)^(1/p)
where the weights sum to one. The weights are determined manually in small scale
experiments on one of the languages, as the result is not very sensitive to the
weights. We used weights that are fairly close to each other, but with better
systems having larger weights.
We apply the normalization method of Section 4.14 after taking the weighted mean.
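A minimal sketch of the combination rule for a single putative hit, assuming per-system posteriors aligned in time (a system with no overlapping hit contributes a score of zero) and weights that sum to one:

```python
def combine_scores(scores, weights, p=0.5):
    """Weighted power mean of the per-system posteriors; p is tuned per
    language (p -> 1 gives an arithmetic mean, p -> 0 approaches a geometric
    mean).  The result is then normalized as in Section 4.14."""
    assert abs(sum(weights) - 1.0) < 1e-6
    return sum(w * s ** p for w, s in zip(weights, scores)) ** (1.0 / p)

# e.g. combine_scores([0.9, 0.7, 0.0], [0.4, 0.35, 0.25], p=0.5)
```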
DATADEF:==BaseLR{204LimitedLP}:AM{204LimitedLP},LM{204LimitedLP},PRON{204LimitedLP},AR{None}
Metric | Value
---|---
Ingestion Elapsed Time (hh:mm:ss) | 151:29:03
Ingestion Total CPU Time (hh:mm:ss) | 9546:33:38
Ingestion Total GPU Time (hh:mm:ss) | 92:23:16
Ingestion Maximum CPU Memory (gbytes) | 192
Ingestion Maximum GPU Memory (gbytes) | 16
Search Elapsed Time (hh:mm:ss) | 12:39:08
Search Total CPU Time (hh:mm:ss) | 427:17:22
Search Total GPU Time (hh:mm:ss) | 0:00:00
Search Maximum CPU Memory (gbytes) | 32
Search Maximum GPU Memory (gbytes) | 16