1 Introduction
Keyword spotting (KWS), also known as spoken term detection (STD), is the task of detecting predefined keywords in a stream of utterances. It is widely used as the trigger of intelligent agents in mobile phones and smart devices. Recently, deep neural network (DNN) based KWS has led to significant performance improvements over conventional methods. Deep KWS [1] first formulated keyword spotting as an audio classification problem. It trains a DNN model to predict the posteriors of predefined keywords, in which each neuron in the softmax output layer corresponds to a keyword, with an additional "filler" neuron representing all other non-keyword segments. This classification-based method achieves significant improvement over keyword/filler hidden Markov models. Later on, a number of classification-based methods [2, 3, 4, 5, 6, 7, 8] were explored to miniaturize the memory footprint. However, because the softmax cross-entropy loss focuses on maximizing the classification accuracy on the training data, the aforementioned models require a large number of training samples to achieve robust performance against diverse non-keyword segments at test time [1, 2, 4]. Because collecting as many types of non-keyword segments as possible for model training is expensive and sometimes infeasible, the classification-based models [3, 5, 6, 7, 8] perform particularly poorly in practice. Moreover, using a single "filler" neuron to represent all non-keyword segments does not reflect the diversity among these sounds, which further degrades performance.
Recently, several works [9, 10, 11, 12, 13] introduced metric learning into KWS. Metric learning adopts a ranking loss to learn the relative distance between samples, aiming to enlarge the inter-class variance and reduce the intra-class variance in an embedded space of the data. However, directly applying metric learning to KWS results in a significant performance drop if the prior knowledge that the target keywords are predefined and fixed is not taken into consideration. To address this problem, Huh et al. [12] proposed an angular prototypical network with fixed target classes (APFC) to enhance the robustness against non-keyword segments. However, it requires an additional support vector machine (SVM) to make the final decision. In [13], Vygon et al. combined a triplet-loss-based embedding extractor with a k-nearest neighbor (kNN) classifier, which achieves higher accuracy than the cross-entropy-loss-based methods but substantially increases the number of parameters and the computational complexity of the KWS model.
Motivated by works on the open-set recognition problem [14, 15], in this paper we propose a new loss function, named the maximization of the area under the receiver-operating-characteristic curve (AUC), and a confidence-based decision method, which together lead to a robust, small-footprint, and highly accurate KWS model. Specifically, the proposed multi-class AUC loss maximizes the classification accuracy of the predefined keywords and the detection AUC of non-keyword segments simultaneously. We compared the proposed multi-class AUC loss with the softmax cross-entropy loss [3], prototypical loss [12], APFC loss [12], and triplet loss [13] on the Google Speech Commands dataset v1 [16] and v2 [17]. Experimental results demonstrate that our methods outperform the comparison methods in most evaluation metrics. The main contributions of this paper are summarized as follows:

To our knowledge, we are the first to reformulate the low-resource keyword spotting task as an open-set recognition problem.

We propose a novel multi-class AUC loss. It outperforms the four representative reference methods in most evaluation metrics.

We propose a new confidence-based decision method. It helps the proposed method achieve state-of-the-art performance without a complex back-end classifier.
2 Background
The original AUC optimization is designed for binary classification only. Therefore, before describing the proposed multi-class AUC loss function, we first review the existing binary AUC optimization.
Given a binary-class dataset $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ where $y_i \in \{0, 1\}$, and a binary-class neural network $f(\cdot; \theta)$ with $\theta$ being the parameter of the network, we define two new subsets: $\mathcal{S}^{+} = \{f(\mathbf{x}_i; \theta) \mid y_i = 1\}$, which is the set of neural network scores for the samples with $y_i = 1$, and $\mathcal{S}^{-} = \{f(\mathbf{x}_i; \theta) \mid y_i = 0\}$, which is the set of neural network scores for the samples with $y_i = 0$. The cardinalities of these two subsets are $N^{+}$ and $N^{-}$ respectively. As described in [18], for the finite set of samples $\mathcal{D}$, the approximate estimate of the AUC metric is:

$$\mathrm{AUC} = \frac{1}{N^{+} N^{-}} \sum_{i=1}^{N^{+}} \sum_{j=1}^{N^{-}} \mathbb{1}\left(s_i^{+} > s_j^{-}\right) \qquad (1)$$

where $\mathbb{1}(\cdot)$ is an indicator function that returns 1 if the statement is true and 0 otherwise, and $s_i^{+}$ and $s_j^{-}$ are the elements of $\mathcal{S}^{+}$ and $\mathcal{S}^{-}$ respectively. As in [19], we relax (1) by replacing the indicator function with a modified hinge loss function:

$$\ell\left(s_i^{+}, s_j^{-}\right) = \left[\max\left(0, \delta - \left(s_i^{+} - s_j^{-}\right)\right)\right]^2 \qquad (2)$$

where $s_i^{+} \in \mathcal{S}^{+}$, $s_j^{-} \in \mathcal{S}^{-}$, and $\delta > 0$ is a tunable hyper-parameter controlling the distance margin between $\mathcal{S}^{+}$ and $\mathcal{S}^{-}$. Substituting (2) into (1) transforms the maximization problem of (1) into the following minimization problem:

$$\min_{\theta} \frac{1}{N^{+} N^{-}} \sum_{i=1}^{N^{+}} \sum_{j=1}^{N^{-}} \ell\left(s_i^{+}, s_j^{-}\right) \qquad (3)$$

which can be easily backpropagated through the network in a standard procedure.
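For concreteness, the relaxed objective (3) can be sketched in a few lines of NumPy; the function name and the default margin value are illustrative, not taken from the paper.

```python
import numpy as np

def binary_auc_hinge_loss(pos_scores, neg_scores, delta=0.3):
    """Relaxed AUC loss: mean squared hinge over all positive/negative pairs."""
    # Pairwise differences s_i^+ - s_j^-  (shape: N+ x N-).
    diff = pos_scores[:, None] - neg_scores[None, :]
    # Squared-hinge relaxation of the indicator 1(s_i^+ > s_j^-).
    return np.mean(np.maximum(0.0, delta - diff) ** 2)
```

Written with tensor operations in a deep learning framework, the same expression is differentiable, so (3) can be minimized by standard backpropagation.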
3 Algorithm description
3.1 Problem formulation
In this paper, we decompose the KWS task into a non-keyword segment detection subtask and a closed-set classification subtask. Specifically, for a given input sample, we first determine whether it belongs to the predefined keyword set; if so, we then decide which keyword it is. Note that the two subtasks are performed simultaneously in the proposed method.
To formalize the task, suppose there is a dataset $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, where $\mathbf{x}_i \in \mathbb{R}^{D}$ is the high-dimensional acoustic feature of the $i$-th sample and $y_i \in \{0, 1, \ldots, C\}$ is the ground-truth label of $\mathbf{x}_i$. Note that, without loss of generality, we always assume that there are $C+1$ categories, with class $0$ representing non-keyword segments and the other $C$ classes representing the $C$ keywords respectively.

We aim to train a neural network $f(\cdot; \theta)$, where $\theta$ is the parameter of the network. It maps the $D$-dimensional input acoustic feature to a $C$-dimensional vector, in which each dimension represents the confidence score of its corresponding keyword. In the test stage, we use $f(\cdot; \theta)$ to conduct KWS by the following criterion:

$$\hat{y} = \begin{cases} \arg\max_{c} s_c, & \text{if } \max_{c} s_c > \theta_d \\ 0, & \text{otherwise} \end{cases} \qquad (4)$$

where $\mathbf{s} = [s_1, \ldots, s_C]^{T}$ is the output score vector of the neural network $f(\mathbf{x}; \theta)$, and $\theta_d$ is a decision threshold. For simplicity, we denote $\mathbf{s}_i = f(\mathbf{x}_i; \theta)$ in the remainder of the paper.
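The decision criterion above can be sketched as follows; the function name and class-index convention (0 for non-keyword, 1..C for keywords) are ours.

```python
import numpy as np

def kws_decide(scores, threshold):
    """Pick the best-scoring keyword, or reject the input as a non-keyword."""
    c = int(np.argmax(scores))
    # Accept the keyword only if its confidence exceeds the decision threshold.
    return c + 1 if scores[c] > threshold else 0  # 0 = non-keyword, 1..C = keywords
```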
3.2 The proposed multi-class AUC optimization
Several studies have extended the binary AUC optimization to multi-class problems, e.g. [19, 20]. In this work, we propose a new extension that is suitable for most multi-class classification tasks and is computationally straightforward. The key idea of this extension is to redefine the two subsets $\mathcal{S}^{+}$ and $\mathcal{S}^{-}$ of the binary AUC optimization in new forms that suit the multi-class AUC optimization problem.
Specifically, for the general KWS problem with more than one keyword, we define the subset of positive examples as
$$\mathcal{S}^{+} = \left\{\, s_{i, y_i} \mid y_i \neq 0 \,\right\}$$
and the subset of negative examples as
$$\mathcal{S}^{-} = \left\{\, \max_{c \neq y_i} s_{i,c} \mid y_i \neq 0 \,\right\} \cup \mathcal{S}_{0}$$
where $s_{i,c}$ is the score at the $c$-th position of the vector $\mathbf{s}_i$, $\max_{c \neq y_i} s_{i,c}$ is the maximum value of $\mathbf{s}_i$ after removing the score at the $y_i$-th position of $\mathbf{s}_i$, and $\mathcal{S}_{0} = \{\, \max_{c} s_{i,c} \mid y_i = 0 \,\}$ represents the set of the maximum output scores of the neural network for the non-keyword segments in $\mathcal{D}$.
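A minimal NumPy sketch of our reading of these definitions follows: build $\mathcal{S}^{+}$ and $\mathcal{S}^{-}$ from a batch of scores, then apply the squared-hinge relaxation of Section 2 over all pairs. The function name, default margin, and pair-averaging are illustrative assumptions.

```python
import numpy as np

def multiclass_auc_loss(scores, labels, delta=0.3):
    """Squared-hinge AUC over the positive/negative score subsets."""
    kw = labels > 0
    kw_scores = scores[kw]
    kw_labels = labels[kw] - 1          # map keyword classes 1..C to columns 0..C-1
    rows = np.arange(len(kw_labels))
    # S+: ground-truth keyword score of every keyword sample.
    s_pos = kw_scores[rows, kw_labels]
    # Competing scores: max over the remaining keywords for keyword samples ...
    masked = kw_scores.copy()
    masked[rows, kw_labels] = -np.inf
    s_neg = masked.max(axis=1)
    # ... plus the max score of every non-keyword sample.
    s_neg = np.concatenate([s_neg, scores[~kw].max(axis=1)])
    diff = s_pos[:, None] - s_neg[None, :]
    return np.mean(np.maximum(0.0, delta - diff) ** 2)
```

Expressed with framework tensors, the same computation is differentiable end to end.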
Algorithm 1 presents the proposed multi-class AUC loss in detail.
3.3 Confidence-based decision for the multi-class AUC loss
In the test stage, the decision threshold $\theta_d$ is calculated on a validation set by:

$$\theta_d = \frac{1}{M} \sum_{i=1}^{M} \max_{c} s_{i,c} \qquad (5)$$

where $M$ is the size of the validation set.
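As a sketch, assuming the threshold is the mean of the per-sample maximum confidence over the validation set (our reading of the variables defined in this subsection; the function name is ours):

```python
import numpy as np

def confidence_threshold(val_scores):
    """theta_d: average of each validation sample's top keyword confidence."""
    # Keyword samples pull the average up, non-keyword samples pull it down,
    # so the threshold lands between the two score populations.
    return val_scores.max(axis=1).mean()
```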
3.4 Connection to other loss functions
This subsection presents the connection of the proposed multi-class AUC loss to other loss functions.
3.4.1 Connection to multi-class hinge loss
Under the same assumptions as in Section 3.1, the multi-class classification hinge loss is:

$$L_{\text{hinge}} = \frac{1}{N} \sum_{i=1}^{N} \sum_{c \neq y_i} \max\left(0, \delta + s_{i,c} - s_{i,y_i}\right) \qquad (6)$$
The connection between the proposed multi-class AUC loss and the multi-class hinge loss is as follows. The multi-class AUC loss calculates the loss on the whole training set; it essentially learns a ranking of the training samples without resorting to a classification-based loss explicitly. In contrast, the multi-class hinge loss calculates the optimization objective on each sample separately and then averages over the entire dataset, and it needs to assign all non-keyword segments to a single class.
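The per-sample objective described above can be sketched as below; the margin default is illustrative, and non-keywords are folded into one "filler" class as the text describes.

```python
import numpy as np

def multiclass_hinge_loss(scores, labels, delta=0.3):
    """Per-sample margins against the ground-truth class, averaged over the set."""
    n = len(labels)
    true = scores[np.arange(n), labels][:, None]      # ground-truth class scores
    margins = np.maximum(0.0, delta + scores - true)  # hinge vs. every other class
    margins[np.arange(n), labels] = 0.0               # drop the c = y_i term
    return margins.sum(axis=1).mean()
```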
3.4.2 Connection to APFC loss
The APFC loss first arranges the keywords in a predefined order. Then, for each mini-batch, it selects one sample from each of the $C$ keywords, followed by a number of non-keyword samples. Note that the first $C$ samples should be arranged in the predefined order of the keywords.
According to [12], we rewrite the APFC loss as:

$$L_{\text{APFC}} = -\frac{1}{B} \sum_{j=1}^{B} \log \frac{\exp(S_{j,j})}{\sum_{k=1}^{B} \exp(S_{j,k})} \qquad (7)$$

with

$$S_{j,k} = w \cdot \cos\left(\mathbf{e}_j, \mathbf{c}_k\right) + b \qquad (8)$$

where $B$ is the mini-batch size, $\mathbf{e}_j$ is the extracted feature of the $j$-th sample by the neural network, $\mathbf{c}_k$ is the learnable class center of the $k$-th keyword, and $w$ and $b$ are learnable parameters with $w > 0$.
The proposed AUC loss and the APFC loss are similar in that neither assigns the widely distributed non-keyword segments to a single "filler" class. However, the implementation of the APFC loss places a strict constraint on the samples in each mini-batch. Moreover, the APFC-loss-based model still needs an SVM back-end to make the final decision.
3.4.3 Connection to other multi-class AUC losses
The multi-class AUC optimization in [20] is a natural extension of the binary AUC optimization. Gimeno et al. extended the binary AUC optimization to the multi-class problem via the one-versus-one and one-versus-rest frameworks. The one-versus-one multi-class AUC loss averages the pairwise binary AUC losses. The one-versus-rest multi-class AUC loss decomposes the multi-class classification task into $C$ binary tasks; for the $c$-th task, the $c$-th class is viewed as the positive class and all other classes are merged into a negative class. However, neither method can be directly used for our open-set optimization problem, since they need to assign non-keyword segments to a "filler" class. In addition, our proposed AUC loss is more computationally efficient than these two methods.
4 Experimental setup
4.1 Data preparation
In our experiments, two popular keyword spotting datasets, Google Speech Commands version 1 (GSC v1) [16] and version 2 (GSC v2) [17], are used for evaluation. GSC v1 consists of 65K one-second-long recordings of 30 words from thousands of different speakers. GSC v2 is an augmented version of GSC v1 that contains 105K utterances of 35 words. In addition, both datasets contain several minute-long background noise files. The sampling rate of all signals in the two datasets is 16 kHz.
Both GSC v1 and GSC v2 include a "validation_list" file and a "testing_list" file. We use the audio files in "validation_list" and "testing_list" as validation and test data respectively, and the remaining audio files as training data. Following previous works, we apply a random time-shift and noise injection to the training data. Specifically, we first perform a random time-shift of $Y$ milliseconds on each sample, where $Y$ is drawn uniformly from a fixed range. We then add background noise to each sample with a probability of 0.8, where the noise is chosen randomly from the background noise files. Note that the random time-shift and noise injection are performed on the fly at each training step. Finally, 40-dimensional Mel-frequency cepstral coefficient (MFCC) features are extracted with a window length of 25 ms and a stride of 10 ms and stacked over the time axis.
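The on-the-fly pipeline above can be sketched as follows; the shift range, the noise gain, and the function name are illustrative assumptions, not values from the paper.

```python
import numpy as np

def augment(wave, noises, shift_ms=100, sr=16000, noise_prob=0.8, rng=None):
    """Random time-shift (zero-padded) followed by probabilistic noise injection."""
    rng = rng or np.random.default_rng()
    # Shift by Y ms, drawn uniformly from [-shift_ms, shift_ms] (illustrative range).
    shift = int(rng.integers(-shift_ms, shift_ms + 1)) * sr // 1000
    out = np.zeros_like(wave)
    if shift >= 0:
        out[shift:] = wave[:len(wave) - shift]
    else:
        out[:shift] = wave[-shift:]
    # With probability noise_prob, mix in a random crop of a background noise file.
    if noises and rng.random() < noise_prob:
        noise = noises[int(rng.integers(len(noises)))]
        start = int(rng.integers(len(noise) - len(wave) + 1))
        out = out + 0.1 * noise[start:start + len(wave)]  # gain 0.1 is illustrative
    return out
```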
4.2 Backbone network
We use res15 [3] as the backbone network. As shown in Figure 1, it starts with a bias-free convolution layer (Conv) with weight $\mathbf{W} \in \mathbb{R}^{m \times r \times n}$, where $m$ and $r$ are the height and width of the convolution kernel respectively, and $n$ is the number of output channels. It then feeds the output of the first convolution layer into a chain of residual blocks (Res), followed by a separate non-residual convolution layer. Finally, the output of the network is obtained by an average-pooling layer (AvgPool). Additionally, convolution dilation is used to increase the receptive field of the network, and a batch normalization layer (BatchNorm) is added after each convolution layer to help train the deep network. The details of the backbone network are listed in Table 1.

Layer       m × r   n    Dilation   #Par.   #Mult.
Conv        3 × 3   45   1 × 1      405     1.52M
Res × 6     3 × 3   45   --         219K    824M
Conv        3 × 3   45   16 × 16    18.2K   68.6M
BatchNorm   --      45   --         --      169K
AvgPool     --      45   --         --      45
Total                               238K    894M
Table 2: Comparison between the proposed multi-class AUC losses and the reference methods ("-R": random sampler; "-F": fixed proportion sampler).

Loss                Backend   GSC v1                            GSC v2
                              Total acc  Closed acc  F1 score   Total acc  Closed acc  F1 score
Cross entropy [3]   --        89.96%     97.14%      0.8805     92.74%     97.46%      0.9068
Prototypical [12]   --        87.89%     95.88%      0.8654     93.32%     96.55%      0.9149
APFC [12]           SVM       91.59%     96.72%      0.8962     93.77%     97.11%      0.9188
Triplet [13]        kNN       92.09%     97.28%      0.9019     94.01%     97.78%      0.9251
Multi-class AUC-R   --        92.16%     97.01%      0.9031     94.87%     97.39%      0.9315
Multi-class AUC-F   --        92.97%     97.22%      0.9115     94.71%     97.50%      0.9312
4.3 Tasks and evaluation metrics
The tasks in previous works [3, 5, 7, 8] focus on discriminating 11 keywords ("yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go", "silence") and a non-keyword class "unknown", where "silence" denotes silence segments and "unknown" represents all other words. In their settings, all unknown words in the test set have been seen by the model in the training stage, which is inconsistent with real-world KWS applications.
To match real-world KWS applications, in our experiments we consider the task in [12], where ten unknown words ("zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine") are used for testing only. We evaluate our model by comparing the following metrics with other related works.

Total acc is the classification accuracy on the test set that contains unseen unknown words, which better reflects the performance of a KWS model in the real world. Note that unseen unknown words refer to the above ten unknown words that are used for testing only.

Closed acc is the classification accuracy on the test set that does not contain unseen unknown words.

We also report the F1 score, extended to the multi-class case by "macro" averaging, on the test set that contains unseen unknown words.
In addition, we plot the detection error tradeoff (DET) curves of the non-keyword segment detection subtask to evaluate the KWS models.
4.4 Data sampler
Usually, the training samples in each mini-batch are randomly sampled from the whole training set, which causes the proportion of keywords to non-keywords in each mini-batch to vary greatly. We denote this sampling method as the random sampler. However, the variable proportion hinders the convergence of the model training of the proposed method. To overcome this problem, we use a fixed proportion sampler, which keeps the proportion of keywords to non-keywords consistent across mini-batches.
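The fixed proportion sampler can be sketched as below; the 32/64 split matches the setting reported in Section 4.5, while the function name and generator interface are our own.

```python
import random

def fixed_proportion_batches(kw_indices, nonkw_indices, n_kw=32, n_nonkw=64, seed=0):
    """Yield mini-batches with a fixed keyword / non-keyword ratio."""
    rng = random.Random(seed)
    while True:
        # Every batch holds exactly n_kw keyword and n_nonkw non-keyword samples,
        # unlike a random sampler where the ratio fluctuates from batch to batch.
        yield rng.sample(kw_indices, n_kw) + rng.sample(nonkw_indices, n_nonkw)
```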
4.5 Training details
Each model in our experiments is trained for 60 epochs using the Adam optimizer [21]. The initial learning rate is set to 0.001 and reduced to 0.0001 after 30 epochs. For the cross-entropy loss, we use a mini-batch size of 128 with weight decay. We use the same hyper-parameters as [12] and [13] for the prototypical loss, APFC loss, and triplet loss. We use the validation set to select the best model among the epochs and to evaluate the effect of the hyper-parameter $\delta$. We evaluate the proposed multi-class AUC loss with both the fixed proportion sampler and the random sampler. For the fixed proportion sampler, the numbers of keywords and non-keywords in each mini-batch are set to 32 and 64 respectively; for the random sampler, the mini-batch size is set to 128, the same as the other comparison methods. The hyper-parameter $\delta$ is set to 0.3. Following the same training procedure, we run all comparison methods five independent times and report the average performance.
5 Results
5.1 Evaluation of the proposed methods
Table 2 lists the comparison between the proposed methods and the four baselines. From the table, we see that both variants of the proposed multi-class AUC loss achieve significant improvements in Total acc and F1 score, and achieve results competitive with the best reference method in Closed acc. Take the result on GSC v1 as an example. Compared with the cross-entropy loss, the multi-class AUC loss with the fixed proportion sampler achieves 30.0% and 25.9% relative improvements (measured as relative error reduction) in Total acc and F1 score respectively, with a slightly higher Closed acc. Even compared with the triplet loss with a complex kNN back-end, the proposed method still achieves relative improvements of 11.1% in Total acc and 9.8% in F1 score while maintaining a similar Closed acc.
Table 3: Effect of the margin $\delta$ of the multi-class AUC loss on GSC v1 ("R": random sampler; "F": fixed proportion sampler; the last column is the cross-entropy baseline).

Metric      Sampler  δ=0.1   0.2     0.25    0.3     0.35    0.4     0.5     Cross entropy
Closed acc  R        94.52%  96.29%  96.53%  96.85%  96.72%  96.49%  95.54%  96.20%
            F        94.04%  96.54%  96.71%  96.81%  96.44%  96.53%  96.20%
F1 score    R        0.9426  0.9578  0.9581  0.9615  0.9577  0.9535  0.9429  0.9513
            F        0.9321  0.9599  0.9613  0.9613  0.9553  0.9546  0.9508
To further investigate the effectiveness of the proposed method, we conduct a comparison on GSC v2 using the same settings as on GSC v1. The experimental results again demonstrate the superiority of our method. In addition, the result on GSC v2 indicates that the larger training data of GSC v2 is responsible for the substantial improvement in all evaluation metrics, which is consistent with the experimental findings in [17]. However, although both variants of the proposed multi-class AUC loss achieve better results on GSC v2 than on GSC v1, the improvement with the random sampler is more evident than that with the fixed proportion sampler. This may be because the training data of GSC v2 contains more non-keywords than that of GSC v1.
From Table 2 we also see that the APFC loss with an SVM back-end and the triplet loss with a kNN back-end outperform the prototypical loss and the cross-entropy loss, which demonstrates that the metric-learning-based methods still require a decision back-end to achieve satisfactory performance. In addition, we plot the DET curves of the non-keyword segment detection subtask in Figure 2. The curves are consistent with the results in Table 2 and show that the two variants of the proposed multi-class AUC loss outperform the reference methods.
5.2 Effect of the hyper-parameter $\delta$ on performance
This subsection investigates the effect of the hyper-parameter $\delta$ on performance. Because there are no unseen unknown words in the validation set, here we only use Closed acc and F1 score as the evaluation metrics. For simplicity, we show the experimental results on GSC v1 only; the experimental phenomena on the other evaluation dataset are consistent with those on GSC v1. Table 3 lists the result on GSC v1. From the table, one can see that the parameter $\delta$, which controls the margin of the AUC loss, plays an important role in the performance. Both variants of the multi-class AUC loss outperform the cross-entropy baseline in the two evaluation metrics when $0.2 \le \delta \le 0.4$. It is also observed that the results in both evaluation metrics first increase and then decrease as $\delta$ increases, with the best performance achieved at $\delta = 0.3$.
6 Conclusions
In this study, we have proposed a robust and highly accurate KWS method based on a novel multi-class AUC loss function and a confidence-based decision method. Our KWS method not only significantly improves the robustness of the model against unseen sounds by optimizing the proposed multi-class AUC loss, but also eliminates the complex back-end processing module through the simple confidence-based decision method. To our knowledge, this is the first time that the low-resource keyword spotting task has been formulated as an open-set recognition problem. We compared the proposed method with four representative methods on the two publicly available datasets GSC v1 and GSC v2. Experimental results show that the proposed method significantly outperforms the four representative methods in most evaluations, with a smaller model size and lower computational complexity.
References
[1] Guoguo Chen, Carolina Parada, and Georg Heigold, "Small-footprint keyword spotting using deep neural networks," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 4087–4091.

[2] Sercan Ö. Arık, Markus Kliegl, Rewon Child, Joel Hestness, Andrew Gibiansky, Chris Fougner, Ryan Prenger, and Adam Coates, "Convolutional recurrent neural networks for small-footprint keyword spotting," Proc. Interspeech 2017, pp. 1606–1610, 2017.
[3] Raphael Tang and Jimmy Lin, "Deep residual learning for small-footprint keyword spotting," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5484–5488.
[4] Changhao Shan, Junbo Zhang, Yujun Wang, and Lei Xie, "Attention-based end-to-end models for small-footprint keyword spotting," Proc. Interspeech 2018, pp. 2037–2041, 2018.
[5] Seungwoo Choi, Seokjun Seo, Beomjun Shin, Hyeongmin Byun, Martin Kersner, Beomsu Kim, Dongyoung Kim, and Sungjoo Ha, "Temporal convolution for real-time keyword spotting on mobile devices," Proc. Interspeech 2019, pp. 3372–3376, 2019.
[6] Ye Bai, Jiangyan Yi, Jianhua Tao, Zhengqi Wen, Zhengkun Tian, Chenghao Zhao, and Cunhang Fan, "A time delay neural network with shared weight self-attention for small-footprint keyword spotting," in INTERSPEECH, 2019, pp. 2190–2194.
[7] Menglong Xu and Xiao-Lei Zhang, "Depthwise separable convolutional ResNet with squeeze-and-excitation blocks for small-footprint keyword spotting," Proc. Interspeech 2020, pp. 2547–2551, 2020.
[8] Chen Yang, Xue Wen, and Liming Song, "Multi-scale convolution for robust keyword spotting," Proc. Interspeech 2020, pp. 2577–2581, 2020.
[9] Niccolo Sacchi, Alexandre Nanchen, Martin Jaggi, and Milos Cernak, "Open-vocabulary keyword spotting with audio and text embeddings," in INTERSPEECH, 2019.

[10] Yougen Yuan, Zhiqiang Lv, Shen Huang, and Lei Xie, "Verifying deep keyword spotting detection with acoustic word embeddings," in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2019, pp. 613–620.
[11] Peng Zhang and Xueliang Zhang, "Deep template matching for small-footprint and configurable keyword spotting," Proc. Interspeech 2020, pp. 2572–2576, 2020.
 [12] Jaesung Huh, Minjae Lee, Heesoo Heo, Seongkyu Mun, and Joon Son Chung, “Metric learning for keyword spotting,” in 2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021, pp. 133–140.
 [13] Roman Vygon and Nikolay Mikhaylovskiy, “Learning efficient representations for keyword spotting with triplet loss,” arXiv preprint arXiv:2101.04792, 2021.

[14] Abhijit Bendale and Terrance E. Boult, "Towards open set deep networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1563–1572.
[15] Terrance DeVries and Graham W. Taylor, "Learning confidence for out-of-distribution detection in neural networks," arXiv preprint arXiv:1802.04865, 2018.

[16] Pete Warden, "Speech commands: A public dataset for single-word speech recognition," Dataset available from http://download.tensorflow.org/data/speech_commands_v0, vol. 1, 2017.
[17] P. Warden, "Speech commands: A dataset for limited-vocabulary speech recognition," arXiv preprint arXiv:1804.03209, 2018.

[18] Zi-Chen Fan, Zhongxin Bai, Xiao-Lei Zhang, Susanto Rahardja, and Jingdong Chen, "AUC optimization for deep learning based voice activity detection," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6760–6764.
[19] Zhongxin Bai, Xiao-Lei Zhang, and Jingdong Chen, "Partial AUC optimization based deep speaker embeddings with class-center learning for text-independent speaker verification," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6819–6823.
[20] Pablo Gimeno, Victoria Mingote, Alfonso Ortega, Antonio Miguel, and Eduardo Lleida, "Generalising AUC optimisation to multi-class classification for audio segmentation with limited training data," IEEE Signal Processing Letters, 2021.
 [21] Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.