
2021, 3(1): 65-75

Published Date:2021-2-20 DOI: 10.1016/j.vrih.2020.11.006

Multi-scale discrepancy adversarial network for cross-corpus speech emotion recognition


One of the most critical issues in human-computer interaction applications is recognizing human emotions based on speech. In recent years, the challenging problem of cross-corpus speech emotion recognition (SER) has generated extensive research. Nevertheless, the domain discrepancy between training data and testing data remains a major challenge to achieving improved system performance.
This paper introduces a novel multi-scale discrepancy adversarial (MSDA) network that performs domain adaptation at multiple timescales for cross-corpus SER, i.e., it integrates domain discriminators at hierarchical levels into the emotion recognition framework to mitigate the gap between the source and target domains. Specifically, we extract two kinds of speech features, handcrafted features and deep features, at three timescales: global, local, and hybrid. At each timescale, the domain discriminator and the feature extractor compete against each other, so that the extractor learns features that minimize the discrepancy between the two domains by fooling the discriminator.
Extensive experiments on cross-corpus and cross-language SER were conducted on a combined dataset comprising one Chinese dataset and two English datasets commonly used in SER. The MSDA benefits from the strong discriminative power provided by the adversarial process, in which three discriminators work in tandem with an emotion classifier. Accordingly, the MSDA outperforms all baseline methods.
The proposed architecture was tested on a combination of one Chinese and two English datasets. The experimental results demonstrate the superiority of our powerful discriminative model for solving cross-corpus SER.
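The adversarial interplay described above can be illustrated with a minimal sketch. The abstract does not specify the loss functions, the weighting between them, or the label conventions, so everything below is an assumption made for illustration: cross-entropy losses, a single trade-off weight `lam`, one domain discriminator per timescale (keyed `global`, `local`, `hybrid`), and the hypothetical helper names `cross_entropy` and `adversarial_objectives`. It shows only how the two competing objectives might be combined for one sample, not the authors' implementation.

```python
import math


def cross_entropy(probs, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(max(probs[label], 1e-12))


def adversarial_objectives(emotion_probs, emotion_label,
                           domain_probs_by_scale, domain_label, lam=0.1):
    """Return (extractor_objective, discriminator_objective) for one sample.

    Assumed setup: each discriminator minimizes its own domain loss, while
    the feature extractor minimizes the emotion loss minus `lam` times the
    summed domain losses, i.e. it is rewarded when the discriminators fail.
    """
    # Assumption: target-domain samples carry no emotion label (None).
    if emotion_label is None:
        emotion_loss = 0.0
    else:
        emotion_loss = cross_entropy(emotion_probs, emotion_label)

    # One domain loss per timescale, summed over the three discriminators.
    domain_loss = sum(cross_entropy(p, domain_label)
                      for p in domain_probs_by_scale.values())

    extractor_objective = emotion_loss - lam * domain_loss
    return extractor_objective, domain_loss


# Discriminator outputs are [P(source), P(target)]; domain label 0 = source.
confused = {"global": [0.5, 0.5], "local": [0.5, 0.5], "hybrid": [0.5, 0.5]}
confident = {"global": [0.9, 0.1], "local": [0.9, 0.1], "hybrid": [0.9, 0.1]}
emotion_probs = [0.7, 0.1, 0.1, 0.1]  # four hypothetical emotion classes

ext_confused, disc_confused = adversarial_objectives(emotion_probs, 0, confused, 0)
ext_confident, disc_confident = adversarial_objectives(emotion_probs, 0, confident, 0)
print(ext_confused, ext_confident)
```

With identical emotion predictions, the extractor's objective is lower when all three discriminators are confused than when they identify the domain confidently, which is the sense in which the extractor learns by fooling the discriminators. In a trainable model this sign flip is commonly realized with a gradient reversal layer rather than two explicit objectives.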


Human-computer interaction; Cross-corpus speech emotion recognition; Hierarchical discriminators; Domain adaptation

Cite this article

Wanlu ZHENG, Wenming ZHENG, Yuan ZONG. Multi-scale discrepancy adversarial network for cross-corpus speech emotion recognition. Virtual Reality & Intelligent Hardware, 2021, 3(1): 65-75 DOI:10.1016/j.vrih.2020.11.006



