2019 Interspeech speech emotion recognition paper reading

Posted by 瘦欲 on 2020-01-12 07:45:30

1. Improved End-to-End Speech Emotion Recognition Using Self Attention Mechanism and Multitask Learning

  1. University of Tokyo
  2. End-to-end multitask learning with self-attention; the auxiliary task is gender classification.
    Features are extracted from the speech spectrogram rather than hand-crafted, and fed into a CNN-BLSTM end-to-end network. A self-attention mechanism then focuses on the emotionally salient periods. Finally, considering the features shared between the emotion and gender classification tasks, gender classification is added as an auxiliary task that shares useful information with the main emotion classification task.
  3. The abstract motivates why SER has attracted great attention from human-computer interaction applications, which makes it more vivid. The introduction covers, in turn, features, the advantages of spectrograms, traditional machine learning approaches such as HMM, GMM, and SVM, and CNN/RNN approaches.
  4. multi-headed self attention
  5. Spectrogram extraction: utterance length is normalized to 7.5 s; shorter utterances are zero-padded and longer ones are cut. Hanning window of 800 samples, sampling rate 16000 Hz.
    Short-time Fourier transform (STFT).
  6. α and β are both set to 1.
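The spectrogram pipeline described above (pad or cut to 7.5 s, 800-sample Hanning window, 16 kHz sampling rate, then STFT) can be sketched as follows. The hop size of 400 samples and the log compression are assumptions for illustration; the notes do not state them.

```python
import numpy as np

SR = 16000                  # sampling rate from the paper
TARGET_LEN = int(7.5 * SR)  # normalize length to 7.5 s -> 120000 samples
WIN = 800                   # Hanning window length from the paper
HOP = 400                   # hop size: an assumption, not stated in the notes

def extract_spectrogram(wave):
    """Pad/cut to 7.5 s, then compute a log-magnitude STFT spectrogram."""
    if len(wave) < TARGET_LEN:
        wave = np.pad(wave, (0, TARGET_LEN - len(wave)))  # zero-pad short clips
    else:
        wave = wave[:TARGET_LEN]                          # cut long clips
    window = np.hanning(WIN)
    frames = [wave[i:i + WIN] * window
              for i in range(0, TARGET_LEN - WIN + 1, HOP)]
    mag = np.abs(np.fft.rfft(frames, axis=1))             # |STFT|
    return np.log(mag + 1e-8)                             # log compression (assumed)

spec = extract_spectrogram(np.random.randn(5 * SR))       # a 5 s clip, zero-padded
# spec.shape == (299, 401): 299 frames, 401 frequency bins
```

With these settings every utterance yields a fixed-size spectrogram, which is what lets a CNN-BLSTM consume it directly.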

Experiments

On IEMOCAP, EXCITED and HAPPY are combined into HAPPY, giving four classes and 5531 samples in total.
The compared results use both 5-fold cross-validation (2018) and leave-one-session-out evaluation.

2. Self-attention for Speech Emotion Recognition


  1. “Attention is all you need” (2017):
    based on an encoder-decoder structure that uses no recurrence, but instead uses weighted
    correlations between the elements of the input sequence.
    Transformer: maps the input sequence to a query, a key and a value.
    Surveys the various attention mechanisms.
  2. Proposes a global windowing system that works on top of the local windows.
  3. classification and regression.
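The query/key/value mapping described above can be sketched as plain scaled dot-product self-attention, where the attention weights are exactly the "weighted correlations between the elements of the input sequence". This is a minimal NumPy illustration of the mechanism, not the paper's actual model; the toy dimensions are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stable softmax
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Weight each value by the (scaled) query-key correlation; no recurrence."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (T_q, T_k) pairwise correlations
    weights = softmax(scores, axis=-1)   # each row is a distribution over positions
    return weights @ V, weights

T, d = 6, 4                               # toy sequence length and dimension
x = np.random.default_rng(0).standard_normal((T, d))
out, w = scaled_dot_product_attention(x, x, x)   # self-attention: Q = K = V = x
# out has the same shape as x; each row of w sums to 1
```

In the full Transformer, Q, K, and V are separate learned linear projections of the input, and multiple such heads run in parallel (multi-head attention).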

Experiments

5-fold cross-validation.
Because HAPPY has few samples, it was replaced with EXCITED for class balance. I am not sure whether the author's comparison is valid: did [2] also use EXCITED with 5-fold cross-validation?

3. Deep Learning of Segment-Level Feature Representation with Multiple Instance Learning for Utterance-Level Speech Emotion Recognition

  1. Segment-Level
    Each segment is classified first; the utterance-level classification result is an aggregation
    of the segment-level decisions.
    Findings: (1) the aggregation of segment-level decisions provides richer information than the statistics over the low-level descriptors (LLDs) across the whole utterance;
    (2) automatic feature learning outperforms manual features.
    SegMLP takes IS09 (manually designed perceptual features) as input, while SegCNN takes log Mel filterbanks; each is followed by an ELM, SVM, or RF classifier, for six experiments in total.
    Automatic feature learning outperforms manually designed perceptual features.
    For aggregation, the decision matrix f is fed into the three kinds of classification networks.
  2. multiple instance learning (MIL)
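The segment-to-utterance aggregation above can be illustrated with the simplest MIL-style pooling: average the segment-level posteriors, then take the argmax. The posterior values below are hypothetical, and mean pooling is just one common choice, not necessarily the paper's exact aggregation.

```python
import numpy as np

# Hypothetical segment-level posteriors for one utterance:
# 3 segments x 4 emotion classes (each row sums to 1).
seg_post = np.array([[0.6, 0.2, 0.1, 0.1],
                     [0.5, 0.3, 0.1, 0.1],
                     [0.2, 0.5, 0.2, 0.1]])

# Mean pooling over segments, then argmax for the utterance label.
utt_post = seg_post.mean(axis=0)
utt_label = int(np.argmax(utt_post))
# utt_post ~ [0.433, 0.333, 0.133, 0.100] -> utt_label == 0
```

In the paper's setup, the matrix of segment decisions (f) is instead fed to a trained back end (ELM, SVM, or RF), which learns the aggregation rather than fixing it to a mean.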

Experiments

Two corpora: CASIA and IEMOCAP.
