- Improving multilingual speech emotion recognition by combining acoustic features in a three-layer model
- Abstract: Publication date: July 2019. Source: Speech Communication, Volume 110. Author(s): Xingfeng Li, Masato Akagi. This study presents a scheme for multilingual speech emotion recognition. Determining the emotion of speech generally relies on specific training data, and a different target speaker or language may present significant challenges. In this regard, we first explore 215 acoustic features from emotional speech. Second, we carry out speaker normalization and feature selection to develop a shared standard acoustic parameter set for multiple languages. Third, we use a three-layer model composed of acoustic features, semantic primitives, and emotion dimensions to map acoustics into emotion dimensions. Finally, we classify the continuous emotion dimensional values into basic categories using logistic model trees. The proposed approach was tested on Japanese, German, Chinese, and English emotional speech corpora. Recognition performance was examined in cross-speaker and cross-corpus evaluations, which showed that our strategy is particularly well suited to multilingual emotion recognition even with a different speaker or language. The experimental results were found to be reasonably comparable with those of monolingual emotion recognizers used as a reference.
- New insights on the optimality of parameterized Wiener filters for speech enhancement applications
- Abstract: Publication date: May 2019. Source: Speech Communication, Volume 109. Author(s): Rafael Attili Chiea, Márcio Holsbach Costa, Guillaume Barrault. This work presents a unified framework for defining a family of noise reduction techniques for speech enhancement applications. The proposed approach provides a unique theoretical foundation for some widely applied soft and hard time-frequency masks, encompassing the well-known Wiener filter and the heuristically designed binary mask. These techniques can now be considered optimal solutions of the same minimization problem. The proposed cost function is defined by two design parameters that not only establish a desired trade-off between noise reduction and speech distortion, but also provide an insightful relationship with the mask morphology. This characteristic may be useful for applications that require online adaptation of the suppression function according to variations of the acoustic scenario. Simulation examples indicate that the derived conformable suppression mask has approximately the same quality and intelligibility performance as the classical heuristically defined parametric Wiener filter. The proposed approach may be of special interest for real-time embedded speech enhancement applications such as hearing aids and cochlear implants.
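As a numeric illustration of the soft/hard mask family this framework covers, the sketch below assumes the classical parametric Wiener gain G = (xi / (xi + mu))**beta per time-frequency bin, with a hard-thresholded binary mask as a limiting case; the paper's specific cost function and parameter mapping are not reproduced.

```python
# Parametric Wiener gain and binary mask over a priori SNR values (illustrative).
import numpy as np

def parametric_wiener_gain(xi, mu=1.0, beta=1.0):
    """Soft suppression gain per time-frequency bin; (mu, beta) set the
    trade-off between noise reduction and speech distortion."""
    return (xi / (xi + mu)) ** beta

def binary_mask(xi, threshold_db=0.0):
    """Hard mask: keep bins whose a priori SNR exceeds a threshold."""
    return (10.0 * np.log10(xi) > threshold_db).astype(float)

xi = np.array([0.1, 1.0, 10.0])            # example a priori SNR values
print(parametric_wiener_gain(xi))           # classical Wiener filter (mu = beta = 1)
print(parametric_wiener_gain(xi, beta=2))   # more aggressive suppression
print(binary_mask(xi))                      # hard-mask limit
```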
- Low-rank and sparse subspace modeling of speech for DNN based acoustic modeling
- Abstract: Publication date: May 2019. Source: Speech Communication, Volume 109. Author(s): Pranay Dighe, Afsaneh Asaei, Hervé Bourlard. Towards the goal of improving acoustic modeling for automatic speech recognition (ASR), this work investigates the modeling of senone subspaces in deep neural network (DNN) posteriors using low-rank and sparse modeling approaches. While DNN posteriors are typically very high-dimensional, recent studies have shown that the true class information is actually embedded in low-dimensional subspaces. Thus, a matrix of all posteriors belonging to a particular senone class is expected to have a very low rank. In this paper, we exploit Principal Component Analysis and Compressive Sensing based dictionary learning for low-rank and sparse modeling of senone subspaces, respectively. Our hypothesis is that the principal components of the DNN posterior space (termed eigen-posteriors in this work) and Compressive Sensing dictionaries can act as suitable models to extract the well-structured low-dimensional latent information and discard the undesirable high-dimensional unstructured noise present in the posteriors. The enhanced DNN posteriors thus obtained are used as soft targets for training better acoustic models to improve ASR. In this context, our approach also enables improved distant speech recognition by mapping far-field acoustic features to low-dimensional senone subspaces learned from near-field features. Experiments are performed on the AMI Meeting corpus in both close-talk (IHM) and far-field (SDM) microphone settings, where acoustic models trained using enhanced DNN posteriors outperform conventional hard-target based hybrid DNN-HMM systems. An information-theoretic analysis is also presented to show how low-rank and sparse enhancement modify the DNN posterior space to better match the assumptions of the hidden Markov model (HMM) backend.
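A minimal sketch of the low-rank ("eigen-posterior") step described above: posterior vectors of one senone class are projected onto a few principal components and reconstructed, discarding the unstructured residual. The dimensions, component count, and renormalization are illustrative assumptions; the Compressive Sensing dictionary branch is not shown.

```python
# Low-rank reconstruction of DNN posteriors for one senone class (illustrative).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
posteriors = rng.dirichlet(np.ones(500), size=1000)   # placeholder posterior vectors

pca = PCA(n_components=10).fit(posteriors)            # leading "eigen-posteriors"
enhanced = pca.inverse_transform(pca.transform(posteriors))

# Keep the reconstruction a valid posterior: clip negatives and renormalize.
enhanced = np.clip(enhanced, 1e-8, None)
enhanced /= enhanced.sum(axis=1, keepdims=True)       # soft targets for retraining
```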
- Analysis of phonation onsets in vowel production, using information from glottal area and flow estimate
- Abstract: Publication date: May 2019. Source: Speech Communication, Volume 109. Author(s): Tiina Murtola, Jarmo Malinen, Ahmed Geneid, Paavo Alku. A multichannel dataset comprising high-speed videoendoscopy images, electroglottography, and free-field microphone signals was used to investigate phonation onsets in vowel production. Use of the multichannel data enabled simultaneous analysis of the two main aspects of phonation: glottal area, extracted from the high-speed videoendoscopy images, and glottal flow, estimated from the microphone signal using glottal inverse filtering. Pulse-wise parameterization of the glottal area and glottal flow indicates that there is no single dominant way to initiate quasi-stable phonation. The trajectories of fundamental frequency and normalized amplitude quotient, extracted from glottal area and estimated flow, may differ markedly during onsets. The location and steepness of the amplitude envelopes of the two signals were observed to be closely related, and quantitative analysis supported the hypothesis that glottal area and flow do not carry essentially different amplitude information during vowel onsets. Linear models were used to predict phonation onset times from the characteristics of the subsequent steady phonation. The phonation onset time of glottal area was found to be well predicted by a combination of the fundamental frequency and the normalized amplitude quotient of the glottal flow, as well as the gender of the speaker. For the phonation onset time of glottal flow, the best linear model was obtained using the fundamental frequency and the normalized amplitude quotient of the glottal flow as predictors.
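For the pulse-wise parameterization mentioned above, a small sketch of the normalized amplitude quotient (NAQ) for a single glottal flow (or area) pulse, assuming the standard definition NAQ = f_ac / (d_peak * T0); the study's pulse segmentation and inverse filtering are not reproduced.

```python
# NAQ of one glottal pulse: peak-to-peak amplitude over the magnitude of the
# steepest closing slope, normalized by the pulse period (illustrative).
import numpy as np

def naq(pulse, fs):
    f_ac = pulse.max() - pulse.min()            # peak-to-peak pulse amplitude
    d_peak = abs(np.diff(pulse).min()) * fs     # negative peak of the derivative
    t0 = len(pulse) / fs                        # pulse period in seconds
    return f_ac / (d_peak * t0)

fs = 16_000
pulse = np.sin(np.linspace(0, np.pi, 160)) ** 2   # toy 10 ms pulse shape
print(naq(pulse, fs))
```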
- Speech-Driven Animation with Meaningful Behaviors
- Abstract: Publication date: Available online 5 April 2019. Source: Speech Communication. Author(s): Najmeh Sadoughi, Carlos Busso. Conversational agents (CAs) play an important role in human-computer interaction (HCI). Creating believable movements for CAs is challenging, since the movements have to be meaningful and natural, reflecting the coupling between gestures and speech. Past studies have mainly relied on rule-based or data-driven approaches. Rule-based methods focus on creating meaningful behaviors that convey the underlying message, but the gestures cannot be easily synchronized with speech. Data-driven approaches, especially speech-driven models, can capture the relationship between speech and gestures, but they create behaviors that disregard the meaning of the message. This study proposes to bridge the gap between these two approaches, overcoming their limitations. The approach builds a dynamic Bayesian network (DBN) in which a discrete variable is added to condition the behaviors on an underlying constraint. The study implements and evaluates the approach with two constraints: discourse functions and prototypical behaviors. By constraining on the discourse functions (e.g., questions), the model learns the characteristic behaviors associated with a given discourse class, learning the rules from the data. By constraining on prototypical behaviors (e.g., head nods), the approach can be embedded in a rule-based system as a behavior realizer, creating trajectories that are time-synchronized with speech. The study proposes a DBN structure and a training approach that (1) models the cause-effect relationship between the constraint and the gestures, and (2) captures the differences in the behaviors across constraints by enforcing sparse transitions between shared and exclusive states per constraint. Objective and subjective evaluations demonstrate the benefits of the proposed approach over an unconstrained baseline model.
- Output-based Speech Quality Assessment Using Autoencoder and Support Vector Regression
- Abstract: Publication date: Available online 2 April 2019. Source: Speech Communication. Author(s): Jing Wang, Yahui Shan, Xiang Xie, Jingming Kuang. Output-based speech quality assessment methods have been widely used and have received increasing attention since they do not need undistorted signals as a reference. In order to obtain a high correlation between predicted scores and subjective results, this paper presents a new speech quality assessment method that estimates the quality of degraded speech without the reference speech. Bottleneck features are extracted with an autoencoder, and support vector regression is chosen as the mapping model from the objective representation to subjective scores. Experiments are conducted on a dataset containing various degraded speech signals and subjective listening scores. The proposed method takes advantage of the autoencoder's ability to form a good representation of its input, which can be better mapped to the Mean Opinion Score. The experimental results show that, compared with the standardized ITU-T P.563 method and another deep learning-based assessment method, the proposed method yields a higher correlation coefficient between predicted scores and subjective scores.
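A minimal sketch of the described pipeline under synthetic data: a small autoencoder learns a bottleneck representation of degraded-speech features, and support vector regression maps the bottleneck features to MOS. The feature dimension, network sizes, and training loop are illustrative assumptions, not the paper's configuration.

```python
# Autoencoder bottleneck features + SVR mapping to MOS (illustrative sketch).
import numpy as np
import torch
import torch.nn as nn
from sklearn.svm import SVR

rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 40)).astype(np.float32)   # placeholder spectral features
mos = rng.uniform(1, 5, size=500)                        # placeholder MOS labels

class AE(nn.Module):
    def __init__(self, dim=40, bottleneck=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, bottleneck))
        self.dec = nn.Sequential(nn.Linear(bottleneck, 32), nn.ReLU(), nn.Linear(32, dim))
    def forward(self, x):
        return self.dec(self.enc(x))

ae, x = AE(), torch.from_numpy(feats)
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
for _ in range(200):                          # reconstruction training
    opt.zero_grad()
    nn.functional.mse_loss(ae(x), x).backward()
    opt.step()

bottleneck = ae.enc(x).detach().numpy()       # bottleneck (objective) representation
svr = SVR().fit(bottleneck, mos)              # map bottleneck features to MOS
print(svr.predict(bottleneck[:5]))
```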
- Speech Enhancement using ultrasonic Doppler sonar
- Abstract: Publication date: Available online 2 April 2019. Source: Speech Communication. Author(s): Ki-Seung Lee. The quality of speech reproduced using conventional single-channel speech enhancement schemes is seriously affected by the acoustic noise level. Nonacoustic sensors have the ability to reveal certain speech attributes that are lost in noisy acoustic signals. This study validated the use of ultrasonic Doppler frequency shifts caused by facial movements for enhancing audio speech contaminated by high levels of acoustic noise. A 40 kHz ultrasonic beam is directed at a speaker's face. The received signals were first demodulated and converted to a spectral feature parameter. The spectral feature derived from the ultrasonic Doppler signal (UDS) was concatenated with spectral features from noisy speech, which were then used to estimate the magnitude spectrum of clean speech. A nonlinear regression approach was employed in this estimation, where the relationship between the audio-UDS features and the corresponding clean speech is represented by deep neural networks (DNNs). The feasibility of the proposed enhancement method was tested on a one-hour audio-UDS corpus and four different types of noise data. The results showed that, both objectively and subjectively, the best performance was obtained when the audio and UDS were used cooperatively. A correlation analysis was also carried out to investigate the usefulness of multi-directional ultrasonic sensing. The results showed that performance was affected by the number of adopted UDS channels, particularly at low SNRs.
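A small sketch of the demodulation step, assuming I/Q mixing of the received 40 kHz tone down to baseband so that Doppler shifts caused by facial motion appear as low-frequency content; the sampling rate, low-pass cutoff, and synthetic echo are illustrative, and the paper's spectral feature extraction and DNN regression are not reproduced.

```python
# I/Q demodulation of a toy ultrasonic echo carrying a Doppler shift (illustrative).
import numpy as np
from scipy.signal import butter, filtfilt

fs, fc = 96_000, 40_000                         # sample rate, carrier frequency (Hz)
t = np.arange(0, 1.0, 1 / fs)
received = np.cos(2 * np.pi * (fc + 30) * t)    # toy echo with a 30 Hz Doppler shift

i = received * np.cos(2 * np.pi * fc * t)       # in-phase mixing
q = -received * np.sin(2 * np.pi * fc * t)      # quadrature mixing
b, a = butter(4, 500 / (fs / 2))                # low-pass keeps the baseband Doppler band
baseband = filtfilt(b, a, i) + 1j * filtfilt(b, a, q)
doppler_phase = np.unwrap(np.angle(baseband))   # phase trajectory reflects facial motion
```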
- Speaker recognition using PCA-based feature transformation
- Abstract: Publication date: Available online 2 April 2019. Source: Speech Communication. Author(s): Ahmed Isam Ahmed, John Chiverton, David Ndzi, Victor Becerra. This paper introduces a Weighted-Correlation Principal Component Analysis (WCR-PCA) for efficient transformation of speech features in speaker recognition. A Recurrent Neural Network (RNN) technique is also introduced to perform the weighted PCA. The weights are taken as the log-likelihood values from a fitted Single Gaussian Background Model (SG-BM). For speech features, we show that there are large differences between feature variances, which makes covariance-based PCA less optimal. A comparative study of speaker recognition performance is presented using weighted and unweighted correlation- and covariance-based PCA. Extensions to improve the extraction of MFCC and LPCC features of speech are also proposed: Odd-Even filter bank MFCC (OE-MFCC) and Multitaper-Fitted LPCC. The methodologies are evaluated for the i-vector speaker recognition system. A subset of the 2010 NIST speaker recognition evaluation set is used in the performance testing, in addition to evaluations on the VoxCeleb1 dataset. A relative improvement of 44% in terms of EER is found in system performance using the NIST data, and 18% using the VoxCeleb1 dataset.
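A sketch of a weighted correlation PCA in the spirit of the description above: frames are weighted (here by log-likelihoods under a single Gaussian background model), a weighted correlation matrix of standardized features is formed, and its leading eigenvectors define the transform. The weight scaling and toy data are assumptions, and the paper's RNN-based computation is not reproduced.

```python
# Weighted correlation PCA over standardized feature frames (illustrative).
import numpy as np
from scipy.stats import multivariate_normal

X = np.random.default_rng(0).normal(size=(1000, 20))   # placeholder MFCC-like frames

# Single Gaussian background model fitted on the same frames (illustrative).
bg = multivariate_normal(mean=X.mean(axis=0), cov=np.cov(X, rowvar=False))
w = bg.logpdf(X)
w = (w - w.min()) / (w.max() - w.min() + 1e-12)         # scale weights to [0, 1]
w /= w.sum()

Z = (X - X.mean(axis=0)) / X.std(axis=0)                # standardize -> correlation, not covariance
C = (Z * w[:, None]).T @ Z                              # weighted correlation matrix
eigvals, eigvecs = np.linalg.eigh(C)
transform = eigvecs[:, ::-1][:, :10]                    # keep the top 10 components
X_transformed = Z @ transform
```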
- Text normalization using memory augmented neural networks
- Abstract: Publication date: May 2019. Source: Speech Communication, Volume 109. Author(s): Subhojeet Pramanik, Aman Hussain. We perform text normalization, i.e., the transformation of words from the written to the spoken form, using a memory augmented neural network. With the addition of a dynamic memory access and storage mechanism, we present a neural architecture that serves as a language-agnostic text normalization system while avoiding the kind of unacceptable errors made by LSTM-based recurrent neural networks. By successfully reducing the frequency of such mistakes, we show that this novel architecture is indeed a better alternative. Our proposed system requires significantly less data, training time, and compute resources. Additionally, we perform data up-sampling, circumventing the data sparsity problem in some semiotic classes, to show that sufficient examples in any particular class can improve the performance of our text normalization system. Although a few occurrences of these errors still remain in certain semiotic classes, we demonstrate that memory augmented networks with meta-learning capabilities can open many doors to a superior text normalization system.
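A minimal sketch of the data up-sampling step mentioned above: under-represented semiotic classes are resampled with replacement until each class reaches the size of the largest one. The example strings, class labels, and use of scikit-learn's resample are illustrative assumptions.

```python
# Up-sample minority semiotic classes to the size of the largest class (illustrative).
import numpy as np
from sklearn.utils import resample

examples = np.array(["12 Jan 2019", "3 km", "Dr.", "2019-05-01", "5 kg"])
classes = np.array(["DATE", "MEASURE", "ABBREV", "DATE", "MEASURE"])

target = int(max(np.bincount(np.unique(classes, return_inverse=True)[1])))
balanced = np.concatenate([
    resample(examples[classes == c], replace=True, n_samples=target, random_state=0)
    for c in np.unique(classes)
])
print(balanced)
```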
- Speech reverberation suppression for time-varying environments using weighted prediction error method with time-varying autoregressive model
- Abstract: Publication date: May 2019. Source: Speech Communication, Volume 109. Author(s): Mahdi Parchami, Hamidreza Amindavar, Wei-Ping Zhu. In this paper, a novel approach to the task of speech reverberation suppression in non-stationary (changing) acoustic environments is proposed. The suggested approach is based on the popular weighted prediction error (WPE) method; however, instead of considering fixed reverberation prediction weights, our method adopts the more generic time-varying autoregressive (TV-AR) model, which allows dynamic estimation and updating of the prediction weights over time. We use an initial estimate of the prediction weights to optimally select the TV-AR model order and to calculate the TV-AR coefficients. Next, by properly interpolating the calculated coefficients, we obtain the final estimate of the reverberation prediction weights. Performance evaluation of the proposed approach is presented not only for fixed acoustic rooms but also for environments where the source and/or sensors are moving. Our experiments reveal further reverberation suppression as well as higher quality in the enhanced speech samples in comparison with the recent literature on speech dereverberation.
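For context, a simplified single-channel WPE iteration in one STFT frequency bin with conventional fixed prediction weights; the delay, filter length, iteration count, and synthetic observation are illustrative, and the paper's time-varying AR extension of the weights is not reproduced.

```python
# Fixed-weight WPE in one frequency bin: iteratively re-estimate prediction
# weights by weighted least squares and subtract the predicted late reverberation.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200) + 1j * rng.normal(size=200)   # placeholder STFT frames of one bin
delay, taps, iters = 3, 10, 3

# Delayed observation matrix: X[t, k] = x[t - delay - k]
X = np.zeros((len(x), taps), dtype=complex)
for k in range(taps):
    X[delay + k:, k] = x[:len(x) - delay - k]

d = x.copy()
for _ in range(iters):
    lam = np.maximum(np.abs(d) ** 2, 1e-8)                  # per-frame variance estimate
    Xw = X / lam[:, None]
    g = np.linalg.solve(Xw.conj().T @ X, Xw.conj().T @ x)   # weighted LS prediction weights
    d = x - X @ g                                           # dereverberated (desired) signal
```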
- Temporal envelope cues and simulations of cochlear implant signal processing
- Abstract: Publication date: Available online 21 March 2019. Source: Speech Communication. Author(s): Raymond L. Goldsworthy. Conventional signal processing implemented on clinical cochlear implant (CI) sound processors is based on envelope signals extracted from overlapping frequency regions. Conventional strategies do not encode temporal envelope or temporal fine-structure cues with high fidelity. In contrast, several research strategies have recently been developed to enhance the encoding of temporal envelope and fine-structure cues. The present study examines the salience of temporal envelope cues when encoded into vocoder representations of CI signal processing. Normal-hearing listeners were evaluated on measures of speech reception, speech quality ratings, and spatial hearing when listening to vocoder representations of CI signal processing. Conventional vocoder techniques using envelope signals with noise- or tone-excited reconstruction were evaluated in comparison to a novel approach based on impulse-response reconstruction. A variation of this impulse-response approach was based on a research strategy, the Fundamentally Asynchronous Stimulus Timing (FAST) algorithm, designed to improve the temporal precision of envelope cues. The results indicate that the introduced impulse-response approach, combined with the FAST algorithm, produces results similar to the conventional vocoder approaches on speech reception measures, while providing significantly better sound quality and spatial hearing outcomes. This novel approach for simulating how temporal envelope cues are encoded into CI stimulation has potential for examining diverse aspects of hearing, particularly musical pitch perception and spatial hearing.
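For reference, a sketch of the conventional noise-excited envelope vocoder that serves as a baseline here: band-pass analysis, per-band envelope extraction, and re-synthesis by modulating band-limited noise. The channel count, band edges, and synthetic input are illustrative assumptions; the impulse-response reconstruction and FAST-based variants described in the study are not reproduced.

```python
# Noise-excited envelope vocoder with 8 analysis bands (illustrative sketch).
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

fs = 16_000
t = np.arange(0, 1.0, 1 / fs)
speech = np.sin(2 * np.pi * 150 * t) * (1 + 0.5 * np.sin(2 * np.pi * 3 * t))  # placeholder input

edges = np.geomspace(100, 7000, 9)                      # 8 band edges (Hz)
rng = np.random.default_rng(0)
vocoded = np.zeros_like(speech)
for lo, hi in zip(edges[:-1], edges[1:]):
    sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
    band = sosfiltfilt(sos, speech)
    envelope = np.abs(hilbert(band))                          # temporal envelope of the band
    carrier = sosfiltfilt(sos, rng.normal(size=speech.size))  # band-limited noise carrier
    vocoded += envelope * carrier
```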