Abstract: As a new type of biometric recognition technology, speaker recognition is gaining increasing attention because of its advantages in remote authentication. In this paper, we construct an end-to-end speaker recognition model named GAPCNN, in which a convolutional neural network extracts speaker embeddings from spectrograms and speaker recognition is performed using the cosine similarity of the embeddings. In addition, we use global average pooling instead of traditional temporal average pooling to accommodate different voice lengths. We use the 'dev' set of VoxCeleb2 for training, evaluate the model on the test set of VoxCeleb1, and obtain an equal error rate (EER) of 4.04%. Furthermore, we fuse our GAPCNN with the x-vector model and the thin-ResNet model with GhostVLAD, and obtain an EER of 3.01%, which is better than any of the three models alone. This indicates that GAPCNN is an important complement to the x-vector model and the thin-ResNet model with GhostVLAD. PubDate: 2022-05-21
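A minimal sketch (not the authors' code; architecture details are assumed) of the two ideas highlighted in the abstract: global average pooling over variable-length spectrograms, and verification by cosine similarity of the pooled embeddings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GAPSpeakerEncoder(nn.Module):
    def __init__(self, embed_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.fc = nn.Linear(128, embed_dim)

    def forward(self, spec):            # spec: (batch, 1, freq, time); time may vary
        h = self.conv(spec)             # (batch, 128, f', t')
        h = h.mean(dim=(2, 3))          # global average pooling over freq and time
        return F.normalize(self.fc(h), dim=-1)

def verification_score(enc, spec_a, spec_b):
    # cosine similarity between the two utterance embeddings
    return F.cosine_similarity(enc(spec_a), enc(spec_b)).item()
```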
Abstract: Natural Language Processing (NLP) has many applications, such as speech recognition, speech understanding, and speech synthesis. Several approaches have been proposed in the literature for dealing with NLP. This paper describes an ongoing research project that tackles Arabic speech synthesis using multi-agent system techniques. The system consists of five modules (agents): the User Interface Agent (UIA), the Facilitator Agent (FA), the Preprocessing Agent (PPA), the Orthographic and Phonetic Transcription Agent, and the Speech Generation Agent. These agents communicate with each other to construct agent sub-societies representing the user input. All the agents are cognitive, work together, and communicate with the knowledge base and the sound-segments database to generate Arabic speech signals. We used 800 Arabic sentences and asked 10 listeners with different levels of knowledge of the Arabic language to carry out the perceptual evaluation. Overall, the system achieves a success rate of 86% on the set of 800 tested sentences. PubDate: 2022-05-17
Abstract: There is a pressing need to extract information from the non-linguistic features of audio sources, a task termed computational paralinguistics, and it has driven much of the rise of speech technology over the past few decades. This research concentrates on extracting robust features that characterize speech data. The features are analysed spectrally, in a way that reflects the auditory system. The processing pipeline consists of pre-processing, feature extraction, and classification. The input is first digitized at a 16 kHz sampling frequency. Spectral features are extracted with minimal mean square error to improve reconstruction ability and eliminate redundancy. Finally, a deep neural network is adopted for multi-class classification. The simulation is performed in the MATLAB 2020a environment, and the empirical outcomes are compared with existing approaches. Metrics such as mean square error, accuracy, signal-to-noise ratio (SNR), and the number of features retained are computed. The proposed model shows a favourable trade-off compared with prevailing approaches. The outcomes demonstrate a better recognition rate and offer significant guidance in selecting the most influential features. PubDate: 2022-05-16
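A minimal sketch (assumed log-mel features and a small scikit-learn network, not the authors' MATLAB pipeline) of the described flow: 16 kHz audio, spectral feature extraction, then multi-class classification with a neural network.

```python
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier

def spectral_features(path):
    y, sr = librosa.load(path, sr=16000)                   # digitize / resample at 16 kHz
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40)
    logmel = librosa.power_to_db(mel)
    return logmel.mean(axis=1)                             # one 40-d vector per utterance

# X = np.stack([spectral_features(p) for p in wav_paths]); y = labels   # hypothetical inputs
# clf = MLPClassifier(hidden_layer_sizes=(128, 64)).fit(X, y)
```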
Abstract: This paper presents a proposed approach for cancelable biometric recognition applied to speaker identification. Both comb and inverse comb filters are used to distort speech signals intentionally, prior to and after feature extraction. The objective of this approach is to generate protected templates of the speech signals representing speakers, without exposing the original speech signals or features to attackers. In addition, a satisfactory speaker identification rate is retained. If the database of speaker features is compromised, the comb or inverse comb filter orders can be changed to generate new features for the same speakers. Simulation results reveal that speakers can be identified from their deteriorated speech signals, which proves the robustness of the proposed cancelable speaker identification system. The comb filter is chosen because it is a multi-band filter with multiple nulls in its frequency response; its inversion is difficult due to the nulling effect, and this characteristic can induce non-invertible distortion in speech signals. PubDate: 2022-04-24
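A minimal sketch (assumed filter forms, not the authors' exact design) of distorting speech with a feed-forward comb filter and its inverse (feedback) counterpart; the filter order D acts as the changeable "key" of the cancelable template.

```python
import numpy as np
from scipy.signal import lfilter

def comb_distort(x, D=40, g=0.7):
    # H(z) = 1 + g*z^(-D): multiple nulls across the spectrum
    b = np.zeros(D + 1); b[0] = 1.0; b[D] = g
    return lfilter(b, [1.0], x)

def inverse_comb_distort(x, D=40, g=0.7):
    # H(z) = 1 / (1 + g*z^(-D)): the feedback (inverse comb) form
    a = np.zeros(D + 1); a[0] = 1.0; a[D] = g
    return lfilter([1.0], a, x)

# Revoking a compromised template amounts to re-running with a new order D.
```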
Abstract: This work presents the analysis and classification of emotional speech signals using a novel technique, variational mode decomposition, applied to recorded speech samples. The vocal tract changes its characteristics during the production of speech under different emotions. Variational mode decomposition decomposes the speech signal into various modes; the input speech frames are divided into five intrinsic mode functions (modes). Statistical parameters such as the interquartile range, median absolute deviation, and energy of the mode functions are extracted, and the mean instantaneous frequency of each mode function is also calculated; these are used as features for speech emotion recognition. The extracted features are used to classify emotional speech signals using a support vector machine with a linear kernel. This approach achieves a recognition accuracy of 31.2% using only the mode centre frequency and its statistical parameters as features. Adding Mel-frequency cepstral coefficients of the speech to the mode centre frequency and its statistical parameters increases the recognition accuracy by 12.6%. When the feature set is further extended with spectral statistical parameters of speech, which preserve the emotional content in the frequency domain, 61.2% accuracy is achieved in recognizing emotions using support vector machines with radial basis function kernels. The mode centre frequency and its statistical features contribute 31.2% towards recognizing emotions from speech signals. This method provides accuracies of 85.81% and 69.13%, respectively, in recognizing two and four emotions. In recognizing eight emotions, it provides an increase of 5% in recognition rate compared to the bidirectional long short-term memory convolutional capsule neural network proposed by Jalal et al., giving better results than multitask feature selection and convolutional neural network methods. PubDate: 2022-04-21
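A minimal sketch (not the authors' code) of the feature step described: per-mode interquartile range, median absolute deviation, and energy, fed to an SVM. The VMD step itself is assumed to have already produced `modes` (for example, a 5 x N array of intrinsic mode functions).

```python
import numpy as np
from scipy.stats import iqr, median_abs_deviation
from sklearn.svm import SVC

def mode_features(modes):
    feats = []
    for m in modes:                        # one row per intrinsic mode function
        feats += [iqr(m), median_abs_deviation(m), np.sum(m ** 2)]
    return np.array(feats)

# X: list of per-utterance mode arrays, y: emotion labels (hypothetical inputs)
# clf = SVC(kernel="linear").fit([mode_features(m) for m in X], y)
```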
Abstract: Laryngeal pathologies have a significant influence on quality of life, verbal communication, and a person's profession. Most organic vocal pathologies affect the shape and vibration pattern of the vocal fold(s). Many automatic, computer-based, non-intrusive systems for rapid detection and progression tracking have been introduced in recent years. This paper proposes an integrated wavelet-based voice condition evaluation framework that is independent of human bias and language. The true voice source is extracted using quasi-closed phase (QCP) glottal inverse filtering to capture the altered vocal fold dynamics. The voice source is decomposed using the stationary wavelet transform (SWT), and fundamental-frequency-independent statistical and energy measures are extracted from each spectral sub-band to quantify the voice source. As multilevel stationary wavelet decomposition leads to a high-dimensional feature vector, an information-gain-based feature ranking process is used to pick out the most discerning features. Speech samples of the sustained vowel /a/ drawn from four distinct databases in German, Spanish, English, and Arabic are used to perform different intra- and cross-database experiments. The effect of the decomposition level on detection and classification accuracy is examined, and the fifth level of decomposition is found to yield the highest recognition rate. The achieved classifier performance metrics suggest that SWT-based energy and statistical features reveal more useful information on pathological voices, and thus the proposed system can be used as a complementary tool for the clinical diagnosis of laryngeal pathologies. PubDate: 2022-04-21
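A minimal sketch (assumed wavelet and settings) of the SWT sub-band feature extraction described: decompose the glottal source to level 5 and compute energy plus simple statistics per sub-band.

```python
import numpy as np
import pywt

def swt_subband_features(source, wavelet="db4", level=5):
    # pywt.swt needs a length divisible by 2**level, so pad with zeros
    pad = (-len(source)) % (2 ** level)
    x = np.pad(source, (0, pad))
    feats = []
    for cA, cD in pywt.swt(x, wavelet, level=level):
        for band in (cA, cD):
            feats += [np.sum(band ** 2),          # sub-band energy
                      np.mean(band), np.std(band), np.var(band)]
    return np.array(feats)
```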
Abstract: In this paper, we present a set of experiments aimed at improving the recognition of spoken digits for under-resourced dialects of the Maghreb region, using a hybrid system. Integrating a dialect identification module into an Automatic Speech Recognition (ASR) system has shown its efficiency in previous work. In order to make the ASR system able to recognize digits spoken in different dialects, we trained our hybrid system on the Moroccan Berber Dialect (MBD), the Moroccan Arabic Dialect (MAD), and the Algerian Arabic Dialect (AAD), in addition to Modern Standard Arabic. We investigated five machine-learning-based classifiers and two deep learning models: the first based on a Convolutional Neural Network (CNN), and the second using two pre-trained residual deep neural networks (ResNet50 and ResNet101). The findings show that the CNN model outperforms the other proposed methods and consequently enhances the performance of the spoken digit recognition system by 20% for both the Algerian and Moroccan dialects. PubDate: 2022-04-15
Abstract: Deep neural networks have shown significant progress in biometric applications, but deep learning networks are particularly vulnerable to adversarial examples, which are manipulated input data. Adversarial attacks cause the biometric system to fail in terms of performance. An effective defensive mechanism against adversarial attacks is introduced in the proposed work, which is used to detect adversarial iris examples. The proposed defensive mechanism is based on the Discrete Wavelet Transform (DWT) and examines the high and mid spectrum of the wavelet sub-bands. The model then recreates various denoised versions of the iris images based on the DWT. A U-Net-based deep convolutional architecture is used for the subsequent classification. The proposed process is tested by classifying adversarial iris images produced by attacks such as FGSM, DeepFool, and iGSM. An experimental analysis on a benchmark iris image database, namely IITD, yields excellent results, with an average accuracy of 94 percent. The results show that the proposed strategy performs better at detecting adversarial attacks than other state-of-the-art defensive models. PubDate: 2022-04-12
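A minimal sketch (assumed wavelet and attenuation choice) of producing a DWT-denoised version of an iris image by attenuating the high- and mid-frequency detail sub-bands before reconstruction.

```python
import numpy as np
import pywt

def dwt_denoise(image, wavelet="haar", keep=0.5):
    cA, (cH, cV, cD) = pywt.dwt2(image.astype(float), wavelet)
    # attenuate detail sub-bands, where adversarial perturbations tend to live
    cH, cV, cD = keep * cH, keep * cV, keep * cD
    return pywt.idwt2((cA, (cH, cV, cD)), wavelet)

# Several versions (different `keep` factors or wavelets) could then be fed
# to the U-Net-based classifier described in the abstract.
```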
Abstract: Word embeddings map words into vectors in an N-dimensional space. ArSphere is an approach that designs word embeddings for the Arabic language. It overcomes one of the shortcomings of word embeddings (for English too), namely their inability to handle opposites and to differentiate them from unrelated word pairs. To achieve that goal, the vectors are embedded onto the unit sphere rather than into the entire space. The sphere embedding is suitable in the sense that polarity can be addressed by embedding vectors at opposite poles of the sphere. The proposed approach has several advantages. It utilizes the extensive resources developed by linguistic experts, including classic dictionaries, in contrast to the prevailing approach of designing word embeddings using the concept of word co-occurrence. Another advantage is that it succeeds in distinguishing between synonyms, antonyms, and unrelated word pairs. An algorithm to design the word embedding has been derived; it is a simple relaxation algorithm. Being fast, the algorithm allows easy updates of the word-vector collection when adding new words or synonyms. The vectors are tested against a number of other published models, and the results show that ArSphere outperforms them. PubDate: 2022-03-03 DOI: 10.1007/s10772-022-09966-9
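A minimal sketch (illustrative, not the ArSphere relaxation algorithm itself) of the geometric idea: vectors constrained to the unit sphere let synonyms sit close together (cosine near +1), antonyms sit near opposite poles (cosine near -1), and unrelated words sit near orthogonality (cosine near 0).

```python
import numpy as np

def to_sphere(v):
    return v / np.linalg.norm(v)

def relation(u, v, syn_thr=0.6, ant_thr=-0.6):
    cos = float(np.dot(to_sphere(u), to_sphere(v)))
    if cos >= syn_thr:
        return "synonym-like"
    if cos <= ant_thr:
        return "antonym-like"
    return "unrelated"

# Example with hypothetical 3-d vectors: an antonym pair placed at opposite poles
hot, cold = np.array([0.9, 0.1, 0.0]), np.array([-0.9, -0.1, 0.0])
print(relation(hot, cold))   # -> "antonym-like"
```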
Abstract: In this paper, the effect of emotional speech on the performance of ASR systems trained on neutral speech is studied. Prosody-modification-based data augmentation is explored to compensate for the degradation in ASR performance caused by emotional speech. The primary motivation is to develop a Telugu ASR system that is least affected by these emotion-related intrinsic speaker acoustic variations. The two factors contributing to intrinsic speaker variability that this research focuses on are the fundamental frequency [\((F_0)\), or pitch] and speaking-rate variations. To simulate the ASR task, we trained our ASR system on neutral speech and tested it on both emotional and neutral speech. Compared to the performance metrics for neutral speech at the testing stage, the metrics for emotional speech are severely degraded. This degradation is due to the differences in prosody and speaking-rate parameters between neutral and emotional speech. To overcome this problem, the prosody and speaking-rate parameters are varied and modified to create augmented versions of the training data. The original and augmented versions of the training data are pooled together, and the system is re-trained in order to capture greater emotion-specific variation. For the Telugu ASR experiments, we used the Microsoft speech corpus for Indian languages (MSC-IL) for training on neutral speech and the Indian Institute of Technology Kharagpur Simulated Emotion Speech Corpus (IITKGP-SESC) for evaluating emotional speech. The basic emotions of anger, happiness, and sadness are considered for evaluation, along with neutral speech. PubDate: 2022-03-01
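A minimal sketch (assumed modification ranges and toolkit) of prosody-based augmentation: shift F0 (pitch) and change the speaking rate of neutral training utterances so the pooled data covers more emotion-like prosodic variation.

```python
import librosa

def prosody_augment(y, sr=16000, n_steps=2.0, rate=1.2):
    pitch_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
    rate_changed = librosa.effects.time_stretch(y, rate=rate)
    return pitch_shifted, rate_changed

# y, sr = librosa.load("neutral_utterance.wav", sr=16000)   # hypothetical file
# augmented = prosody_augment(y, sr, n_steps=-2.0, rate=0.8)
```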
Abstract: In this paper, we propose a method of constructing a language corpus based on imitating the abbreviated and transformed particles that are a distinctive feature of Korean spontaneous spoken language. Since it is not practical to train a spoken-style model on a large number of spoken transcripts, the proposed approach generates spoken-style text from written-style text, such as newspapers, based on the style-dependent pronunciation variations of typical particles. This method for constructing spoken-style text is based on statistical analysis of particles that serve the same function in both written and spoken language. We analyze the grammatical functions and pronunciation features of particles that distinguish written from spoken language, and generate spoken-style text from written-style text by imitating the typical abbreviated and transformed particles that serve the same function. The abbreviated and transformed particles to be imitated have pronunciation features typical of spoken language. We replace particles in the written-style text with abbreviated and transformed particles according to the correspondence between written and spoken particles, which yields spoken-style text. The language model trained on this spoken-style text, which imitates abbreviated and transformed particles, significantly improves the word error rate (WER) on spontaneous speech. PubDate: 2022-03-01
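A minimal sketch (with an illustrative, hypothetical particle mapping and replacement rate) of the imitation step: replace written-style particles with their spoken-style abbreviated or transformed counterparts at rates estimated from corpus statistics.

```python
import random

# hypothetical correspondence table: written particle -> (spoken form, replacement rate)
PARTICLE_MAP = {
    "에게": ("한테", 0.8),   # example pair; the real table comes from corpus statistics
}

def to_spoken_style(tokens):
    out = []
    for tok in tokens:
        if tok in PARTICLE_MAP:
            spoken, p = PARTICLE_MAP[tok]
            out.append(spoken if random.random() < p else tok)
        else:
            out.append(tok)
    return out
```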
Abstract: Automatic speaker verification (ASV) systems have matured to the point that industry is attracted to using them in practical security systems. However, the vulnerability of these systems to various direct and indirect access attacks weakens the ASV authentication mechanism. Increasing research in spoofing and anti-spoofing technologies is contributing to the enhancement of these systems. The objective of this paper is to review and analyze the important advances proposed by different researchers. Various classical (autoregressive, cepstral, etc.) and modern deep-learning-based feature extraction techniques chosen to design the front end of these systems are discussed. The extracted features are learned and classified in the back end of an ASV system, which can use classical machine learning or deep learning models; these are also a main focus of the presented review. Experimental studies have used constantly evolving datasets and evaluation measures to develop robust systems since practical work in this area emerged, and this paper analyses most of the contributing spoofed-speech datasets and evaluation protocols. Speech synthesis (SS), voice conversion (VC), replay, mimicry, and twins are the potential spoofing attacks on ASV systems. This work provides knowledge of the generation techniques behind these attacks to empower the defence mechanisms of ASV. The survey marks the start of a new era in ASV system development and highlights the start of a new generation (G4) of SS attack development methods. Given the rapid advancement of deep learning techniques, the paper aims to give newcomers a complete picture of ASV and also sheds light on some of the spoofing attacks that may be targeted when implementing future ASV systems. PubDate: 2022-03-01
Abstract: Unlike many other languages, Arabic is characterized by a written form that is essentially consonantal and may not include short vowels. One of the major functions of short vowels is to determine and facilitate the meaning of words or sentences. However, MSA texts are generally written without vowels, which gives rise to a great deal of morphological, semantic, and syntactic ambiguity. This ambiguity problem is not only associated with Modern Standard Arabic (MSA) but also with Arabic dialects in general and the Tunisian Dialect (TD) in particular. Compared to MSA, TD suffers from the unavailability of basic tools and linguistic resources, such as a sufficient amount of corpora, multilingual dictionaries, and morphological and syntactic analyzers; this lack makes processing the language a great challenge (Masmoudi et al., 2020). Despite the numerous efforts currently underway, some shortages persist in this field. Hence, we address this gap by presenting work that investigates the automatic diacritization of TD texts. We regard the diacritization problem as a simplified phrase-based SMT (Statistical Machine Translation) task: the source language is the undiacritized text, while the target language is the diacritized text. We first go into the details of the TD corpus creation. This corpus is then validated and used to build a diacritic restoration system for TD, called TDTACHKIL, which achieves a Word Error Rate (WER) of 16.7% and a Diacritic Error Rate (DER) of 8.89%. PubDate: 2022-03-01
Abstract: Recurrent neural networks have met with wide success in different domains due to their high capacity to encode short- and long-term dependencies between the basic features of a sequence. Different RNN units have been proposed to manage these dependencies well, with efficient algorithms requiring few basic operations so as to reduce the processing time needed to learn the model. Among these units, the internal memory gate (IMG) has produced efficient accuracies, faster than LSTM and GRU, on an SLU task. This paper presents the bidirectional internal memory gate recurrent neural network (BIMG), which encodes short- and long-term dependencies in the forward and backward directions. The BIMG is composed of IMG cells built around a single gate managing short- and long-term dependencies, combining the advantages of the LSTM and GRU (short- and long-term dependencies) and of the leaky unit (LU) (fast learning). The effectiveness and robustness of the proposed BIMG-RNN are evaluated on a theme identification task over telephone conversations. The experiments show that BIMG reaches better accuracies than BGRU and BLSTM, with a gain of 1.1, and a gain of 2.1 over the IMG model. Moreover, BIMG requires less processing time than BGRU and BLSTM, with gains of 12% and 35%, respectively. PubDate: 2022-03-01
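A minimal sketch (illustrative equations, not necessarily the exact IMG formulation) of a recurrent cell with a single gate that mixes the previous hidden state with a candidate state, in the spirit of a one-gate unit.

```python
import torch
import torch.nn as nn

class SingleGateCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.gate = nn.Linear(input_size + hidden_size, hidden_size)
        self.cand = nn.Linear(input_size + hidden_size, hidden_size)

    def forward(self, x_t, h_prev):
        z = torch.cat([x_t, h_prev], dim=-1)
        m_t = torch.sigmoid(self.gate(z))             # single memory gate
        h_tilde = torch.tanh(self.cand(z))            # candidate state
        return (1 - m_t) * h_prev + m_t * h_tilde     # gated update

# A bidirectional model runs one such cell forward and one backward over the
# sequence and concatenates the two hidden states, as described for BIMG.
```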
Abstract: End-to-end speech synthesis methods have managed to achieve nearly natural, human-like speech, yet they are prone to synthesis errors such as missing or repeated words and incomplete synthesis. We argue that this is mainly due to the local information preference between the text input and the learned acoustic features of a conditional autoregressive (CAR) model. The local information preference prevents the model from depending on the text input when predicting acoustic features and contributes to synthesis errors at inference time. In this work, we compare two modified architectures based on Tacotron2 for generating Arabic speech. The first architecture replaces the WaveNet vocoder with a flow-based implementation, WaveGlow. The second architecture, influenced by InfoGAN, maximizes the mutual information between the text input and the predicted acoustic features (mel-spectrogram) to eliminate the local information preference. The training objective has also been changed by adding a CTC loss term, which can be considered a measure of the local information preference between the text input and the predicted acoustic features. We carried out the experiments on Nawar Halabi's dataset (http://en.arabicspeechcorpus.com/), which contains about 2.41 h of Arabic speech. Our experiments show that maximizing the mutual information between the predicted acoustic features and the conditioning text input, as well as changing the training objective, can enhance the subjective quality of the generated speech and reduce the utterance error rate. PubDate: 2022-02-08 DOI: 10.1007/s10772-022-09961-0
Abstract: Automatic speaker verification (ASV) has recently achieved great progress. However, the performance of ASV degrades significantly when the test speech is corrupted by interfering speakers, especially when multiple talkers speak at the same time. Although target speech extraction (TSE) has also attracted increasing attention in recent years, its ability is constrained by the requirement for pre-saved anchor speech examples of the target speaker. It is therefore impossible to directly use existing TSE methods to extract the desired test speech in an ASV test trial, because the speaker identity of each test utterance is unknown. Based on the state-of-the-art single-channel speech separation technique Conv-TasNet, this paper designs a test speech extraction mechanism for building short-duration, text-dependent speaker verification systems. Instead of providing a pre-saved anchor speech for each training or test speaker, we extract the desired test speech from a mixture by computing the pairwise dynamic time warping between each output of Conv-TasNet and the enrollment utterance of the speaker model in each test trial of the ASV task. The acoustic domain mismatch between ASV and TSE training data, and the behaviour of speech separation in different stages of ASV system building, such as voiceprint enrollment, testing, and the PLDA backend, are all investigated in detail. Experimental results show that the proposed test speech extraction mechanism brings significant relative improvements (36.3%) in overlapped multi-talker speaker verification; benefits are found not only in the ASV test stage but also in target speaker modeling. PubDate: 2022-01-13 DOI: 10.1007/s10772-022-09959-8
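A minimal sketch (assumed MFCC features and librosa's DTW, not the authors' exact setup) of the selection step: among the separated output streams, keep the one whose feature sequence is closest, under dynamic time warping, to the enrollment utterance.

```python
import numpy as np
import librosa

def mfcc_seq(y, sr=16000):
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)

def pick_target_stream(separated_streams, enroll_wave, sr=16000):
    enroll_feat = mfcc_seq(enroll_wave, sr)
    costs = []
    for stream in separated_streams:                  # outputs of the separator (e.g. Conv-TasNet)
        D, _ = librosa.sequence.dtw(X=enroll_feat, Y=mfcc_seq(stream, sr))
        costs.append(D[-1, -1])                       # accumulated alignment cost
    return separated_streams[int(np.argmin(costs))]
```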
Abstract: A novel scheme for disambiguating conflicting classification results in audio-visual speech recognition applications is proposed in this paper. The classification scheme can be implemented with both generative and discriminative models and can be used with different input modalities, viz. audio only, visual only, and audio-visual information. The proposed scheme consists of the cascade connection of a standard classifier, trained with instances of each particular class, followed by a complementary model trained with instances of all the remaining classes. The performance of the proposed recognition system is evaluated on three publicly available audio-visual datasets, using a generative model, namely a hidden Markov model, and three discriminative techniques, viz. random forests, support vector machines, and adaptive boosting. The experimental results are promising in the sense that, for the three datasets, the different models, and the different input modalities, improvements in the recognition rates are achieved in comparison to other methods reported in the literature on the same datasets. PubDate: 2022-01-07 DOI: 10.1007/s10772-021-09944-7
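A minimal sketch (illustrative decision rule and assumed scikit-learn models, not the paper's HMM/boosting variants) of the cascade idea: a per-class model scores its own class, and a complementary model trained on all remaining classes is consulted to resolve conflicts.

```python
import numpy as np
from sklearn.svm import SVC

def train_cascade(X, y, classes):
    primary, complementary = {}, {}
    for c in classes:
        is_c = (y == c).astype(int)
        primary[c] = SVC(probability=True).fit(X, is_c)          # class c vs rest
        complementary[c] = SVC(probability=True).fit(X[y != c], y[y != c])
    return primary, complementary

def predict_cascade(x, primary, complementary, classes):
    # score = own-class confidence minus the strongest competing-class confidence
    scores = {}
    for c in classes:
        own = primary[c].predict_proba([x])[0][1]
        rival = complementary[c].predict_proba([x])[0].max()
        scores[c] = own - rival
    return max(scores, key=scores.get)
```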
Abstract: This paper proposes information hiding, in the form of steganography and watermarking, within a proposed 10.6 kbps Conjugate Structure-Algebraic Code Excited Linear Prediction (CS-ACELP) speech codec. The proposed work uses the Dither Modulation-Quantization Index Modulation (DM-QIM) technique to incorporate steganography or watermarking for hiding information in the excitation codebook code vector of the proposed 10.6 kbps CS-ACELP speech codec. A codebook partition and label assignment approach is explored in the proposed coder in order to create room for 10 bits per frame of steganographic data transmission. A joint source coding and data hiding approach is adopted for steganographic or watermarked data transmission. The performance of the proposed approach is evaluated with different objective and subjective parameters, presented in tables and graphs. The information-hiding capability is demonstrated with parameters such as the watermark-to-signal ratio, hiding capacity, and embedding capacity in percentage terms. Moreover, the subjective and objective results of the proposed algorithm are analysed by computing the population mean with a 99% confidence interval to demonstrate its consistency. PubDate: 2021-12-03 DOI: 10.1007/s10772-021-09949-2
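A minimal sketch (generic DM-QIM on a scalar value, not the codec-specific code-vector embedding) of dither-modulation quantization index modulation: each payload bit selects a dithered quantizer for one value, and extraction picks the quantizer that best explains the received value.

```python
import numpy as np

def qim_embed(value, bit, step=0.1):
    dither = 0.0 if bit == 0 else step / 2.0
    return step * np.round((value - dither) / step) + dither

def qim_extract(value, step=0.1):
    # re-quantize with both dithers and keep the closer one
    d0 = abs(value - qim_embed(value, 0, step))
    d1 = abs(value - qim_embed(value, 1, step))
    return 0 if d0 <= d1 else 1

# Example: hide one bit in a single (hypothetical) code-vector component
stego = qim_embed(0.4321, bit=1, step=0.1)
print(qim_extract(stego, step=0.1))   # -> 1
```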