Slim Essid's research page

Research interests

Machine Learning, Artificial Intelligence and Signal Processing, especially:

multimodal and multiview learning;
representation learning, in particular, self-supervised learning;
structured prediction;

with applications to:

multimodal language models, especially audio-vision language models;
speech processing, machine listening, music content analysis (MIR);
multimodal perception, social and affective computing;
physiological, especially EEG, data analysis.

For more information about my research activities, see my publications. You can also read about the research projects I have been involved in, including those of the PhD students and post-docs I have advised.

News

Apr. 30th 2026: Our paper on Multiple Choice Learning of Low-Rank Adapters was accepted at ICML 2026.
Mar. 9th 2026: Our PhD student Elio Gruttadauria successfully defended his thesis.
Feb. 20th 2026: Our paper on Training-free Sound Prompted Segmentation was accepted for publication in TMLR.
Feb. 13th 2026: Our PhD student Aurian Quelennec successfully defended his thesis.
Jan. 17th 2026: 2 new papers accepted at ICASSP 2026.
Dec. 12th 2025: Our PhD student Yasser Benigmim successfully defended his thesis.
Aug. 20th 2025: 1 paper accepted at EMNLP 2025.
Jul. 7th 2025: 1 paper accepted at WASPAA 2025.
Jun. 1st 2025: 1 paper accepted at Interspeech 2025.
Feb. 2nd 2025: Our PhD student David Perera successfully defended his thesis.
Dec. 12th 2024: 5 papers accepted at ICASSP 2025.
Nov. 6th 2024: Our PhD student Morgan Buisson successfully defended his thesis.
Sep. 25th 2024: 2 papers accepted at NeurIPS 2024.

Short bio

Slim Essid is an Applied Research Manager at NVIDIA, which he joined in June 2025. Previously, he was a Full Professor at Télécom Paris and the coordinator of the Audio Data Analysis and Signal Processing (ADASP) group. He received the state engineering degree from the École Nationale d’Ingénieurs de Tunis in 2001; the M.Sc. (D.E.A.) degree in digital communication systems from the École Nationale Supérieure des Télécommunications, Paris, France, in 2002; the Ph.D. degree from the Université Pierre et Marie Curie (UPMC) in 2005; and the habilitation (HDR) degree from UPMC in 2015.

Over the past 20 years, he has been involved in various French and European collaborative research projects. He has collaborated with 18 post-docs and research engineers and has graduated 22 PhD students; he is currently co-advising 5 others. He has published over 150 peer-reviewed conference and journal papers with more than 100 distinct co-authors. He regularly serves as a reviewer for machine learning, signal processing, audio and multimedia conferences and journals, including several IEEE Transactions, and as an expert for research funding agencies.

Selected recent publications

MULTIPLE CHOICE LEARNING OF LOW-RANK ADAPTERS FOR LANGUAGE MODELING

V. Letzelter, H. Malard, M. Fontaine, G. Richard, S. Essid, A. Bursuc, and P. Pérez

In Forty-third International Conference on Machine Learning (ICML), 2026

We propose LoRA-MCL, a training scheme that extends next-token prediction in language models with a method designed to decode diverse, plausible sentence continuations at inference time. Traditional language modeling is an intrinsically ill-posed problem: given a context, multiple “futures” may be equally plausible. Our approach leverages Multiple Choice Learning (MCL) and the Winner-Takes-All loss to efficiently handle ambiguity through Low-Rank Adaptation. We provide a theoretical interpretation of applying MCL to language modeling, assuming the data is generated from a mixture of distributions. We illustrate the proposed approach using mixtures of Markov chains. We then demonstrate with experiments on visual and audio captioning, as well as machine translation, that our method achieves high diversity and relevance in generated outputs.
@inproceedings{VL:ICML-26,
  title = {Multiple Choice Learning of Low-Rank Adapters for Language Modeling},
  author = {Letzelter, Victor and Malard, Hugo and Fontaine, Mathieu and Richard, Gaël and Essid, Slim and Bursuc, Andrei and Pérez, Patrick},
  booktitle = {Forty-third International Conference on Machine Learning (ICML)},
  year = {2026},
  url = {https://openreview.net/forum?id=CCO35e4DCO},
}

TACO: TRAINING-FREE SOUND PROMPTED SEGMENTATION VIA SEMANTICALLY CONSTRAINED AUDIO-VISUAL CO-FACTORIZATION

H. Malard, M. Olvera, S. Lathuilière, and S. Essid

Transactions on Machine Learning Research, 2026

Large-scale pre-trained audio and image models demonstrate an unprecedented degree of generalization, making them suitable for a wide range of applications. Here, we tackle the specific task of sound-prompted segmentation, aiming to segment image regions corresponding to objects heard in an audio signal. Most existing approaches tackle this problem by fine-tuning pre-trained models or by training additional modules specifically for the task. We adopt a different strategy: we introduce a training-free approach that leverages Non-negative Matrix Factorization (NMF) to co-factorize audio and visual features from pre-trained models so as to reveal shared interpretable concepts. These concepts are passed on to an open-vocabulary segmentation model for precise segmentation maps. By using frozen pre-trained models, our method achieves high generalization and establishes state-of-the-art performance in unsupervised sound-prompted segmentation, significantly surpassing previous unsupervised methods.
@article{HM:TMLR:2026,
  title = {TACO: Training-free Sound Prompted Segmentation via Semantically Constrained Audio-visual CO-factorization},
  journal = {Transactions on Machine Learning Research},
  year = {2026},
  author = {Malard, Hugo and Olvera, Michel and Lathuilière, Stéphane and Essid, Slim},
}

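The TACO abstract above describes co-factorizing audio and visual features with NMF. As a rough illustration only, and not the paper's algorithm, the sketch below jointly factorizes two non-negative matrices that share their column axis, using the standard multiplicative updates for the Frobenius objective. The function name `co_factorize`, the matrix shapes, and the choice of a single shared activation matrix `h` are assumptions made for this example.

```python
import numpy as np


def co_factorize(x_a, x_v, rank, n_iter=200, eps=1e-9, seed=0):
    """Jointly factorize x_a ~ w_a @ h and x_v ~ w_v @ h (all non-negative).

    x_a: (d_a, n) and x_v: (d_v, n) are hypothetical non-negative feature
    matrices with a common column axis. The shared factor h has shape (rank, n).
    """
    rng = np.random.default_rng(seed)
    # Stacking along the row axis turns co-factorization with a shared h
    # into a single NMF problem on the stacked matrix.
    x = np.vstack([x_a, x_v])
    w = rng.random((x.shape[0], rank)) + eps
    h = rng.random((rank, x.shape[1])) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep both factors non-negative.
        h *= (w.T @ x) / (w.T @ w @ h + eps)
        w *= (x @ h.T) / (w @ h @ h.T + eps)
    d_a = x_a.shape[0]
    return w[:d_a], w[d_a:], h
```

In this toy setting, each row of `h` plays the role of a concept that is active in both modalities, while `w_a` and `w_v` hold its modality-specific patterns.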
TINYMU: A COMPACT AUDIO-LANGUAGE MODEL FOR MUSIC UNDERSTANDING

X. Li, A. Quelennec, and S. Essid

In ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2026

Music understanding and reasoning are central challenges in the Music Information Research field, with applications ranging from retrieval and recommendation to music agents and virtual assistants. Recent Large Audio-Language Models (LALMs) have shown remarkable progress in answering music-related questions by following user instructions. However, their massive scale, often billions of parameters, results in expensive training, slow inference, and limited deployability on edge devices. In this work, we present TinyMU, a lightweight (229M) Music-Language Model (MLM) that achieves performance comparable to much larger LALMs while remaining efficient and compact. To train TinyMU, we introduce MusicSkills-3.5M, a carefully curated, music-grounded question-answering dataset with 3.5M samples. Spanning multiple-choice, binary, and open-ended formats, this dataset provides fine-grained supervision across diverse musical concepts. For its architecture, TinyMU leverages MATPAC++, the SOTA self-supervised audio encoder for fine-grained feature extraction. Paired with a lightweight linear projector, it efficiently aligns audio embeddings with the language model. Through extensive evaluation, we show that TinyMU performs strongly in both basic music understanding and complex reasoning. Notably, on the MuChoMusic benchmark, it achieves 82% of SOTA LALM’s performance despite being 35× smaller, highlighting the potential of small MLMs under constrained computational budgets.
@inproceedings{MA-ICASSP-26,
  title = {{TINYMU: A Compact Audio-Language Model for Music Understanding}},
  author = {Li, Xiquan and Quelennec, Aurian and Essid, Slim},
  booktitle = {ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  address = {Barcelona, Spain},
  year = {2026},
  month = may
}

MATPAC++: ENHANCED MASKED LATENT PREDICTION FOR SELF-SUPERVISED AUDIO REPRESENTATION LEARNING

A. Quelennec, P. Chouteau, G. Peeters, and S. Essid

May 2025

Pre-print

Masked latent prediction has emerged as a leading paradigm in self-supervised learning (SSL), especially for general audio and music representation learning. While recent methods have demonstrated strong performance, the role of the predictor module used at the output of such SSL systems remains mainly overlooked, despite being crucial for solving the pretext task at hand. In particular, this module should be able to deal with the ambiguity inherent in audio content, especially when it is composed of multiple sound sources. This work proposes a novel enhancement: integrating Multiple Choice Learning (MCL) to explicitly model prediction ambiguity and improve representation quality. We build on top of the recently proposed MATPAC system, improving its prediction and unsupervised classification pretext tasks with MCL. We extensively evaluate our method, MATPAC++, through both linear probing across multiple downstream tasks and fine-tuning on AudioSet, employing a unified protocol that enables rigorous and fair comparisons with state-of-the-art SSL approaches. Results show that our proposal achieves state-of-the-art when fine-tuned on AudioSet and overall state-of-the-art scores on downstream tasks. Additionally, we examine domain specialisation by training exclusively on music data, where our model achieves state-of-the-art performance with significantly improved efficiency.
@misc{quelennec2025matpacenhancedmaskedlatent,
  title = {MATPAC++: Enhanced Masked Latent Prediction for Self-Supervised Audio Representation Learning},
  author = {Quelennec, Aurian and Chouteau, Pierre and Peeters, Geoffroy and Essid, Slim},
  year = {2025},
  eprint = {2508.12709},
  archiveprefix = {arXiv},
  primaryclass = {cs.SD},
  note = {Pre-print}
}

IKNOW-AUDIO: INTEGRATING KNOWLEDGE GRAPHS WITH AUDIO-LANGUAGE MODELS

M. Olvera, C. Wang, P. Stamatiadis, G. Richard, and S. Essid

In The 2025 Conference on Empirical Methods in Natural Language Processing, May 2025

Contrastive Language–Audio Pretraining (CLAP) models learn by aligning audio and text in a shared embedding space, enabling powerful zero-shot recognition. However, their performance is highly sensitive to prompt formulation and language nuances, and they often inherit semantic ambiguities and spurious correlations from noisy pretraining data. While prior work has explored prompt engineering, adapters, and prefix tuning to address these limitations, the use of structured prior knowledge remains largely unexplored. We present iKnow-audio, a framework that integrates knowledge graphs with audio-language models to provide robust semantic grounding. iKnow-audio builds on the Audio-centric Knowledge Graph (AKG), which encodes ontological relations comprising semantic, causal, and taxonomic connections reflective of everyday sound scenes and events. By training knowledge graph embedding models on the AKG and refining CLAP predictions through this structured knowledge, iKnow-audio improves disambiguation of acoustically similar sounds and reduces reliance on prompt engineering. Comprehensive zero-shot evaluations across six benchmark datasets demonstrate consistent gains over baseline CLAP, supported by embedding-space analyses that highlight improved relational grounding.
@inproceedings{olvera2025iknowaudio,
  title = {iKnow-audio: Integrating Knowledge Graphs with Audio-Language Models},
  author = {Olvera, Michel and Wang, Changhong and Stamatiadis, Paraskevas and Richard, Gaël and Essid, Slim},
  booktitle = {The 2025 Conference on Empirical Methods in Natural Language Processing},
  year = {2025},
}

ANNEALED MULTIPLE CHOICE LEARNING: OVERCOMING LIMITATIONS OF WINNER-TAKES-ALL WITH ANNEALING

D. Perera, V. Letzelter, T. Mariotte, A. Cortés, M. Chen, S. Essid, and G. Richard

In Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS 2024), Dec 2024

We introduce Annealed Multiple Choice Learning (aMCL), which combines simulated annealing with MCL. MCL is a learning framework handling ambiguous tasks by predicting a small set of plausible hypotheses. These hypotheses are trained using the Winner-takes-all (WTA) scheme, which promotes the diversity of the predictions. However, this scheme may converge toward an arbitrarily suboptimal local minimum, due to the greedy nature of WTA. We overcome this limitation using annealing, which enhances the exploration of the hypothesis space during training. We leverage insights from statistical physics and information theory to provide a detailed description of the model training trajectory. Additionally, we validate our algorithm by extensive experiments on synthetic datasets, on the standard UCI benchmark, and on speech separation.
@inproceedings{DP:NeurIPS-24,
  title = {Annealed Multiple Choice Learning: Overcoming Limitations of Winner-Takes-All with Annealing},
  author = {Perera, David and Letzelter, Victor and Mariotte, Théo and Cortés, Adrien and Chen, Mickael and Essid, Slim and Richard, Gaël},
  booktitle = {Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS 2024)},
  address = {Vancouver, Canada},
  year = {2024},
  month = dec,
}

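Several abstracts on this page rely on the Winner-Takes-All (WTA) scheme, in which only the best of K hypotheses is scored against each target. The following is a minimal sketch of that loss, assuming a squared-error distance and plain NumPy arrays; the function name `wta_loss` and the array shapes are chosen for illustration and are not taken from the papers' code.

```python
import numpy as np


def wta_loss(hypotheses, targets):
    """Winner-Takes-All loss under an (assumed) squared-error distance.

    hypotheses: array of shape (batch, K, dim), K predictions per input.
    targets:    array of shape (batch, dim).
    Returns the mean error of the winning hypotheses and the winner indices.
    """
    # Squared error of every hypothesis against its target: shape (batch, K).
    errors = np.sum((hypotheses - targets[:, None, :]) ** 2, axis=-1)
    # The winner is the hypothesis closest to the target.
    winners = np.argmin(errors, axis=1)
    # Only the winning hypothesis contributes to the loss for each sample.
    loss = errors[np.arange(len(targets)), winners].mean()
    return loss, winners
```

In a gradient-based training loop, only the winning hypothesis would receive a gradient for each sample, which is what encourages the hypotheses to specialize and is also the source of the greediness that the annealed variant aims to address.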
AN EYE FOR AN EAR: ZERO-SHOT AUDIO DESCRIPTION LEVERAGING AN IMAGE CAPTIONER WITH AUDIO-VISUAL TOKEN DISTRIBUTION MATCHING

H. Malard, M. Olvera, S. Lathuilière, and S. Essid

In Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS 2024), Dec 2024

Multimodal large language models have fueled progress in image captioning. These models, fine-tuned on vast image datasets, exhibit a deep understanding of semantic concepts. In this work, we show that this ability can be re-purposed for audio captioning, where the joint image-language decoder can be leveraged to describe auditory content associated with image sequences within videos featuring audiovisual content. This can be achieved via multimodal alignment. Yet, this multimodal alignment task is non-trivial due to the inherent disparity between audible and visible elements in real-world videos. Moreover, multimodal representation learning often relies on contrastive learning, facing the challenge of the so-called modality gap which hinders smooth integration between modalities. In this work, we introduce a novel methodology for bridging the audiovisual modality gap by matching the distributions of tokens produced by an audio backbone and those of an image captioner. Our approach aligns the audio token distribution with that of the image tokens, enabling the model to perform zero-shot audio captioning in an unsupervised fashion. This alignment allows for the use of either audio or audiovisual input by combining or substituting the image encoder with the aligned audio encoder. Our method achieves significantly improved performance in zero-shot audio captioning, compared to existing approaches.
@inproceedings{HM:NeurIPS-24,
  title = {An Eye for an Ear: Zero-Shot Audio Description Leveraging an Image Captioner with Audio-Visual Token Distribution Matching},
  author = {Malard, Hugo and Olvera, Michel and Lathuilière, Stéphane and Essid, Slim},
  booktitle = {Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS 2024)},
  address = {Vancouver, Canada},
  year = {2024},
  month = dec,
}

SPEECH SELF-SUPERVISED REPRESENTATIONS BENCHMARKING: A CASE FOR LARGER PROBING HEADS

S. Zaiem, Y. Kemiche, T. Parcollet, S. Essid, and M. Ravanelli

Computer Speech & Language, Dec 2024

Self-supervised learning (SSL) leverages large datasets of unlabeled speech to reach impressive performance with reduced amounts of annotated data. The high number of proposed approaches fostered the emergence of comprehensive benchmarks that evaluate their performance on a set of downstream tasks exploring various aspects of the speech signal. However, while the number of considered tasks has been growing, most proposals rely upon a single downstream architecture that maps the frozen SSL representations to the task labels. This study examines how benchmarking results are affected by changes in the probing head architecture. Interestingly, we found that altering the downstream architecture structure leads to significant fluctuations in the performance ranking of the evaluated models. Against common practices in speech SSL benchmarking, we evaluate larger-capacity probing heads, showing their impact on performance, inference costs, generalization, and multi-level feature exploitation.
@article{ZAIEM2025101695,
  title = {Speech self-supervised representations benchmarking: A case for larger probing heads},
  journal = {Computer Speech \& Language},
  volume = {89},
  pages = {101695},
  year = {2024},
  issn = {0885-2308},
  doi = {10.1016/j.csl.2024.101695},
  url = {https://www.sciencedirect.com/science/article/pii/S0885230824000780},
  author = {Zaiem, Salah and Kemiche, Youcef and Parcollet, Titouan and Essid, Slim and Ravanelli, Mirco},
  keywords = {Self-supervised learning, Speech processing, Representation learning},
}

WINNER-TAKES-ALL LEARNERS ARE GEOMETRY-AWARE CONDITIONAL DENSITY ESTIMATORS

V. Letzelter, D. Perera, C. Rommel, M. Fontaine, S. Essid, G. Richard, and P. Pérez

In International Conference on Machine Learning (ICML 2024), Jul 2024

Winner-takes-all training is a simple learning paradigm, in which the multiple predictions of so-called hypotheses are leveraged to tackle ambiguous tasks. Recently, a connection was established between winner-takes-all training and centroidal Voronoi tessellations, showing that, once trained, the hypotheses should quantize optimally the shape of the conditional distribution to predict. However, probabilistic reliability guarantees for the predictions are missing. In this work, we show how to take advantage of the appealing geometrical properties of the winner-takes-all learners for conditional density estimation, without modifying its original training scheme. We then discuss the competitiveness of our estimator based on novel theoretical and experimental results on both synthetic and audio data.
@inproceedings{Letzelter_Perera_Rommel_Fontaine_Essid_Richard_Pérez_2024,
  title = {Winner-takes-all learners are geometry-aware conditional density estimators},
  url = {https://hal.science/hal-04574640/},
  booktitle = {International Conference on Machine Learning (ICML 2024)},
  author = {Letzelter, Victor and Perera, David and Rommel, Cédric and Fontaine, Mathieu and Essid, Slim and Richard, Gaël and Pérez, Patrick},
  address = {Vienna, Austria},
  month = jul,
  year = {2024}
}

COLLABORATING FOUNDATION MODELS FOR DOMAIN GENERALIZED SEMANTIC SEGMENTATION

Y. Benigmim, S. Roy, S. Essid, V. Kalogeiton, and S. Lathuilière

In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Jul 2024

Domain Generalized Semantic Segmentation (DGSS) deals with training a model on a labeled source domain with the aim of generalizing to unseen domains during inference. Existing DGSS methods typically effectuate robust features by means of Domain Randomization (DR). Such an approach is often limited as it can only account for style diversification and not content. In this work, we take an orthogonal approach to DGSS and propose to use an assembly of CoLlaborative FOUndation models for Domain Generalized Semantic Segmentation (CLOUDS). In detail, CLOUDS is a framework that integrates FMs of various kinds: (i) CLIP backbone for its robust feature representation, (ii) generative models to diversify the content, thereby covering various modes of the possible target distribution, and (iii) Segment Anything Model (SAM) for iteratively refining the predictions of the segmentation model. Extensive experiments show that our CLOUDS excels in adapting from synthetic to real DGSS benchmarks and under varying weather conditions, notably outperforming prior methods by 5.6% and 6.7% on averaged mIoU, respectively.
@inproceedings{benigmim2023collaborating,
  title = {Collaborating Foundation models for Domain Generalized Semantic Segmentation},
  author = {Benigmim, Yasser and Roy, Subhankar and Essid, Slim and Kalogeiton, Vicky and Lathuilière, Stéphane},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024)},
  year = {2024},
  eprint = {2312.09788},
  archiveprefix = {arXiv},
  primaryclass = {cs.CV},
}

SELF-SUPERVISED LEARNING OF MULTI-LEVEL AUDIO REPRESENTATIONS FOR MUSIC SEGMENTATION

M. Buisson, B. McFee, S. Essid, and H. Crayencour

IEEE/ACM Transactions on Audio, Speech, and Language Processing, Mar 2024

The task of music structure analysis refers to automatically identifying the location and the nature of musical sections within a song. In the supervised scenario, structural annotations generally result from exhaustive data collection processes, which represents one of the main challenges of this task. Moreover, both the subjectivity of music structure and the hierarchical characteristics it exhibits make the obtained structural annotations not fully reliable, in the sense that they do not convey a "universal ground-truth" unlike other tasks in music information retrieval. On the other hand, the quickly growing quantity of available music data has enabled weakly supervised and self-supervised approaches to achieve impressive results on a wide range of music-related problems. In this work, a self-supervised learning method is proposed to learn robust multi-level music representations prior to structural segmentation using contrastive learning. To this end, sets of frames sampled at different levels of detail are used to train a deep neural network in a disentangled manner. The proposed method is evaluated on both flat and multi-level segmentation. We show that each distinct sub-region of the output embeddings can efficiently account for structural similarity at their own targeted level of detail, which ultimately improves performance of downstream flat and multi-level segmentation. Finally, complementary experiments are carried out to study how the obtained representations can be further adapted to specific datasets using a supervised fine-tuning objective in order to facilitate structure retrieval in domains where human annotations remain scarce.
@article{buisson:hal-04485065,
  title = {{Self-Supervised Learning of Multi-level Audio Representations for Music Segmentation}},
  author = {Buisson, Morgan and McFee, Brian and Essid, Slim and Crayencour, Helene-Camille},
  journal = {IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  url = {https://hal.science/hal-04485065},
  year = {2024},
  month = mar,
  keywords = {Music;Annotations;Task analysis;Training;Feature extraction;Self-supervised learning;Artificial neural networks;Music structure analysis;structural segmentation;representation learning},
  hal_version = {v1},
  volume = {32},
  pages = {2141--2152},
  doi = {10.1109/TASLP.2024.3379894},
}

RESILIENT MULTIPLE CHOICE LEARNING: A LEARNED SCORING SCHEME WITH APPLICATION TO AUDIO SCENE ANALYSIS

V. Letzelter, M. Fontaine, P. Pérez, G. Richard, S. Essid, and M. Chen

In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS 2023), Dec 2023

We introduce Resilient Multiple Choice Learning (rMCL), an extension of the MCL approach for conditional distribution estimation in regression settings where multiple targets may be sampled for each training input. Multiple Choice Learning is a simple framework to tackle multimodal density estimation, using the Winner-Takes-All (WTA) loss for a set of hypotheses. In regression settings, the existing MCL variants focus on merging the hypotheses, thereby eventually sacrificing the diversity of the predictions. In contrast, our method relies on a novel learned scoring scheme underpinned by a mathematical framework based on Voronoi tessellations of the output space, from which we can derive a probabilistic interpretation. After empirically validating rMCL with experiments on synthetic data, we further assess its merits on the sound source localization problem, demonstrating its practical usefulness and the relevance of its interpretation.
@inproceedings{letzelter:hal-04216055,
  title = {Resilient Multiple Choice Learning: A learned scoring scheme with application to audio scene analysis},
  author = {Letzelter, Victor and Fontaine, Mathieu and Pérez, Patrick and Richard, Gaël and Essid, Slim and Chen, Mickael},
  url = {https://hal.science/hal-04216055},
  booktitle = {Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS 2023)},
  address = {New Orleans, United States},
  year = {2023},
  month = dec,
  hal_id = {hal-04216055},
  hal_version = {v1},
}

PRETEXT TASKS SELECTION FOR MULTITASK SELF-SUPERVISED AUDIO REPRESENTATION LEARNING

S. Zaiem, T. Parcollet, S. Essid, and A. Heba

IEEE Journal of Selected Topics in Signal Processing, Dec 2022

Through solving pretext tasks, self-supervised learning leverages unlabeled data to extract useful latent representations replacing traditional input features in the downstream task. In audio/speech signal processing, a wide range of features were engineered through decades of research efforts. As it turns out, learning to predict such features (a.k.a. pseudo-labels) has proven to be a particularly relevant pretext task, leading to useful self-supervised representations which prove to be effective for downstream tasks. However, methods and common practices for combining such pretext tasks for better performance on the downstream task have not been explored and understood properly. In fact, the process relies almost exclusively on a computationally heavy experimental procedure, which becomes intractable with the increase of the number of pretext tasks. This paper introduces a method to select a group of pretext tasks among a set of candidates. The method we propose estimates calibrated weights for the partial losses corresponding to the considered pretext tasks during the self-supervised training process. The experiments conducted on automatic speech recognition, speaker and emotion recognition validate our approach, as the groups selected and weighted with our method perform better than classic baselines, thus facilitating the selection and combination of relevant pseudo-labels for self-supervised representation learning.
@article{9846981,
  author = {Zaiem, Salah and Parcollet, Titouan and Essid, Slim and Heba, Abdelwahab},
  journal = {IEEE Journal of Selected Topics in Signal Processing},
  title = {Pretext Tasks Selection for Multitask Self-Supervised Audio Representation Learning},
  year = {2022},
  volume = {16},
  number = {6},
  pages = {1439--1453},
  doi = {10.1109/JSTSP.2022.3195430},
}

DNN-BASED MASK ESTIMATION FOR DISTRIBUTED SPEECH ENHANCEMENT IN SPATIALLY UNCONSTRAINED MICROPHONE ARRAYS

N. Furnon, R. Serizel, S. Essid, and I. Illina

IEEE/ACM Transactions on Audio, Speech, and Language Processing, Dec 2021

Deep neural network (DNN)-based speech enhancement algorithms in microphone arrays have now proven to be efficient solutions to speech understanding and speech recognition in noisy environments. However, in the context of ad-hoc microphone arrays, many challenges remain and raise the need for distributed processing. In this paper, we propose to extend a previously introduced distributed DNN-based time-frequency mask estimation scheme that can efficiently use spatial information in the form of so-called compressed signals which are pre-filtered target estimations. We study the performance of this algorithm named Tango under realistic acoustic conditions and investigate practical aspects of its optimal application. We show that the nodes in the microphone array cooperate by taking advantage of their spatial coverage in the room. We also propose to use the compressed signals not only to convey the target estimation but also the noise estimation in order to exploit the acoustic diversity recorded throughout the microphone array.
@article{furnon:hal-02985867,
  title = {DNN-based mask estimation for distributed speech enhancement in spatially unconstrained microphone arrays},
  author = {Furnon, Nicolas and Serizel, Romain and Essid, Slim and Illina, Irina},
  url = {https://hal.archives-ouvertes.fr/hal-02985867},
  journal = {IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  volume = {29},
  pages = {2310--2323},
  year = {2021},
  doi = {10.1109/TASLP.2021.3092838},
  hal_id = {hal-02985867},
  hal_version = {v3},
}

WEAKLY SUPERVISED REPRESENTATION LEARNING FOR AUDIO-VISUAL SCENE ANALYSIS

S. Parekh, S. Essid, A. Ozerov, N. Duong, P. Pérez, and G. Richard

IEEE/ACM Transactions on Audio, Speech, and Language Processing, Dec 2019

Audiovisual (AV) representation learning is an important task from the perspective of designing machines with the ability to understand complex events. To this end, we propose a novel multimodal framework that instantiates multiple instance learning. Specifically, we develop methods that identify events and localize corresponding AV cues in unconstrained videos. Importantly, this is done using weak labels where only video-level event labels are known without any information about their location in time. We show that the learnt representations are useful for performing several tasks such as event/object classification, audio event detection, audio source separation and visual object localization. An important feature of our method is its capacity to learn from unsynchronized audiovisual events. We also demonstrate our framework’s ability to separate out the audio source of interest through a novel use of nonnegative matrix factorization. State-of-the-art classification results, with an F1-score of 65.0, are achieved on DCASE 2017 smart cars challenge data with promising generalization to diverse object types such as musical instruments. Visualizations of localized visual regions and audio segments substantiate our system’s efficacy, especially when dealing with noisy situations where modality-specific cues appear asynchronously.
@article{8926380,
  author = {Parekh, S. and Essid, Slim and Ozerov, A. and Duong, N. Q. K. and Pérez, P. and Richard, G.},
  journal = {IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  title = {Weakly Supervised Representation Learning for Audio-Visual Scene Analysis},
  year = {2019},
  volume = {28},
  pages = {416--428},
}