Machine Learning, Artificial Intelligence and Signal Processing, especially:
multimodal and multiview learning;
representation learning, in particular self-supervised learning;
structured prediction;
with applications to:
multimodal language models, especially audio-vision language models;
speech processing, machine listening, music content analysis (MIR);
multimodal perception, social and affective computing;
physiological data analysis, especially EEG.
For more information about my research activities, check my publications. You can also read about the research projects I have been involved in, including those of the PhD students and post-docs I have advised.
News
Feb. 2nd 2025: Our PhD student David Perera successfully defended his thesis.
Nov. 6th 2024: Our PhD student Morgan Buisson successfully defended his thesis.
Sep. 25th 2024: 2 papers accepted at NeurIPS 2024.
Short bio
Slim Essid is a Full Professor at Télécom Paris and the coordinator of the Audio Data Analysis and Signal Processing (ADASP) group. He received the state engineering degree from the École Nationale d’Ingénieurs de Tunis in 2001; the M.Sc. (D.E.A.) degree in digital communication systems from the École Nationale Supérieure des Télécommunications, Paris, France, in 2002; the Ph.D. degree from the Université Pierre et Marie Curie (UPMC) in 2005; and the habilitation (HDR) degree from UPMC in 2015.
Over the past 15 years, he has been involved in various French and European research projects. He has collaborated with 14 post-docs and graduated 15 PhD students, and is currently co-advising 10 others. He has published over 150 peer-reviewed conference and journal papers with more than 100 distinct co-authors. He regularly serves as a reviewer for machine learning, signal processing, audio, and multimedia conferences and journals, including various IEEE transactions, and as an expert for research funding agencies.
Selected publications
TACO: TRAINING-FREE SOUND PROMPTED SEGMENTATION VIA SEMANTICALLY CONSTRAINED AUDIO-VISUAL CO-FACTORIZATION
H. Malard, M. Olvera, S. Lathuilière, and S. Essid
Pre-print (arXiv:2412.01488), 2025
Large-scale pre-trained audio and image models demonstrate an unprecedented degree of generalization, making them suitable for a wide range of applications. Here, we tackle the specific task of sound-prompted segmentation, aiming to segment image regions corresponding to objects heard in an audio signal. Most existing approaches tackle this problem by fine-tuning pre-trained models or by training additional modules specifically for the task. We adopt a different strategy: we introduce a training-free approach that leverages Non-negative Matrix Factorization (NMF) to co-factorize audio and visual features from pre-trained models so as to reveal shared interpretable concepts. These concepts are passed to an open-vocabulary segmentation model to produce precise segmentation maps. By using frozen pre-trained models, our method achieves high generalization and establishes state-of-the-art performance in unsupervised sound-prompted segmentation, significantly surpassing previous unsupervised methods.
@misc{malard2025tacotrainingfreesoundprompted,
  title = {TACO: Training-free Sound Prompted Segmentation via Semantically Constrained Audio-visual CO-factorization},
  author = {Malard, Hugo and Olvera, Michel and Lathuilière, Stéphane and Essid, Slim},
  year = {2025},
  eprint = {2412.01488},
  archiveprefix = {arXiv},
  primaryclass = {eess.AS},
  note = {Pre-print}
}
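To give a flavor of the co-factorization idea, here is a minimal NumPy sketch of jointly factorizing (non-negative) audio and visual feature matrices with a shared activation matrix; it only illustrates the general NMF mechanism, not the exact formulation or constraints used in TACO, and all dimensions are hypothetical.

# Minimal sketch of audio-visual co-factorization with shared activations,
# assuming non-negative feature matrices (e.g. taken after a ReLU); this is an
# illustration of the general NMF idea, not the exact TACO formulation.
import numpy as np

def co_nmf(A, V, k=16, n_iter=200, eps=1e-9, seed=0):
    """Jointly factorize audio features A (d_a x n) and visual features
    V (d_v x n) as A ~ Wa @ H and V ~ Wv @ H, with a shared non-negative
    activation matrix H (k x n) encoding common 'concepts'."""
    rng = np.random.default_rng(seed)
    d_a, n = A.shape
    d_v, _ = V.shape
    Wa = rng.random((d_a, k)) + eps
    Wv = rng.random((d_v, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # Multiplicative updates minimizing the summed Frobenius reconstruction errors.
        Wa *= (A @ H.T) / (Wa @ H @ H.T + eps)
        Wv *= (V @ H.T) / (Wv @ H @ H.T + eps)
        H *= (Wa.T @ A + Wv.T @ V) / (Wa.T @ Wa @ H + Wv.T @ Wv @ H + eps)
    return Wa, Wv, H

# Toy usage with random "features" standing in for pre-trained embeddings.
A = np.abs(np.random.randn(128, 50))   # hypothetical audio features
V = np.abs(np.random.randn(256, 50))   # hypothetical visual features
Wa, Wv, H = co_nmf(A, V, k=8)
print(H.shape)                         # shared concept activations: (8, 50)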
ANNEALED MULTIPLE CHOICE LEARNING: OVERCOMING LIMITATIONS OF WINNER-TAKES-ALL WITH ANNEALING
D. Perera, V. Letzelter, T. Mariotte, A. Cortes, G. Richard, S. Essid, and M. Chen
In Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS 2024), Dec 2024
We introduce Annealed Multiple Choice Learning (aMCL), which combines simulated annealing with MCL. MCL is a learning framework handling ambiguous tasks by predicting a small set of plausible hypotheses. These hypotheses are trained using the Winner-takes-all (WTA) scheme, which promotes the diversity of the predictions. However, this scheme may converge toward an arbitrarily suboptimal local minimum, due to the greedy nature of WTA. We overcome this limitation using annealing, which enhances the exploration of the hypothesis space during training. We leverage insights from statistical physics and information theory to provide a detailed description of the model training trajectory. Additionally, we validate our algorithm through extensive experiments on synthetic datasets, on the standard UCI benchmark, and on speech separation.
@inproceedings{DP:NeurIPS-24,
  title = {Annealed Multiple Choice Learning: Overcoming Limitations of Winner-Takes-All with Annealing},
  author = {Perera, David and Letzelter, Victor and Mariotte, Theo and Cortes, Adrien and Richard, Gael and Essid, Slim and Chen, Mickael},
  booktitle = {Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS 2024)},
  address = {Vancouver, Canada},
  year = {2024},
  month = dec
}
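The sketch below (PyTorch) illustrates the general idea of softening the hard winner-takes-all assignment with a temperature that is annealed during training; the actual aMCL loss and annealing schedule may differ, and all values are hypothetical.

# Illustrative sketch of softening the winner-takes-all assignment with a
# temperature annealed over training, in the spirit of aMCL; the paper's exact
# loss and schedule may differ.
import torch

def annealed_wta_loss(hyps, target, temperature):
    """hyps: (batch, n_hyps, dim) predictions; target: (batch, dim).
    At high temperature all hypotheses receive gradient (exploration);
    as temperature -> 0 this reduces to the hard WTA loss."""
    sq_err = ((hyps - target.unsqueeze(1)) ** 2).sum(dim=-1)        # (batch, n_hyps)
    weights = torch.softmax(-sq_err.detach() / temperature, dim=1)  # soft assignment
    return (weights * sq_err).sum(dim=1).mean()

# Toy usage with a geometric annealing schedule (hypothetical values).
hyps = torch.randn(32, 5, 2, requires_grad=True)
target = torch.randn(32, 2)
for step in range(3):
    T = 1.0 * (0.99 ** step)
    loss = annealed_wta_loss(hyps, target, T)
    loss.backward()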
AN EYE FOR AN EAR: ZERO-SHOT AUDIO DESCRIPTION LEVERAGING AN IMAGE CAPTIONER WITH AUDIO-VISUAL TOKEN DISTRIBUTION MATCHING
H. Malard, M. Olvera, S. Lathuilière, and S. Essid
In Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS 2024), Dec 2024
Multimodal large language models have fueled progress in image captioning. These models, fine-tuned on vast image datasets, exhibit a deep understanding of semantic concepts. In this work, we show that this ability can be re-purposed for audio captioning, where the joint image-language decoder can be leveraged to describe auditory content associated with image sequences within videos featuring audiovisual content. This can be achieved via multimodal alignment. Yet, this multimodal alignment task is non-trivial due to the inherent disparity between audible and visible elements in real-world videos. Moreover, multimodal representation learning often relies on contrastive learning, facing the challenge of the so-called modality gap, which hinders smooth integration between modalities. In this work, we introduce a novel methodology for bridging the audiovisual modality gap by matching the distributions of tokens produced by an audio backbone and those of an image captioner. Our approach aligns the audio token distribution with that of the image tokens, enabling the model to perform zero-shot audio captioning in an unsupervised fashion. This alignment allows for the use of either audio or audiovisual input, by combining or substituting the image encoder with the aligned audio encoder. Our method achieves significantly improved performance in zero-shot audio captioning compared to existing approaches.
@inproceedings{HM:NeurIPS-24,
  title = {An Eye for an Ear: Zero-Shot Audio Description Leveraging an Image Captioner with Audio-Visual Token Distribution Matching},
  author = {Malard, Hugo and Olvera, Michel and Lathuilière, Stéphane and Essid, Slim},
  booktitle = {Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS 2024)},
  address = {Vancouver, Canada},
  year = {2024},
  month = dec
}
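As a rough illustration of distribution matching between token sets, the snippet below uses an RBF-kernel MMD loss between audio and image tokens; this is only a stand-in for the paper's distribution-matching objective, whose exact form may differ, and the token shapes are hypothetical.

# Minimal sketch of matching audio-token and image-token distributions with an
# RBF-kernel MMD loss; a stand-in for the paper's objective, whose exact form
# may differ.
import torch

def mmd_rbf(x, y, sigma=1.0):
    """x: (n, d) audio tokens, y: (m, d) image tokens. Returns a scalar MMD^2 estimate."""
    def k(a, b):
        d2 = torch.cdist(a, b) ** 2
        return torch.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

# Toy usage: tokens from a (hypothetical) audio backbone and image encoder.
audio_tokens = torch.randn(196, 768, requires_grad=True)
image_tokens = torch.randn(196, 768)
loss = mmd_rbf(audio_tokens, image_tokens)
loss.backward()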
SPEECH SELF-SUPERVISED REPRESENTATIONS BENCHMARKING: A CASE FOR LARGER PROBING HEADS
S. Zaiem, Y. Kemiche, T. Parcollet, S. Essid, and M. Ravanelli
Computer Speech & Language, 2024
Self-supervised learning (SSL) leverages large datasets of unlabeled speech to reach impressive performance with reduced amounts of annotated data. The high number of proposed approaches fostered the emergence of comprehensive benchmarks that evaluate their performance on a set of downstream tasks exploring various aspects of the speech signal. However, while the number of considered tasks has been growing, most proposals rely upon a single downstream architecture that maps the frozen SSL representations to the task labels. This study examines how benchmarking results are affected by changes in the probing head architecture. Interestingly, we found that altering the downstream architecture structure leads to significant fluctuations in the performance ranking of the evaluated models. Against common practices in speech SSL benchmarking, we evaluate larger-capacity probing heads, showing their impact on performance, inference costs, generalization, and multi-level feature exploitation.
@article{ZAIEM2025101695,
  title = {Speech self-supervised representations benchmarking: A case for larger probing heads},
  author = {Zaiem, Salah and Kemiche, Youcef and Parcollet, Titouan and Essid, Slim and Ravanelli, Mirco},
  journal = {Computer Speech & Language},
  volume = {89},
  pages = {101695},
  year = {2024},
  issn = {0885-2308},
  doi = {10.1016/j.csl.2024.101695},
  url = {https://www.sciencedirect.com/science/article/pii/S0885230824000780},
  keywords = {Self-supervised learning, Speech processing, Representation learning}
}
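The sketch below illustrates the probing setup under discussion: frozen SSL features are mapped to task labels either by a linear probe or by a larger-capacity head; the dimensions and the head architecture are hypothetical, not those benchmarked in the paper.

# Sketch of the probing setup: frozen SSL features are mapped to task labels
# either by a linear probe or by a larger-capacity head. Dimensions and the
# exact head architecture are hypothetical.
import torch.nn as nn

feat_dim, n_classes = 768, 10

linear_probe = nn.Linear(feat_dim, n_classes)

larger_head = nn.Sequential(          # higher-capacity probing head
    nn.Linear(feat_dim, 512),
    nn.ReLU(),
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, n_classes),
)
# Only the head is trained; the SSL encoder producing the features stays frozen.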
WINNER-TAKES-ALL LEARNERS ARE GEOMETRY-AWARE CONDITIONAL DENSITY ESTIMATORS
V. Letzelter, D. Perera, C. Rommel, M. Fontaine, S. Essid, G. Richard, and P. Pérez
In International Conference on Machine Learning (ICML 2024), Jul 2024
Winner-takes-all training is a simple learning paradigm in which the multiple predictions of so-called hypotheses are leveraged to tackle ambiguous tasks. Recently, a connection was established between winner-takes-all training and centroidal Voronoi tessellations, showing that, once trained, the hypotheses should optimally quantize the shape of the conditional distribution to be predicted. However, probabilistic reliability guarantees for the predictions are missing. In this work, we show how to take advantage of the appealing geometrical properties of winner-takes-all learners for conditional density estimation, without modifying the original training scheme. We then discuss the competitiveness of our estimator based on novel theoretical and experimental results on both synthetic and audio data.
@inproceedings{Letzelter_Perera_Rommel_Fontaine_Essid_Richard_Pérez_2024,
  title = {Winner-takes-all learners are geometry-aware conditional density estimators},
  author = {Letzelter, Victor and Perera, David and Rommel, Cédric and Fontaine, Mathieu and Essid, Slim and Richard, Gael and Pérez, Patrick},
  booktitle = {International Conference on Machine Learning (ICML 2024)},
  address = {Vienna, Austria},
  url = {https://hal.science/hal-04574640/},
  year = {2024},
  month = jul
}
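As a rough illustration, the snippet below turns a set of hypotheses and their scores into a conditional density estimate via a score-weighted kernel mixture; the paper's geometry-aware, Voronoi-based estimator is more refined, and the bandwidth and shapes here are hypothetical.

# Rough sketch: turning WTA hypotheses and scores into a conditional density
# estimate via a score-weighted kernel mixture centered on the hypotheses.
# The paper's geometry-aware (Voronoi-based) estimator is more refined.
import math
import torch

def conditional_density(y, hyps, scores, bandwidth=0.1):
    """y: (dim,), hyps: (n_hyps, dim), scores: (n_hyps,) summing to 1."""
    d2 = ((hyps - y) ** 2).sum(dim=-1)
    dim = hyps.shape[-1]
    norm = (2 * math.pi * bandwidth ** 2) ** (dim / 2)   # Gaussian kernel normalization
    kernels = torch.exp(-d2 / (2 * bandwidth ** 2)) / norm
    return (scores * kernels).sum()

hyps = torch.randn(5, 2)                       # hypotheses predicted for one input
scores = torch.softmax(torch.randn(5), dim=0)  # their predicted scores
print(conditional_density(torch.zeros(2), hyps, scores))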
COLLABORATING FOUNDATION MODELS FOR DOMAIN GENERALIZED SEMANTIC SEGMENTATION
Y. Benigmim, S. Roy, S. Essid, V. Kalogeiton, and S. Lathuilière
In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Jul 2024
Domain Generalized Semantic Segmentation (DGSS) deals with training a model on a labeled source domain with the aim of generalizing to unseen domains during inference. Existing DGSS methods typically seek to learn robust features by means of Domain Randomization (DR). Such an approach is often limited as it can only account for style diversification, not content. In this work, we take an orthogonal approach to DGSS and propose to use an assembly of CoLlaborative FOUndation models for Domain Generalized Semantic Segmentation (CLOUDS). In detail, CLOUDS is a framework that integrates foundation models of various kinds: (i) a CLIP backbone for its robust feature representation, (ii) generative models to diversify the content, thereby covering various modes of the possible target distribution, and (iii) the Segment Anything Model (SAM) for iteratively refining the predictions of the segmentation model. Extensive experiments show that CLOUDS excels in adapting from synthetic to real DGSS benchmarks and under varying weather conditions, notably outperforming prior methods by 5.6% and 6.7% in averaged mIoU, respectively.
@inproceedings{benigmim2023collaborating,
  title = {Collaborating Foundation Models for Domain Generalized Semantic Segmentation},
  author = {Benigmim, Yasser and Roy, Subhankar and Essid, Slim and Kalogeiton, Vicky and Lathuilière, Stéphane},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024)},
  year = {2024},
  eprint = {2312.09788},
  archiveprefix = {arXiv},
  primaryclass = {cs.CV}
}
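The very schematic sketch below conveys how such foundation models could collaborate in one round of self-training, with placeholder callables standing in for the actual models; no real model APIs are used here, and the actual CLOUDS training procedure is more involved.

# Schematic sketch of one collaboration round, with placeholder callables
# standing in for the actual foundation models (CLIP-backbone segmenter,
# generative model, SAM). Not the real APIs, and not the exact CLOUDS recipe.
from typing import Callable, List

def clouds_style_round(
    images: List,                  # current pool of source + generated images
    segment: Callable,             # CLIP-backbone segmentation model: image -> mask
    generate: Callable,            # generative model: images -> new, content-diverse images
    refine: Callable,              # SAM-style refiner: (image, mask) -> refined mask
    train_step: Callable,          # one self-training step on (images, masks)
) -> List:
    images = images + generate(images)                      # diversify content
    masks = [refine(img, segment(img)) for img in images]   # refine predictions
    train_step(images, masks)                               # update the segmentation model
    return images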
SELF-SUPERVISED LEARNING OF MULTI-LEVEL AUDIO REPRESENTATIONS FOR MUSIC SEGMENTATION
M. Buisson, B. McFee, S. Essid, and H. Crayencour
IEEE/ACM Transactions on Audio, Speech and Language Processing, Mar 2024
The task of music structure analysis refers to automatically identifying the location and the nature of musical sections within a song. In the supervised scenario, structural annotations generally result from exhaustive data collection processes, which represents one of the main challenges of this task. Moreover, both the subjectivity of music structure and the hierarchical characteristics it exhibits make the obtained structural annotations not fully reliable, in the sense that they do not convey a "universal ground truth", unlike other tasks in music information retrieval. On the other hand, the quickly growing quantity of available music data has enabled weakly supervised and self-supervised approaches to achieve impressive results on a wide range of music-related problems. In this work, a self-supervised learning method is proposed to learn robust multi-level music representations prior to structural segmentation using contrastive learning. To this end, sets of frames sampled at different levels of detail are used to train a deep neural network in a disentangled manner. The proposed method is evaluated on both flat and multi-level segmentation. We show that each distinct sub-region of the output embeddings can efficiently account for structural similarity at its own targeted level of detail, which ultimately improves the performance of downstream flat and multi-level segmentation. Finally, complementary experiments are carried out to study how the obtained representations can be further adapted to specific datasets using a supervised fine-tuning objective, in order to facilitate structure retrieval in domains where human annotations remain scarce.
@article{buisson:hal-04485065,
  title = {Self-Supervised Learning of Multi-level Audio Representations for Music Segmentation},
  author = {Buisson, Morgan and McFee, Brian and Essid, Slim and Crayencour, Helene-Camille},
  journal = {IEEE/ACM Transactions on Audio, Speech and Language Processing},
  volume = {32},
  pages = {2141-2152},
  year = {2024},
  month = mar,
  doi = {10.1109/TASLP.2024.3379894},
  url = {https://hal.science/hal-04485065},
  keywords = {Music;Annotations;Task analysis;Training;Feature extraction;Self-supervised learning;Artificial neural networks;Music structure analysis;structural segmentation;representation learning},
  hal_version = {v1}
}
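The sketch below illustrates the general idea of training distinct slices of an embedding with a contrastive (InfoNCE) loss so that each slice captures similarity at its own time scale; the slicing, sampling strategy and loss details are hypothetical, not the paper's exact setup.

# Sketch of training distinct slices of an embedding to capture structural
# similarity at different time scales with a contrastive (InfoNCE) loss.
# The slicing and sampling strategy here are illustrative only.
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.1):
    """anchor, positive: (batch, d) L2-normalized embeddings of positive pairs."""
    logits = anchor @ positive.t() / temperature      # (batch, batch)
    labels = torch.arange(anchor.shape[0])            # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

def multi_level_loss(emb_a, emb_b, levels=((0, 64), (64, 128))):
    """emb_a, emb_b: lists of (batch, 128) embeddings of frame pairs, sampled so
    that the pairs at index i are positives at level i's time scale (fine vs. coarse)."""
    loss = 0.0
    for i, (lo, hi) in enumerate(levels):
        za = F.normalize(emb_a[i][:, lo:hi], dim=-1)   # slice dedicated to level i
        zb = F.normalize(emb_b[i][:, lo:hi], dim=-1)
        loss = loss + info_nce(za, zb)
    return loss

# Toy usage: one batch of positive pairs per level.
emb_a = [torch.randn(16, 128) for _ in range(2)]
emb_b = [torch.randn(16, 128) for _ in range(2)]
print(multi_level_loss(emb_a, emb_b))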
RESILIENT MULTIPLE CHOICE LEARNING: A LEARNED SCORING SCHEME WITH APPLICATION TO AUDIO SCENE ANALYSIS
V. Letzelter, M. Fontaine, P. Pérez, G. Richard, S. Essid, and M. Chen
In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS 2023), Dec 2023
We introduce Resilient Multiple Choice Learning (rMCL), an extension of the MCL approach for conditional distribution estimation in regression settings where multiple targets may be sampled for each training input. Multiple Choice Learning is a simple framework to tackle multimodal density estimation, using the Winner-Takes-All (WTA) loss for a set of hypotheses. In regression settings, the existing MCL variants focus on merging the hypotheses, thereby eventually sacrificing the diversity of the predictions. In contrast, our method relies on a novel learned scoring scheme underpinned by a mathematical framework based on Voronoi tessellations of the output space, from which we can derive a probabilistic interpretation. After empirically validating rMCL with experiments on synthetic data, we further assess its merits on the sound source localization problem, demonstrating its practical usefulness and the relevance of its interpretation.
@inproceedings{letzelter:hal-04216055,
  title = {Resilient Multiple Choice Learning: A learned scoring scheme with application to audio scene analysis},
  author = {Letzelter, Victor and Fontaine, Mathieu and Perez, Patrick and Richard, Gael and Essid, Slim and Chen, Mickael},
  booktitle = {Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS 2023)},
  address = {New Orleans, United States},
  url = {https://hal.science/hal-04216055},
  year = {2023},
  month = dec,
  hal_id = {hal-04216055},
  hal_version = {v1}
}
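As a hedged illustration, the snippet below combines a winner-takes-all regression loss with a scoring head trained to predict which hypothesis wins; this is one plausible reading of a learned scoring scheme, and the paper's exact objective and its handling of multiple targets per input may differ.

# Hedged sketch of a WTA loss combined with a learned scoring head trained to
# predict which hypothesis wins, in the spirit of rMCL; the paper's exact
# objective and its treatment of multiple targets may differ.
import torch
import torch.nn.functional as F

def rmcl_style_loss(hyps, score_logits, target):
    """hyps: (batch, n_hyps, dim), score_logits: (batch, n_hyps), target: (batch, dim)."""
    sq_err = ((hyps - target.unsqueeze(1)) ** 2).sum(dim=-1)   # (batch, n_hyps)
    winner = sq_err.argmin(dim=1)                              # index of the best hypothesis
    wta = sq_err.gather(1, winner.unsqueeze(1)).mean()         # regression loss on the winner only
    winner_onehot = F.one_hot(winner, hyps.shape[1]).float()
    scoring = F.binary_cross_entropy_with_logits(score_logits, winner_onehot)
    return wta + scoring

hyps = torch.randn(8, 4, 3, requires_grad=True)
scores = torch.randn(8, 4, requires_grad=True)
target = torch.randn(8, 3)
rmcl_style_loss(hyps, scores, target).backward()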
PRETEXT TASKS SELECTION FOR MULTITASK SELF-SUPERVISED AUDIO REPRESENTATION LEARNING
S. Zaiem, T. Parcollet, S. Essid, and A. Heba
IEEE Journal of Selected Topics in Signal Processing, Dec 2022
Through solving pretext tasks, self-supervised learning leverages unlabeled data to extract useful latent representations replacing traditional input features in the downstream task. In audio/speech signal processing, a wide range of features were engineered through decades of research efforts. As it turns out, learning to predict such features (a.k.a. pseudo-labels) has proven to be a particularly relevant pretext task, leading to self-supervised representations that prove effective for downstream tasks. However, methods and common practices for combining such pretext tasks for better performance on the downstream task have not been explored and understood properly. In fact, the process relies almost exclusively on a computationally heavy experimental procedure, which becomes intractable as the number of pretext tasks increases. This paper introduces a method to select a group of pretext tasks among a set of candidates. The method we propose estimates calibrated weights for the partial losses corresponding to the considered pretext tasks during the self-supervised training process. The experiments conducted on automatic speech recognition, speaker and emotion recognition validate our approach, as the groups selected and weighted with our method perform better than classic baselines, thus facilitating the selection and combination of relevant pseudo-labels for self-supervised representation learning.
@article{9846981,
  title = {Pretext Tasks Selection for Multitask Self-Supervised Audio Representation Learning},
  author = {Zaiem, Salah and Parcollet, Titouan and Essid, Slim and Heba, Abdelwahab},
  journal = {IEEE Journal of Selected Topics in Signal Processing},
  year = {2022},
  volume = {16},
  number = {6},
  pages = {1439-1453},
  doi = {10.1109/JSTSP.2022.3195430}
}
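The generic sketch below shows how several pretext-task losses can be combined with per-task weights on top of a shared encoder; the weights here are mere placeholders, whereas the paper estimates calibrated weights through its own selection procedure, and the pseudo-label heads and dimensions are hypothetical.

# Generic sketch of combining several pretext-task losses with per-task weights
# on top of a shared encoder; the weights below are placeholders, not the
# calibrated weights estimated by the paper's procedure.
import torch
import torch.nn as nn

class MultiPretextModel(nn.Module):
    def __init__(self, in_dim=80, hid=256, pretext_dims=(1, 1, 13)):
        super().__init__()
        self.encoder = nn.GRU(in_dim, hid, batch_first=True)
        # One small regression head per pseudo-label (e.g. pitch, energy, MFCCs);
        # the choice of pseudo-labels and their dimensions are hypothetical.
        self.heads = nn.ModuleList([nn.Linear(hid, d) for d in pretext_dims])

    def forward(self, x):
        h, _ = self.encoder(x)                 # (batch, time, hid)
        return [head(h) for head in self.heads]

def weighted_pretext_loss(preds, targets, weights):
    losses = [nn.functional.l1_loss(p, t) for p, t in zip(preds, targets)]
    return sum(w * l for w, l in zip(weights, losses))

model = MultiPretextModel()
x = torch.randn(4, 100, 80)                    # e.g. filterbank frames
targets = [torch.randn(4, 100, d) for d in (1, 1, 13)]
loss = weighted_pretext_loss(model(x), targets, weights=(0.5, 0.2, 1.0))
loss.backward()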
DNN-BASED MASK ESTIMATION FOR DISTRIBUTED SPEECH ENHANCEMENT IN SPATIALLY UNCONSTRAINED MICROPHONE ARRAYS
N. Furnon, R. Serizel, S. Essid, and I. Illina
IEEE/ACM Transactions on Audio, Speech and Language Processing, Dec 2021
Deep neural network (DNN)-based speech enhancement algorithms in microphone arrays have now proven to be efficient solutions to speech understanding and speech recognition in noisy environments. However, in the context of ad-hoc microphone arrays, many challenges remain and raise the need for distributed processing. In this paper, we propose to extend a previously introduced distributed DNN-based time-frequency mask estimation scheme that can efficiently use spatial information in the form of so-called compressed signals, which are pre-filtered target estimates. We study the performance of this algorithm, named Tango, under realistic acoustic conditions and investigate practical aspects of its optimal application. We show that the nodes in the microphone array cooperate by taking advantage of their spatial coverage in the room. We also propose to use the compressed signals to convey not only the target estimate but also the noise estimate, in order to exploit the acoustic diversity recorded throughout the microphone array.
@article{furnon:hal-02985867,
  title = {DNN-based mask estimation for distributed speech enhancement in spatially unconstrained microphone arrays},
  author = {Furnon, Nicolas and Serizel, Romain and Essid, Slim and Illina, Irina},
  journal = {IEEE/ACM Transactions on Audio, Speech and Language Processing},
  volume = {29},
  pages = {2310-2323},
  year = {2021},
  doi = {10.1109/TASLP.2021.3092838},
  url = {https://hal.archives-ouvertes.fr/hal-02985867},
  hal_id = {hal-02985867},
  hal_version = {v3}
}
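The sketch below illustrates per-node mask estimation: a network receives the magnitude spectrogram of the node's reference channel stacked with the compressed signals received from the other nodes and outputs a time-frequency mask; the architecture and dimensions are hypothetical, not those of the paper.

# Hedged sketch of per-node mask estimation from the local reference channel
# plus compressed signals from the other nodes; architecture and shapes are
# hypothetical, not the paper's.
import torch
import torch.nn as nn

class MaskEstimator(nn.Module):
    def __init__(self, n_freq=257, n_inputs=3, hid=256):
        super().__init__()
        self.rnn = nn.LSTM(n_freq * n_inputs, hid, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hid, n_freq)

    def forward(self, mags):
        """mags: (batch, time, n_inputs, n_freq) magnitude spectrograms of the
        local channel and of the compressed signals received from other nodes."""
        b, t, c, f = mags.shape
        h, _ = self.rnn(mags.reshape(b, t, c * f))
        return torch.sigmoid(self.out(h))      # (batch, time, n_freq) mask in [0, 1]

mags = torch.rand(2, 120, 3, 257)
mask = MaskEstimator()(mags)
print(mask.shape)                              # torch.Size([2, 120, 257])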
WEAKLY SUPERVISED REPRESENTATION LEARNING FOR AUDIO-VISUAL SCENE ANALYSIS
S. Parekh, S. Essid, A. Ozerov, N. Duong, P. Pérez, and G. Richard
IEEE/ACM Transactions on Audio, Speech, and Language Processing, Dec 2019
Audiovisual (AV) representation learning is an important task from the perspective of designing machines with the ability to understand complex events. To this end, we propose a novel multimodal framework that instantiates multiple instance learning. Specifically, we develop methods that identify events and localize corresponding AV cues in unconstrained videos. Importantly, this is done using weak labels, where only video-level event labels are known without any information about their location in time. We show that the learnt representations are useful for performing several tasks such as event/object classification, audio event detection, audio source separation and visual object localization. An important feature of our method is its capacity to learn from unsynchronized audiovisual events. We also demonstrate our framework’s ability to separate out the audio source of interest through a novel use of nonnegative matrix factorization. State-of-the-art classification results, with an F1-score of 65.0, are achieved on DCASE 2017 smart cars challenge data with promising generalization to diverse object types such as musical instruments. Visualizations of localized visual regions and audio segments substantiate our system’s efficacy, especially when dealing with noisy situations where modality-specific cues appear asynchronously.
@article{8926380,
  title = {Weakly Supervised Representation Learning for Audio-Visual Scene Analysis},
  author = {Parekh, S. and Essid, Slim and Ozerov, A. and Duong, N. Q. K. and Pérez, P. and Richard, G.},
  journal = {IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  year = {2019},
  volume = {28},
  pages = {416-428}
}
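The snippet below illustrates the multiple-instance-learning step: per-segment class scores are pooled into a clip-level prediction trained against the weak, video-level label only; the pooling choice and shapes are illustrative, not the paper's exact design.

# Sketch of the multiple-instance-learning step: per-segment class scores are
# pooled into a clip-level prediction and trained against the weak, video-level
# label only. Pooling choice and shapes are illustrative.
import torch
import torch.nn as nn

class MILHead(nn.Module):
    def __init__(self, feat_dim=512, n_classes=17):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, segment_feats):
        """segment_feats: (batch, n_segments, feat_dim) audio or visual features."""
        segment_scores = self.classifier(segment_feats)     # (batch, n_segments, n_classes)
        clip_scores, _ = segment_scores.max(dim=1)          # max-pooling over segments
        return clip_scores, segment_scores                  # segment scores localize the event

feats = torch.randn(4, 10, 512)
weak_labels = torch.randint(0, 2, (4, 17)).float()          # video-level labels only
clip_scores, _ = MILHead()(feats)
loss = nn.functional.binary_cross_entropy_with_logits(clip_scores, weak_labels)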
Contact
Télécom Paris - Room 5C
19, place Marguerite Perey
91120 Palaiseau - FRANCE
Indications on how to get there can be found here.