PUBLICATIONS | Slim Essid's research page

2026

MULTIPLE CHOICE LEARNING OF LOW-RANK ADAPTERS FOR LANGUAGE MODELING

V. Letzelter, H. Malard, M. Fontaine, G. Richard, S. Essid, A. Bursuc, and P. Pérez

In Forty-third International Conference on Machine Learning (ICML), 2026

We propose LoRA-MCL, a training scheme that extends next-token prediction in language models with a method designed to decode diverse, plausible sentence continuations at inference time. Traditional language modeling is an intrinsically ill-posed problem: given a context, multiple “futures” may be equally plausible. Our approach leverages Multiple Choice Learning (MCL) and the Winner-Takes-All loss to efficiently handle ambiguity through Low-Rank Adaptation. We provide a theoretical interpretation of applying MCL to language modeling, assuming the data is generated from a mixture of distributions. We illustrate the proposed approach using mixtures of Markov chains. We then demonstrate with experiments on visual and audio captioning, as well as machine translation, that our method achieves high diversity and relevance in generated outputs.
@inproceedings{VL:ICML-26,
  title = {Multiple Choice Learning of Low-Rank Adapters for Language Modeling},
  author = {Letzelter, Victor and Malard, Hugo and Fontaine, Mathieu and Richard, Gaël and Essid, Slim and Bursuc, Andrei and Pérez, Patrick},
  booktitle = {Forty-third International Conference on Machine Learning (ICML)},
  year = {2026},
  url = {https://openreview.net/forum?id=CCO35e4DCO}
}

TACO: TRAINING-FREE SOUND PROMPTED SEGMENTATION VIA SEMANTICALLY CONSTRAINED AUDIO-VISUAL CO-FACTORIZATION

H. Malard, M. Olvera, S. Lathuilière, and S. Essid

Transactions on Machine Learning Research, 2026

Large-scale pre-trained audio and image models demonstrate an unprecedented degree of generalization, making them suitable for a wide range of applications. Here, we tackle the specific task of sound-prompted segmentation, aiming to segment image regions corresponding to objects heard in an audio signal. Most existing approaches tackle this problem by fine-tuning pre-trained models or by training additional modules specifically for the task. We adopt a different strategy: we introduce a training-free approach that leverages Non-negative Matrix Factorization (NMF) to co-factorize audio and visual features from pre-trained models so as to reveal shared interpretable concepts. These concepts are passed on to an open-vocabulary segmentation model for precise segmentation maps. By using frozen pre-trained models, our method achieves high generalization and establishes state-of-the-art performance in unsupervised sound-prompted segmentation, significantly surpassing previous unsupervised methods.
@article{HM:TMLR:2026,
  title = {TACO: Training-free Sound Prompted Segmentation via Semantically Constrained Audio-visual CO-factorization},
  journal = {Transactions on Machine Learning Research},
  year = {2026},
  author = {Malard, Hugo and Olvera, Michel and Lathuilière, Stéphane and Essid, Slim}
}

TINYMU: A COMPACT AUDIO-LANGUAGE MODEL FOR MUSIC UNDERSTANDING

X. Li, A. Quelennec, and S. Essid

In ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2026

Music understanding and reasoning are central challenges in the Music Information Research field, with applications ranging from retrieval and recommendation to music agents and virtual assistants. Recent Large Audio-Language Models (LALMs) have shown remarkable progress in answering music-related questions by following user instructions. However, their massive scale, often billions of parameters, results in expensive training, slow inference, and limited deployability on edge devices. In this work, we present TinyMU, a lightweight (229M) Music-Language Model (MLM) that achieves performance comparable to much larger LALMs while remaining efficient and compact. To train TinyMU, we introduce MusicSkills-3.5M, a carefully curated, music-grounded question-answering dataset with 3.5M samples. Spanning multiple-choice, binary, and open-ended formats, this dataset provides fine-grained supervision across diverse musical concepts. For its architecture, TinyMU leverages MATPAC++, the SOTA self-supervised audio encoder for fine-grained feature extraction. Paired with a lightweight linear projector, it efficiently aligns audio embeddings with the language model. Through extensive evaluation, we show that TinyMU performs strongly in both basic music understanding and complex reasoning. Notably, on the MuChoMusic benchmark, it achieves 82% of SOTA LALM’s performance despite being 35× smaller, highlighting the potential of small MLMs under constrained computational budgets.
@inproceedings{MA-ICASSP-26,
  title = {{TINYMU: A Compact Audio-Language Model for Music Understanding}},
  author = {Li, Xiquan and Quelennec, Aurian and Essid, Slim},
  booktitle = {ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  address = {Barcelona, Spain},
  year = {2026},
  month = may
}

S-SONDO: SELF-SUPERVISED KNOWLEDGE DISTILLATION FOR GENERAL AUDIO FOUNDATION MODELS

M. El Adlouni, A. Quelennec, P. Chouteau, G. Peeters, and S. Essid

In ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2026

General audio foundation models have recently achieved remarkable progress, enabling strong performance across diverse tasks. However, state-of-the-art models remain extremely large, often with hundreds of millions of parameters, leading to high inference costs and limited deployability on edge devices. Knowledge distillation is a proven strategy for model compression, but prior work in audio has mostly focused on supervised settings, relying on class logits, intermediate features, or architecture-specific techniques. Such assumptions exclude models that output only embeddings, such as self-supervised or metric-learning models. We introduce S-SONDO (Self-Supervised KnOwledge DistillatioN for General AuDio FOundation Models), the first framework to distill general audio models using only their output embeddings. By avoiding the need for logits or layer-level alignment, S-SONDO is architecture-agnostic and broadly applicable to embedding-based teachers. We demonstrate its effectiveness by distilling two audio foundation models into three efficient students that are up to 61x smaller while retaining up to 96% of teacher performance. We also provide practical insights on loss choice and clustering-based balanced data sampling.
@inproceedings{MA-ICASSP-27,
  title = {{S-SONDO: Self-Supervised Knowledge Distillation for General Audio Foundation Models}},
  author = {El Adlouni, Mohammed Ali and Quelennec, Aurian and Chouteau, Pierre and Peeters, Geoffroy and Essid, Slim},
  booktitle = {ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  address = {Barcelona, Spain},
  year = {2026},
  month = may
}

2025

CONTROLLING CONTRASTIVE SELF-SUPERVISED LEARNING WITH KNOWLEDGE-DRIVEN MULTIPLE HYPOTHESES: APPLICATION TO BEAT TRACKING

A. Gagnere, S. Essid, and G. Peeters

May 2025

@misc{gagnere2025controllingcontrastiveselfsupervisedlearning,
  title = {Controlling Contrastive Self-Supervised Learning with Knowledge-Driven Multiple Hypotheses: Application to Beat Tracking},
  author = {Gagnere, Antonin and Essid, Slim and Peeters, Geoffroy},
  year = {2025},
  eprint = {2510.25560},
  archiveprefix = {arXiv},
  primaryclass = {cs.SD},
  url = {https://arxiv.org/abs/2510.25560}
}

MULTIPLE CHOICE LEARNING OF LOW RANK ADAPTERS FOR LANGUAGE MODELING

V. Letzelter, H. Malard, M. Fontaine, G. Richard, S. Essid, A. Bursuc, and P. Pérez

May 2025

Pre-print

We propose LoRA-MCL, a training scheme that extends next-token prediction in language models with a method designed to decode diverse, plausible sentence continuations at inference time. Traditional language modeling is an intrinsically ill-posed problem: given a context, multiple futures may be equally plausible. Our approach leverages Multiple Choice Learning (MCL) and the Winner-Takes-All (WTA) loss to efficiently handle ambiguity through Low-Rank Adaptation (LoRA). We provide a theoretical interpretation of applying Multiple Choice Learning to Language Modeling, assuming the data is generated from a mixture of distributions. To illustrate the proposed approach, we use data sampled from mixtures of Markov chains. We then demonstrate with extensive experiments on real-world visual and audio captioning tasks that our method achieves high diversity and relevance in generated outputs.
@misc{letzelter2025multiplechoicelearninglow,
  title = {Multiple Choice Learning of Low Rank Adapters for Language Modeling},
  author = {Letzelter, Victor and Malard, Hugo and Fontaine, Mathieu and Richard, Gaël and Essid, Slim and Bursuc, Andrei and Pérez, Patrick},
  year = {2025},
  eprint = {2507.10419},
  archiveprefix = {arXiv},
  primaryclass = {cs.LG},
  note = {Pre-print}
}

MATPAC++: ENHANCED MASKED LATENT PREDICTION FOR SELF-SUPERVISED AUDIO REPRESENTATION LEARNING

A. Quelennec, P. Chouteau, G. Peeters, and S. Essid

May 2025

Pre-print

Masked latent prediction has emerged as a leading paradigm in self-supervised learning (SSL), especially for general audio and music representation learning. While recent methods have demonstrated strong performance, the role of the predictor module used at the output of such SSL systems remains mainly overlooked, despite being crucial for solving the pretext task at hand. In particular, this module should be able to deal with the ambiguity inherent in audio content, especially when it is composed of multiple sound sources. This work proposes a novel enhancement: integrating Multiple Choice Learning (MCL) to explicitly model prediction ambiguity and improve representation quality. We build on top of the recently proposed MATPAC system, improving its prediction and unsupervised classification pretext tasks with MCL. We extensively evaluate our method, MATPAC++, through both linear probing across multiple downstream tasks and fine-tuning on AudioSet, employing a unified protocol that enables rigorous and fair comparisons with state-of-the-art SSL approaches. Results show that our proposal achieves state-of-the-art when fine-tuned on AudioSet and overall state-of-the-art scores on downstream tasks. Additionally, we examine domain specialisation by training exclusively on music data, where our model achieves state-of-the-art performance with significantly improved efficiency.
@misc{quelennec2025matpacenhancedmaskedlatent,
  title = {MATPAC++: Enhanced Masked Latent Prediction for Self-Supervised Audio Representation Learning},
  author = {Quelennec, Aurian and Chouteau, Pierre and Peeters, Geoffroy and Essid, Slim},
  year = {2025},
  eprint = {2508.12709},
  archiveprefix = {arXiv},
  primaryclass = {cs.SD},
  note = {Pre-print}
}

IKNOW-AUDIO: INTEGRATING KNOWLEDGE GRAPHS WITH AUDIO-LANGUAGE MODELS

M. Olvera, C. Wang, P. Stamatiadis, G. Richard, and S. Essid

In The 2025 Conference on Empirical Methods in Natural Language Processing, May 2025

Contrastive Language–Audio Pretraining (CLAP) models learn by aligning audio and text in a shared embedding space, enabling powerful zero-shot recognition. However, their performance is highly sensitive to prompt formulation and language nuances, and they often inherit semantic ambiguities and spurious correlations from noisy pretraining data. While prior work has explored prompt engineering, adapters, and prefix tuning to address these limitations, the use of structured prior knowledge remains largely unexplored. We present iKnow-audio, a framework that integrates knowledge graphs with audio-language models to provide robust semantic grounding. iKnow-audio builds on the Audio-centric Knowledge Graph (AKG), which encodes ontological relations comprising semantic, causal, and taxonomic connections reflective of everyday sound scenes and events. By training knowledge graph embedding models on the AKG and refining CLAP predictions through this structured knowledge, iKnow-audio improves disambiguation of acoustically similar sounds and reduces reliance on prompt engineering. Comprehensive zero-shot evaluations across six benchmark datasets demonstrate consistent gains over baseline CLAP, supported by embedding-space analyses that highlight improved relational grounding.
@inproceedings{olvera2025iknowaudio,
  title = {iKnow-audio: Integrating Knowledge Graphs with Audio-Language Models},
  author = {Olvera, Michel and Wang, Changhong and Stamatiadis, Paraskevas and Richard, Gaël and Essid, Slim},
  booktitle = {The 2025 Conference on Empirical Methods in Natural Language Processing},
  year = {2025}
}

MTSE: MULTI-TARGET SPEAKER EXTRACTION FOR CONVERSATION SCENARIOS

T. Serre, M. Fontaine, E. Benhaim, and S. Essid

In Interspeech, Aug 2025

Target Speaker Extraction (TSE) aims to capture a desired voice among other interfering ones and/or background noise using a reference excerpt acquired during the enrollment phase. While useful in many applications, existing TSE systems cannot handle the scenario where several voices need to be enrolled and/or targeted. In this work, we address this new task, called multi-target speaker extraction (MTSE), which consists of extracting multiple target speakers in a mixture, possibly involving other interfering voices, using multiple speaker embeddings. Such models can thus be used by multiple users without the need for re-enrollment. We propose a curriculum learning scheme to adapt well-known TSE models to the MTSE task. We prove its effectiveness and obtain, for the first time, successful MTSE results on meeting-type excerpts. We also present single-speaker TSE results with multiple enrolled speakers, proving the robustness and versatility of our solution.
@inproceedings{serre:interspeech25,
  title = {MTSE: Multi-Target Speaker Extraction for Conversation Scenarios},
  author = {Serre, Thomas and Fontaine, Mathieu and Benhaim, Eric and Essid, Slim},
  booktitle = {Interspeech},
  address = {Rotterdam, The Netherlands},
  year = {2025},
  month = aug
}

IS³: GENERIC IMPULSIVE–STATIONARY SOUND SEPARATION IN ACOUSTIC SCENES USING DEEP FILTERING

C. Berger, R. Badeau, and S. Essid

In WASPAA 2025 – IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Oct 2025

We are interested in audio systems capable of performing differentiated processing of stationary backgrounds and isolated acoustic events within an acoustic scene, whether for applying specific processing methods to each part or for focusing solely on one while ignoring the other. Such systems have applications in real-world scenarios, including robust adaptive audio rendering systems (e.g., EQ or compression), plosive attenuation in voice mixing, noise suppression or reduction, robust acoustic event classification or even bioacoustics. To this end, we introduce IS³, a neural network designed for Impulsive–Stationary Sound Separation that isolates impulsive acoustic events from the stationary background using a deep filtering approach and can act as a pre-processing stage for the above-mentioned tasks. To ensure optimal training, we propose a sophisticated data generation pipeline that curates and adapts existing datasets for this task. We demonstrate that a learning-based approach, built on a relatively lightweight neural architecture and trained with well-designed and varied data, is successful in this previously unaddressed task, outperforming the Harmonic–Percussive Sound Separation masking method, adapted from music signal processing research, and wavelet filtering on objective separation metrics.
@inproceedings{CB-WASPAA-25,
  title = {{IS³: Generic Impulsive--Stationary Sound Separation in Acoustic Scenes using Deep Filtering}},
  author = {Berger, Clémentine and Badeau, Roland and Essid, Slim},
  booktitle = {WASPAA 2025 -- IEEE Workshop on Applications of Signal Processing to Audio and Acoustics},
  address = {Granlibakken Tahoe, Tahoe City, CA, USA},
  year = {2025},
  month = oct
}

TACO: TRAINING-FREE SOUND PROMPTED SEGMENTATION VIA SEMANTICALLY CONSTRAINED AUDIO-VISUAL CO-FACTORIZATION

H. Malard, M. Olvera, S. Lathuilière, and S. Essid

Oct 2025

Pre-print

Large-scale pre-trained audio and image models demonstrate an unprecedented degree of generalization, making them suitable for a wide range of applications. Here, we tackle the specific task of sound-prompted segmentation, aiming to segment image regions corresponding to objects heard in an audio signal. Most existing approaches tackle this problem by fine-tuning pre-trained models or by training additional modules specifically for the task. We adopt a different strategy: we introduce a training-free approach that leverages Non-negative Matrix Factorization (NMF) to co-factorize audio and visual features from pre-trained models so as to reveal shared interpretable concepts. These concepts are passed on to an open-vocabulary segmentation model for precise segmentation maps. By using frozen pre-trained models, our method achieves high generalization and establishes state-of-the-art performance in unsupervised sound-prompted segmentation, significantly surpassing previous unsupervised methods.
@misc{malard2025tacotrainingfreesoundprompted,
  title = {TACO: Training-free Sound Prompted Segmentation via Semantically Constrained Audio-visual CO-factorization},
  author = {Malard, Hugo and Olvera, Michel and Lathuilière, Stéphane and Essid, Slim},
  year = {2025},
  eprint = {2412.01488},
  archiveprefix = {arXiv},
  primaryclass = {eess.AS},
  note = {Pre-print}
}

MASKED LATENT PREDICTION AND CLASSIFICATION FOR SELF-SUPERVISED AUDIO REPRESENTATION LEARNING

A. Quelennec, P. Chouteau, G. Peeters, and S. Essid

In ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr 2025

Recently, self-supervised learning methods based on masked latent prediction have proven to encode input data into powerful representations. However, during training, the learned latent space can be further transformed to extract higher-level information that could be more suited for downstream classification tasks. Therefore, we propose a new method: MAsked latenT Prediction And Classification (MATPAC), which is trained with two pretext tasks solved jointly. As in previous work, the first pretext task is a masked latent prediction task, ensuring a robust input representation in the latent space. The second one is unsupervised classification, which utilises the latent representations of the first pretext task to match probability distributions between a teacher and a student. We validate the MATPAC method by comparing it to other state-of-the-art proposals and conducting ablation studies. MATPAC reaches state-of-the-art self-supervised learning results on reference audio classification datasets such as OpenMIC, GTZAN, ESC-50 and US8K, and outperforms the results of comparable supervised methods for musical auto-tagging on Magna-tag-a-tune.
@inproceedings{quelennec:hal-04921274,
  title = {{Masked Latent Prediction and Classification for Self-Supervised Audio Representation Learning}},
  author = {Quelennec, Aurian and Chouteau, Pierre and Peeters, Geoffroy and Essid, Slim},
  booktitle = {ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  address = {Hyderabad, India},
  year = {2025},
  month = apr,
  keywords = {self-supervised audio representation learning audio spectrogram transformers ; self-supervised ; audio representation learning ; audio spectrogram transformers},
  hal_id = {hal-04921274},
  hal_version = {v1}
}

EEG–METABOLIC COUPLING AND TIME LIMIT AT VO2MAX DURING CONSTANT-LOAD EXERCISE

L. Poinsard, C. Berthomier, M. Clémençon, M. Brandewinder, S. Essid, C. Damon, F. Rigaud, A. Bénichoux, E. Maby, L. Fornoni, P. Bouchet, P. Beers, B. Massot, P. Revol, T. Creveaux, C. Collet, J. Mattout, V. Pialoux, and V. Billat

Journal of Functional Morphology and Kinesiology, Sep 2025

@article{poinsard:hal-05321912,
  title = {{EEG--Metabolic Coupling and Time Limit at VO2max During Constant-Load Exercise}},
  author = {Poinsard, Luc and Berthomier, Christian and Cl{\'e}men{\c c}on, Michel and Brandewinder, Marie and Essid, Slim and Damon, C{\'e}cilia and Rigaud, Fran{\c c}ois and B{\'e}nichoux, Alexis and Maby, Emmanuel and Fornoni, Lesly and Bouchet, Patrick and van Beers, Pascal and Massot, Bertrand and Revol, Patrice and Creveaux, Thomas and Collet, Christian and Mattout, J{\'e}r{\'e}mie and Pialoux, Vincent and Billat, V{\'e}ronique},
  url = {https://hal.science/hal-05321912},
  journal = {{Journal of Functional Morphology and Kinesiology}},
  publisher = {{MDPI}},
  volume = {10},
  number = {4},
  pages = {369-1:369-25},
  year = {2025},
  month = sep,
  doi = {10.3390/jfmk10040369},
  keywords = {endurance ; VO2max ; time limit ; high-intensity exercise ; exhaustion ; electroencephalography},
  hal_id = {hal-05321912},
  hal_version = {v1}
}

O-EENC-SD: EFFICIENT ONLINE END-TO-END NEURAL CLUSTERING FOR SPEAKER DIARIZATION

E. Gruttadauria, M. Fontaine, J. Le Roux, and S. Essid

In ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr 2025

We introduce O-EENC-SD: an end-to-end online speaker diarization system based on EEND-EDA, featuring a novel RNN-based stitching mechanism for online prediction. In particular, we develop a novel centroid refinement decoder whose usefulness is assessed through a rigorous ablation study. Our system provides key advantages over existing methods: a hyperparameter-free solution compared to unsupervised clustering approaches, and a more efficient alternative to current online end-to-end methods, which are computationally costly. We demonstrate that O-EENC-SD is competitive with the state of the art in the two-speaker conversational telephone speech domain, as tested on the CallHome dataset. Our results show that O-EENC-SD provides a great trade-off between DER and complexity, even when working on independent chunks with no overlap, making the system extremely efficient.
@inproceedings{EG-ICASSP-25,
  title = {{O-EENC-SD: Efficient Online End-to-End Neural Clustering for Speaker Diarization}},
  author = {Gruttadauria, Elio and Fontaine, Mathieu and {Le Roux}, Jonathan and Essid, Slim},
  booktitle = {ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  address = {Hyderabad, India},
  year = {2025},
  month = apr
}

PERCEPTUAL NOISE-MASKING WITH MUSIC THROUGH DEEP SPECTRAL ENVELOPE SHAPING

C. Berger, R. Badeau, and S. Essid

In ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr 2025

People often listen to music in noisy environments, seeking to isolate themselves from ambient sounds. Indeed, a music signal can mask some of the noise’s frequency components due to the effect of simultaneous masking. In this article, we propose a neural network based on a psychoacoustic masking model, designed to enhance the music’s ability to mask ambient noise by reshaping its spectral envelope with predicted filter frequency responses. The model is trained with a perceptual loss function that balances two constraints: effectively masking the noise while preserving the original music mix and the user’s chosen listening level. We evaluate our approach on simulated data replicating a user’s experience of listening to music with headphones in a noisy environment. The results, based on defined objective metrics, demonstrate that our system improves the state of the art.
@inproceedings{CB-ICASSP-25,
  title = {{Perceptual Noise-Masking with Music through Deep Spectral Envelope Shaping}},
  author = {Berger, Clémentine and Badeau, Roland and Essid, Slim},
  booktitle = {ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  address = {Hyderabad, India},
  year = {2025},
  month = apr
}

CONTRASTIVE KNOWLEDGE DISTILLATION FOR EMBEDDING REFINEMENT IN PERSONALIZED SPEECH ENHANCEMENT

T. Serre, M. Fontaine, E. Benhaim, and S. Essid

In ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr 2025

Personalized speech enhancement (PSE) has shown convincing results when it comes to extracting a known target voice among interfering ones. The corresponding systems usually incorporate a representation of the target voice within the enhancement system, which is extracted from an enrollment clip of the target voice with upstream models. Those models are generally heavy as the speaker embedding’s quality directly affects PSE performance. Yet, embeddings generated beforehand cannot account for the variations of the target voice during inference time. In this paper, we propose to perform on-the-fly refinement of the speaker embedding using a tiny speaker encoder. We first introduce a novel contrastive knowledge distillation methodology in order to train a 150k-parameter encoder from complex embeddings. We then use this encoder within the enhancement system during inference and show that the proposed method greatly improves PSE performance while maintaining a low computational load.
@inproceedings{TS-ICASSP-25,
  title = {{Contrastive Knowledge Distillation for Embedding Refinement in Personalized Speech Enhancement}},
  author = {Serre, Thomas and Fontaine, Mathieu and Benhaim, Eric and Essid, Slim},
  booktitle = {ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  address = {Hyderabad, India},
  year = {2025},
  month = apr
}

MULTIPLE CHOICE LEARNING FOR EFFICIENT SPEECH SEPARATION WITH MANY SPEAKERS

D. Perera, F. Derrida, T. Mariotte, G. Richard, and S. Essid

In ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr 2025

Accepted

Training speech separation models in the supervised setting raises a permutation problem: finding the best assignment between the model predictions and the ground truth separated signals. This inherently ambiguous task is customarily solved using Permutation Invariant Training (PIT). In this article, we instead consider using the Multiple Choice Learning (MCL) framework, which was originally introduced to tackle ambiguous tasks. We demonstrate experimentally on the popular WSJ0-mix and LibriMix benchmarks that MCL matches the performance of PIT, while being computationally advantageous. This opens the door to a promising research direction, as MCL can be naturally extended to handle a variable number of speakers, or to tackle speech separation in the unsupervised setting.
@inproceedings{DP-ICASSP-25,
  title = {{Multiple Choice Learning for Efficient Speech Separation with Many Speakers}},
  author = {Perera, David and Derrida, Francois and Mariotte, Théo and Richard, Ga{\"e}l and Essid, Slim},
  booktitle = {ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  address = {Hyderabad, India},
  note = {Accepted},
  year = {2025},
  month = apr
}
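
The permutation problem mentioned in the speech separation abstract above can be illustrated with a toy sketch. This is only an assumption-laden illustration, not the paper's implementation: the helper names (`mse`, `pit_loss`, `mcl_style_loss`) are hypothetical, signals are plain Python lists, and the per-source loss is a mean squared error. PIT is shown as an exhaustive search over assignments, while the MCL-style loss lets each target pick its closest prediction independently.

```python
from itertools import permutations

def mse(a, b):
    # Mean squared error between two equal-length signals.
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def pit_loss(predictions, targets):
    # Exhaustive Permutation Invariant Training loss: try every one-to-one
    # assignment of predictions to targets and keep the cheapest one.
    best = None
    for perm in permutations(range(len(targets))):
        total = sum(mse(predictions[p], targets[i]) for i, p in enumerate(perm))
        if best is None or total < best[0]:
            best = (total, perm)
    return best  # (loss, perm) where perm[i] is the prediction matched to target i

def mcl_style_loss(predictions, targets):
    # Winner-takes-all style loss: each target is scored against its closest
    # prediction, without enforcing a one-to-one assignment.
    return sum(min(mse(p, t) for p in predictions) for t in targets)

# Toy example: the model outputs the two sources in swapped order.
predictions = [[1.0, 1.0], [0.0, 0.0]]
targets = [[0.0, 0.0], [1.0, 1.0]]
pit_value, pit_perm = pit_loss(predictions, targets)
mcl_value = mcl_style_loss(predictions, targets)
```

The exhaustive PIT search visits n! assignments for n sources, whereas the winner-takes-all variant only needs the n × n table of pairwise errors. That difference is one plausible reading of the computational advantage the abstract refers to.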

2024

ANNEALED MULTIPLE CHOICE LEARNING: OVERCOMING LIMITATIONS OF WINNER-TAKES-ALL WITH ANNEALING

D. Perera, V. Letzelter, T. Mariotte, A. Cortés, M. Chen, S. Essid, and G. Richard

In Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS 2024), Dec 2024

We introduce Annealed Multiple Choice Learning (aMCL) which combines simulated annealing with MCL. MCL is a learning framework handling ambiguous tasks by predicting a small set of plausible hypotheses. These hypotheses are trained using the Winner-takes-all (WTA) scheme, which promotes the diversity of the predictions. However, this scheme may converge toward an arbitrarily suboptimal local minimum, due to the greedy nature of WTA. We overcome this limitation using annealing, which enhances the exploration of the hypothesis space during training. We leverage insights from statistical physics and information theory to provide a detailed description of the model training trajectory. Additionally, we validate our algorithm by extensive experiments on synthetic datasets, on the standard UCI benchmark, and on speech separation.
@inproceedings{DP:NeurIPS-24,
  title = {Annealed Multiple Choice Learning: Overcoming Limitations of Winner-Takes-All with Annealing},
  author = {Perera, David and Letzelter, Victor and Mariotte, Théo and Cortés, Adrien and Chen, Mickael and Essid, Slim and Richard, Gaël},
  booktitle = {Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS 2024)},
  address = {Vancouver, Canada},
  year = {2024},
  month = dec
}

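The Winner-takes-all scheme described in the aMCL abstract above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the error is a squared Euclidean distance, and the temperature-weighted soft assignment is a generic Boltzmann-style relaxation chosen here to suggest how annealing can soften the hard winner selection.

```python
import math

def squared_error(prediction, target):
    # Squared Euclidean distance between one hypothesis and the target.
    return sum((p - t) ** 2 for p, t in zip(prediction, target))

def wta_loss(hypotheses, target):
    # Winner-takes-all: only the closest hypothesis contributes to the loss.
    errors = [squared_error(h, target) for h in hypotheses]
    winner = min(range(len(errors)), key=errors.__getitem__)
    return errors[winner], winner

def soft_assignment(errors, temperature):
    # Assumed relaxation: Boltzmann weights over hypotheses. A high temperature
    # spreads weight across hypotheses; a low one approaches the hard winner.
    lowest = min(errors)  # subtract the minimum for numerical stability
    weights = [math.exp(-(e - lowest) / temperature) for e in errors]
    total = sum(weights)
    return [w / total for w in weights]

hypotheses = [[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]]
target = [1.0, 2.0]
loss, winner = wta_loss(hypotheses, target)
errors = [squared_error(h, target) for h in hypotheses]
warm = soft_assignment(errors, temperature=100.0)
cold = soft_assignment(errors, temperature=0.1)
```

With a high temperature every hypothesis receives a comparable share of the update, and lowering the temperature concentrates the weight on the winner, which recovers the plain WTA behaviour in the limit.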
AN EYE FOR AN EAR: ZERO-SHOT AUDIO DESCRIPTION LEVERAGING AN IMAGE CAPTIONER WITH AUDIO-VISUAL TOKEN DISTRIBUTION MATCHING

H. Malard, M. Olvera, S. Lathuilière, and S. Essid

In Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS 2024), Dec 2024

Multimodal large language models have fueled progress in image captioning. These models, fine-tuned on vast image datasets, exhibit a deep understanding of semantic concepts. In this work, we show that this ability can be re-purposed for audio captioning, where the joint image-language decoder can be leveraged to describe auditory content associated with image sequences within videos featuring audiovisual content. This can be achieved via multimodal alignment. Yet, this multimodal alignment task is non-trivial due to the inherent disparity between audible and visible elements in real-world videos. Moreover, multimodal representation learning often relies on contrastive learning, facing the challenge of the so-called modality gap which hinders smooth integration between modalities. In this work, we introduce a novel methodology for bridging the audiovisual modality gap by matching the distributions of tokens produced by an audio backbone and those of an image captioner. Our approach aligns the audio token distribution with that of the image tokens, enabling the model to perform zero-shot audio captioning in an unsupervised fashion. This alignment allows for the use of either audio or audiovisual input by combining or substituting the image encoder with the aligned audio encoder. Our method achieves significantly improved performances in zero-shot audio captioning, compared to existing approaches.
@inproceedings{HM:NeurIPS-24,
  title = {An Eye for an Ear: Zero-Shot Audio Description Leveraging an Image Captioner with Audio-Visual Token Distribution Matching},
  author = {Malard, Hugo and Olvera, Michel and Lathuilière, Stéphane and Essid, Slim},
  booktitle = {Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS 2024)},
  address = {Vancouver, Canada},
  year = {2024},
  month = dec
}

ELECTROENCEPHALOGRAPHY RESPONSE DURING AN INCREMENTAL TEST ACCORDING TO THE V̇O2MAX PLATEAU INCIDENCE

V. Billat, C. Berthomier, M. Clémençon, M. Brandewinder, S. Essid, C. Damon, F. Rigaud, A. Bénichoux, E. Maby, L. Fornoni, P. Bouchet, P. Beers, B. Massot, P. Revol, L. Poinsard, T. Creveaux, C. Collet, J. Mattout, and V. Pialoux

Applied Sciences, Jun 2024

@article{billat:hal-04688068,
  title = {{Electroencephalography Response during an Incremental Test According to the V̇O2max Plateau Incidence}},
  author = {Billat, V{\'e}ronique and Berthomier, Christian and Cl{\'e}men{\c c}on, Michel and Brandewinder, Marie and Essid, Slim and Damon, C{\'e}cilia and Rigaud, Fran{\c c}ois and B{\'e}nichoux, Alexis and Maby, Emmanuel and Fornoni, Lesly and Bouchet, Patrick and van Beers, Pascal and Massot, Bertrand and Revol, Patrice and Poinsard, Luc and Creveaux, Thomas and Collet, Christian and Mattout, J{\'e}r{\'e}mie and Pialoux, Vincent},
  url = {https://hal.science/hal-04688068},
  journal = {{Applied Sciences}},
  publisher = {{Multidisciplinary Digital Publishing Institute (MDPI)}},
  volume = {14},
  number = {13},
  pages = {5411},
  year = {2024},
  month = jun,
  doi = {10.3390/app14135411},
  keywords = {EEG ; exhausting exercise ; maximal oxygen consumption ; fatigue ; central governor ; endurance ; cycling},
  hal_id = {hal-04688068},
  hal_version = {v1}
}

A SOUND DESCRIPTION: EXPLORING PROMPT TEMPLATES AND CLASS DESCRIPTIONS TO ENHANCE ZERO-SHOT AUDIO CLASSIFICATION

M. Olvera, P. Stamatiadis, and S. Essid

In International workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), Nov 2024

Audio-text models trained via contrastive learning offer a practical approach to perform audio classification through natural language prompts, such as “this is a sound of” followed by category names. In this work, we explore alternative prompt templates for zero-shot audio classification, demonstrating the existence of higher-performing options. First, we find that the formatting of the prompts significantly affects performance so that simply prompting the models with properly formatted class labels performs competitively with optimized prompt templates and even prompt ensembling. Moreover, we look into complementing class labels by audio-centric descriptions. By leveraging large language models, we generate textual descriptions that prioritize acoustic features of sound events to disambiguate between classes, without extensive prompt engineering. We show that prompting with class descriptions leads to state-of-the-art results in zero-shot audio classification across major ambient sound datasets. Remarkably, this method requires no additional training and remains fully zero-shot.
@inproceedings{MO:DCASE-24,
  title = {{A Sound Description: Exploring Prompt Templates and Class Descriptions to Enhance Zero-Shot Audio Classification}},
  author = {Olvera, Michel and Stamatiadis, Paraskevas and Essid, Slim},
  booktitle = {{International workshop on Detection and Classification of Acoustic Scenes and Events (DCASE)}},
  address = {Tokyo, Japan},
  year = {2024},
  month = nov
}

SALT: STANDARDIZED AUDIO EVENT LABEL TAXONOMY

P. Stamatiadis, M. Olvera, and S. Essid

In International workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), Nov 2024

Machine listening systems often rely on fixed taxonomies to organize and label audio data, key for training and evaluating deep neural networks (DNNs) and other supervised algorithms. However, such taxonomies face significant constraints: they are composed of application-dependent predefined categories, which hinders the integration of new or varied sounds, and they exhibit limited cross-dataset compatibility due to inconsistent labeling standards. To overcome these limitations, we introduce SALT: Standardized Audio event Label Taxonomy. Building upon the hierarchical structure of AudioSet’s ontology, our taxonomy extends and standardizes labels across 24 publicly available environmental sound datasets, allowing the mapping of class labels from diverse datasets to a unified system. Our proposal comes with a new Python package designed for navigating and utilizing this taxonomy, easing cross-dataset label searching and hierarchical exploration. Notably, our package allows effortless data aggregation from diverse sources, hence easy experimentation with combined datasets.
@inproceedings{PS:DCASE-24,
  title = {{SALT: Standardized Audio Event Label Taxonomy}},
  author = {Stamatiadis, Paraskevas and Olvera, Michel and Essid, Slim},
  booktitle = {{International workshop on Detection and Classification of Acoustic Scenes and Events (DCASE)}},
  address = {Tokyo, Japan},
  year = {2024},
  month = nov
}

SPEECH SELF-SUPERVISED REPRESENTATIONS BENCHMARKING: A CASE FOR LARGER PROBING HEADS

S. Zaiem, Y. Kemiche, T. Parcollet, S. Essid, and M. Ravanelli

Computer Speech & Language, Nov 2024

Self-supervised learning (SSL) leverages large datasets of unlabeled speech to reach impressive performance with reduced amounts of annotated data. The high number of proposed approaches fostered the emergence of comprehensive benchmarks that evaluate their performance on a set of downstream tasks exploring various aspects of the speech signal. However, while the number of considered tasks has been growing, most proposals rely upon a single downstream architecture that maps the frozen SSL representations to the task labels. This study examines how benchmarking results are affected by changes in the probing head architecture. Interestingly, we found that altering the downstream architecture structure leads to significant fluctuations in the performance ranking of the evaluated models. Against common practices in speech SSL benchmarking, we evaluate larger-capacity probing heads, showing their impact on performance, inference costs, generalization, and multi-level feature exploitation.
@article{ZAIEM2025101695,
  title = {Speech self-supervised representations benchmarking: A case for larger probing heads},
  journal = {Computer Speech \& Language},
  volume = {89},
  pages = {101695},
  year = {2024},
  issn = {0885-2308},
  doi = {10.1016/j.csl.2024.101695},
  url = {https://www.sciencedirect.com/science/article/pii/S0885230824000780},
  author = {Zaiem, Salah and Kemiche, Youcef and Parcollet, Titouan and Essid, Slim and Ravanelli, Mirco},
  keywords = {Self-supervised learning, Speech processing, Representation learning}
}

A CONTRASTIVE SELF-SUPERVISED LEARNING SCHEME FOR BEAT TRACKING AMENABLE TO FEW-SHOT LEARNING

A. Gagneré, S. Essid, and G. Peeters

In International conference on music information retrieval (ISMIR 2024), Nov 2024

@inproceedings{Gagneré_Essid_Peeters_2024,
  address = {San Francisco, USA},
  title = {A Contrastive Self-Supervised Learning Scheme for Beat Tracking Amenable to Few-Shot Learning},
  booktitle = {International conference on music information retrieval (ISMIR 2024)},
  author = {Gagneré, Antonin and Essid, Slim and Peeters, Geoffroy},
  year = {2024},
  month = nov
}

MUSIC STRUCTURE ANALYSIS WITH EDGE-CONDITIONED GRAPH ATTENTION NETWORKS

M. Buisson, B. McFee, and S. Essid

In International conference on music information retrieval (ISMIR 2024), Nov 2024

@inproceedings{Buisson_2024b,
  address = {San Francisco, USA},
  title = {Music Structure Analysis with Edge-conditioned Graph Attention Networks},
  booktitle = {International conference on music information retrieval (ISMIR 2024)},
  author = {Buisson, Morgan and McFee, Brian and Essid, Slim},
  year = {2024},
  month = nov
}

INVARIANCE-BASED LAYER REGULARIZATION FOR SOUND EVENT DETECTION

D. Perera, S. Essid, and G. Richard

In European signal processing conference (EUSIPCO 2024), Aug 2024

Experimental and theoretical evidence suggests that invariance constraints can improve the performance and generalization capabilities of a classification model. While invariance-based regularization has become part of the standard tool-belt of machine learning practitioners, this regularization is usually applied near the decision layers or at the end of the feature-extracting layers of a deep classification network. However, the optimal placement of invariance constraints inside a deep classifier is still an open question. In particular, it would be beneficial to link it to the structural properties of the network (e.g. its architecture), or its dynamical properties (e.g. the effectively used volume of its latent spaces). The purpose of this article is to initiate an investigation into these aspects. We use the experimental framework of the DCASE 2023 Task 4A challenge, which considers the training of a sound event classifier in a semi-supervised manner. We show that the optimal placement of invariance constraints improves the performance of the standard baseline for this task.
@inproceedings{Perera_Essid_Richard_2024,
  address = {Lyon, France},
  title = {Invariance-based layer regularization for sound event detection},
  booktitle = {European signal processing conference (EUSIPCO 2024)},
  author = {Perera, David and Essid, Slim and Richard, Gael},
  year = {2024},
  month = aug
}

WINNER-TAKES-ALL LEARNERS ARE GEOMETRY-AWARE CONDITIONAL DENSITY ESTIMATORS

V. Letzelter, D. Perera, C. Rommel, M. Fontaine, S. Essid, G. Richard, and P. Pérez

In International Conference on Machine Learning (ICML 2024), Jul 2024

Winner-takes-all training is a simple learning paradigm, in which the multiple predictions of so-called hypotheses are leveraged to tackle ambiguous tasks. Recently, a connection was established between winner-takes-all training and centroidal Voronoi tessellations, showing that, once trained, the hypotheses should quantize optimally the shape of the conditional distribution to predict. However, probabilistic reliability guarantees for the predictions are missing. In this work, we show how to take advantage of the appealing geometrical properties of the winner-takes-all learners for conditional density estimation, without modifying its original training scheme. We then discuss the competitiveness of our estimator based on novel theoretical and experimental results on both synthetic and audio data.
@inproceedings{Letzelter_Perera_Rommel_Fontaine_Essid_Richard_Pérez_2024,
  title = {Winner-takes-all learners are geometry-aware conditional density estimators},
  url = {https://hal.science/hal-04574640/},
  booktitle = {International Conference on Machine Learning (ICML 2024)},
  author = {Letzelter, Victor and Perera, David and Rommel, Cédric and Fontaine, Mathieu and Essid, Slim and Richard, Gael and Pérez, Patrick},
  address = {Vienna, Austria},
  month = jul,
  year = {2024}
}

ADAPTING PITCH-BASED SELF SUPERVISED LEARNING MODELS FOR TEMPO ESTIMATION

A. Gagneré, S. Essid, and G. Peeters

In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Jul 2024

@inproceedings{10447129,
  author = {Gagneré, Antonin and Essid, Slim and Peeters, Geoffroy},
  booktitle = {ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title = {Adapting Pitch-Based Self Supervised Learning Models for Tempo Estimation},
  year = {2024},
  pages = {956-960},
  keywords = {Training;Adaptation models;Zero-shot learning;Estimation;Training data;Self-supervised learning;Transforms;tempo estimation;self-supervised-learning},
  doi = {10.1109/ICASSP48485.2024.10447129}
}

COLLABORATING FOUNDATION MODELS FOR DOMAIN GENERALIZED SEMANTIC SEGMENTATION

Y. Benigmim, S. Roy, S. Essid, V. Kalogeiton, and S. Lathuilière

In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Jul 2024

Domain Generalized Semantic Segmentation (DGSS) deals with training a model on a labeled source domain with the aim of generalizing to unseen domains during inference. Existing DGSS methods typically obtain robust features by means of Domain Randomization (DR). Such an approach is often limited as it can only account for style diversification and not content. In this work, we take an orthogonal approach to DGSS and propose to use an assembly of CoLlaborative FOUndation models for Domain Generalized Semantic Segmentation (CLOUDS). In detail, CLOUDS is a framework that integrates foundation models (FMs) of various kinds: (i) a CLIP backbone for its robust feature representation, (ii) generative models to diversify the content, thereby covering various modes of the possible target distribution, and (iii) the Segment Anything Model (SAM) for iteratively refining the predictions of the segmentation model. Extensive experiments show that CLOUDS excels in adapting from synthetic to real DGSS benchmarks and under varying weather conditions, notably outperforming prior methods by 5.6% and 6.7% in average mIoU, respectively.
@inproceedings{benigmim2023collaborating,
  title = {Collaborating Foundation models for Domain Generalized Semantic Segmentation},
  author = {Benigmim, Yasser and Roy, Subhankar and Essid, Slim and Kalogeiton, Vicky and Lathuilière, Stéphane},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024)},
  year = {2024},
  eprint = {2312.09788},
  archiveprefix = {arXiv},
  primaryclass = {cs.CV}
}

SELF-SUPERVISED LEARNING OF MULTI-LEVEL AUDIO REPRESENTATIONS FOR MUSIC SEGMENTATION

M. Buisson, B. McFee, S. Essid, and H. Crayencour

IEEE/ACM Transactions on Audio, Speech and Language Processing, Mar 2024

The task of music structure analysis refers to automatically identifying the location and the nature of musical sections within a song. In the supervised scenario, structural annotations generally result from exhaustive data collection processes, which represents one of the main challenges of this task. Moreover, both the subjectivity of music structure and the hierarchical characteristics it exhibits make the obtained structural annotations not fully reliable, in the sense that they do not convey a "universal ground-truth" unlike other tasks in music information retrieval. On the other hand, the quickly growing quantity of available music data has enabled weakly supervised and self-supervised approaches to achieve impressive results on a wide range of music-related problems. In this work, a self-supervised learning method is proposed to learn robust multi-level music representations prior to structural segmentation using contrastive learning. To this end, sets of frames sampled at different levels of detail are used to train a deep neural network in a disentangled manner. The proposed method is evaluated on both flat and multi-level segmentation. We show that each distinct sub-region of the output embeddings can efficiently account for structural similarity at their own targeted level of detail, which ultimately improves performance of downstream flat and multi-level segmentation. Finally, complementary experiments are carried out to study how the obtained representations can be further adapted to specific datasets using a supervised fine-tuning objective in order to facilitate structure retrieval in domains where human annotations remain scarce.
@article{buisson:hal-04485065,
  title = {{Self-Supervised Learning of Multi-level Audio Representations for Music Segmentation}},
  author = {Buisson, Morgan and McFee, Brian and Essid, Slim and Crayencour, Helene-Camille},
  journal = {IEEE/ACM Transactions on Audio, Speech and Language Processing},
  url = {https://hal.science/hal-04485065},
  year = {2024},
  month = mar,
  keywords = {Music;Annotations;Task analysis;Training;Feature extraction;Self-supervised learning;Artificial neural networks;Music structure analysis;structural segmentation;representation learning},
  hal_version = {v1},
  volume = {32},
  pages = {2141-2152},
  doi = {10.1109/TASLP.2024.3379894}
}

ON THE CHOICE OF THE OPTIMAL TEMPORAL SUPPORT FOR AUDIO CLASSIFICATION WITH PRE-TRAINED EMBEDDINGS

A. Quelennec, M. Olvera, G. Peeters, and S. Essid

In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2024), Apr 2024

Current state-of-the-art audio analysis systems rely on pre-trained embedding models, often used off-the-shelf as (frozen) feature extractors. Choosing the best one for a set of tasks is the subject of many recent publications. However, one aspect often overlooked in these works is the influence of the duration of audio input considered to extract an embedding, which we refer to as Temporal Support (TS). In this work, we study the influence of the TS for well-established or emerging pre-trained embeddings, chosen to represent different types of architectures and learning paradigms. We conduct this evaluation using both musical instrument and environmental sound datasets, namely OpenMIC, TAU Urban Acoustic Scenes 2020 Mobile, and ESC-50. We especially highlight that Audio Spectrogram Transformer-based systems (PaSST and BEATs) remain effective with smaller TS, which therefore allows for a drastic reduction in memory and computational cost. Moreover, we show that by choosing the optimal TS we reach competitive results across all tasks. In particular, we improve the state-of-the-art results on OpenMIC, using BEATs and PaSST without any fine-tuning.
@inproceedings{quelennec:hal-04360221,
  title = {{On the Choice of the Optimal Temporal Support for Audio Classification with Pre-trained Embeddings}},
  author = {Quelennec, Aurian and Olvera, Michel and Peeters, Geoffroy and Essid, Slim},
  url = {https://hal.science/hal-04360221},
  booktitle = {{IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2024)}},
  year = {2024},
  month = apr,
  keywords = {audio embeddings ; acoustic scene classification ; instrument recognition ; temporal support ; transformers ; Representation Model},
  hal_id = {hal-04360221},
  hal_version = {v1}
}

ONLINE SPEAKER DIARIZATION OF MEETINGS GUIDED BY SPEECH SEPARATION

E. Gruttadauria, M. Fontaine, and S. Essid

In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2024), Apr 2024

Overlapped speech is notoriously problematic for speaker diarization systems. Consequently, the use of speech separation has recently been proposed to improve their performance. Although promising, speech separation models struggle with realistic data because they are trained on simulated mixtures with a fixed number of speakers. In this work, we introduce a new speech separation-guided diarization scheme suitable for the online speaker diarization of long meeting recordings with a variable number of speakers, as present in the AMI corpus. We envisage ConvTasNet and DPRNN as alternatives for the separation networks, with two or three output sources. To obtain the speaker diarization result, voice activity detection is applied on each estimated source. The final model is fine-tuned end-to-end, after first adapting the separation to real data using AMI. The system operates on short segments, and inference is performed by stitching the local predictions using speaker embeddings and incremental clustering. The results show that our system improves the state-of-the-art on the AMI headset mix, using no oracle information and under full evaluation (no collar and including overlapped speech). Finally, we show the strength of our system particularly on overlapped speech sections.
@inproceedings{gruttadauria:hal-04419041,
  title = {{Online Speaker Diarization of Meetings Guided by Speech Separation}},
  author = {Gruttadauria, Elio and Fontaine, Mathieu and Essid, Slim},
  url = {https://hal.science/hal-04419041},
  booktitle = {{IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2024)}},
  address = {Seoul, South Korea},
  year = {2024},
  month = apr,
  keywords = {Speaker Diarization ; Source separation ; Online inference ; Overlapped speech ; AMI dataset ; Speaker embedding},
  hal_id = {hal-04419041},
  hal_version = {v1}
}

2023

A REPETITION-BASED TRIPLET MINING APPROACH FOR MUSIC SEGMENTATION

M. Buisson, B. McFee, S. Essid, and H. Crayencour

In International Society for Music Information Retrieval (ISMIR), Nov 2023

Contrastive learning has recently appeared as a well-suited method to find representations of music audio signals that are suitable for structural segmentation. However, most existing unsupervised training strategies omit the notion of repetition and therefore fail at encompassing this essential aspect of music structure. This work introduces a triplet mining method which explicitly considers repeating sequences occurring inside a music track by leveraging common audio descriptors. We study its impact on the learned representations through downstream music segmentation. Because musical repetitions can be of different natures, we give further insight on the role of the audio descriptors employed at the triplet mining stage as well as the trade-off existing between the quality of the triplets mined and the quantity of unlabelled data used for training. We observe that our method requires less non-annotated data while remaining competitive against other unsupervised methods trained on a larger corpus.
@inproceedings{buisson:ismir23,
  title = {A Repetition-based Triplet Mining Approach for Music Segmentation},
  author = {Buisson, Morgan and McFee, Brian and Essid, Slim and Crayencour, Hélène C.},
  booktitle = {International Society for Music Information Retrieval (ISMIR)},
  address = {Milano, Italy},
  year = {2023},
  month = nov
}

RESILIENT MULTIPLE CHOICE LEARNING: A LEARNED SCORING SCHEME WITH APPLICATION TO AUDIO SCENE ANALYSIS

V. Letzelter, M. Fontaine, P. Pérez, G. Richard, S. Essid, and M. Chen

In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS 2023), Dec 2023

We introduce Resilient Multiple Choice Learning (rMCL), an extension of the MCL approach for conditional distribution estimation in regression settings where multiple targets may be sampled for each training input. Multiple Choice Learning is a simple framework to tackle multimodal density estimation, using the Winner-Takes-All (WTA) loss for a set of hypotheses. In regression settings, the existing MCL variants focus on merging the hypotheses, thereby eventually sacrificing the diversity of the predictions. In contrast, our method relies on a novel learned scoring scheme underpinned by a mathematical framework based on Voronoi tessellations of the output space, from which we can derive a probabilistic interpretation. After empirically validating rMCL with experiments on synthetic data, we further assess its merits on the sound source localization problem, demonstrating its practical usefulness and the relevance of its interpretation.
@inproceedings{letzelter:hal-04216055,
  title = {Resilient Multiple Choice Learning: A learned scoring scheme with application to audio scene analysis},
  author = {Letzelter, Victor and Fontaine, Mathieu and Pérez, Patrick and Richard, Gael and Essid, Slim and Chen, Mickael},
  url = {https://hal.science/hal-04216055},
  booktitle = {Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS 2023)},
  address = {New Orleans, United States},
  year = {2023},
  month = dec,
  hal_id = {hal-04216055},
  hal_version = {v1}
}

SPEECH SELF-SUPERVISED REPRESENTATION BENCHMARKING: ARE WE DOING IT RIGHT?

S. Zaiem, T. Parcollet, and S. Essid

In Interspeech, Aug 2023

@inproceedings{zaiem:interspeech23,
  title = {Speech Self-Supervised Representation Benchmarking: Are We Doing it Right?},
  author = {Zaiem, Salah and Parcollet, Titouan and Essid, Slim},
  booktitle = {Interspeech},
  address = {Dublin, Ireland},
  year = {2023},
  month = aug,
}

AUTOMATIC DATA AUGMENTATION FOR DOMAIN ADAPTED FINE-TUNING OF SELF-SUPERVISED SPEECH REPRESENTATIONS

S. Zaiem, T. Parcollet, and S. Essid

In Interspeech, Aug 2023

Self-Supervised Learning (SSL) has allowed leveraging large amounts of unlabeled speech data to improve the performance of speech recognition models even with small annotated datasets. Despite this, speech SSL representations may fail when facing an acoustic mismatch between the pretraining and target datasets. To address this issue, we propose a novel supervised domain adaptation method, designed for cases exhibiting such a mismatch in acoustic domains. It consists of applying properly calibrated data augmentations on a large clean dataset, bringing it closer to the target domain, and using it as part of an initial fine-tuning stage. Augmentations are automatically selected through the minimization of a conditional-dependence estimator, based on the target dataset. The approach is validated during an oracle experiment with controlled distortions and on two amateur-collected low-resource domains, outperforming the baselines in both cases.
@inproceedings{zaiem:interspeech23b,
  title = {Automatic Data Augmentation for Domain Adapted Fine-Tuning of Self-Supervised Speech Representations},
  author = {Zaiem, Salah and Parcollet, Titouan and Essid, Slim},
  booktitle = {Interspeech},
  address = {Dublin, Ireland},
  year = {2023},
  month = aug
}

ONE-SHOT UNSUPERVISED DOMAIN ADAPTATION WITH PERSONALIZED DIFFUSION MODELS

Y. Benigmim, S. Roy, S. Essid, V. Kalogeiton, and S. Lathuiliere

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Jun 2023

Adapting a segmentation model from a labeled source domain to a target domain, where a single unlabeled datum is available, is one of the most challenging problems in domain adaptation and is otherwise known as one-shot unsupervised domain adaptation (OSUDA). Most of the prior works have addressed the problem by relying on style transfer techniques, where the source images are stylized to have the appearance of the target domain. Departing from the common notion of transferring only the target “texture” information, we leverage text-to-image diffusion models (e.g., Stable Diffusion) to generate a synthetic target dataset with photo-realistic images that not only faithfully depict the style of the target domain, but are also characterized by novel scenes in diverse contexts. The text interface in our method Data AugmenTation with diffUsion Models (DATUM) endows us with the possibility of guiding the generation of images towards desired semantic concepts while respecting the original spatial context of a single training image, which is not possible in existing OSUDA methods. Extensive experiments on standard benchmarks show that our DATUM surpasses the state-of-the-art OSUDA methods by up to +7.1%. The implementation is available at https://github.com/yasserben/DATUM
@inproceedings{Benigmim_2023_CVPR,
  author = {Benigmim, Yasser and Roy, Subhankar and Essid, Slim and Kalogeiton, Vicky and Lathuiliere, Stephane},
  title = {One-Shot Unsupervised Domain Adaptation With Personalized Diffusion Models},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  month = jun,
  year = {2023},
  pages = {698-708}
}

COSMOPOLITE SOUND MONITORING (COSMO): A STUDY OF URBAN SOUND EVENT DETECTION SYSTEMS GENERALIZING TO MULTIPLE CITIES

F. Angulo, S. Essid, G. Peeters, and C. Mietlicki

In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Jun 2023

@inproceedings{10095833,
  author = {Angulo, Florian and Essid, Slim and Peeters, Geoffroy and Mietlicki, Christophe},
  booktitle = {ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title = {Cosmopolite Sound Monitoring (CoSMo): A Study of Urban Sound Event Detection Systems Generalizing to Multiple Cities},
  year = {2023},
  pages = {1-5},
  doi = {10.1109/ICASSP49357.2023.10095833},
  abstract = {Measuring noise in cities and automatically identifying the corresponding sound sources are a crucial challenge for policymakers. Indeed, such information helps address noise pollution and improve the well-being of urban dwellers. In recent years, researchers have provided annotated datasets recorded in two major cities to foster the development of urban sound event detection (SED) systems. This paper presents an in-depth study of the behaviour of state-of-the-art SED systems well suited to our problem, combining three far-field real recordings datasets which can be used jointly during training. In our evaluation, we highlight the performance gaps existing between simple and hard recording examples based on the salience of sound events and the polyphony of the recordings. We provide new proximity annotations for this analysis. We evaluate the ability of urban SED systems to generalize across cities with varying degrees of training supervision. We show that such generalization is hindered mostly by the difficulties current urban SED systems have in detecting sound events with low salience along with sound events in highly polyphonic soundscapes.}
}

FINE-TUNING STRATEGIES FOR FASTER INFERENCE USING SPEECH SELF-SUPERVISED MODELS: A COMPARATIVE STUDY

S. Zaiem, R. Algayres, T. Parcollet, S. Essid, and M. Ravanelli

In ICASSP 2023 - International Conference on Acoustics, Speech, and Signal Processing, Jun 2023

Self-supervised learning (SSL) has allowed substantial progress in Automatic Speech Recognition (ASR) performance in low-resource settings. In this context, it has been demonstrated that larger self-supervised feature extractors are crucial for achieving lower downstream ASR error rates. Thus, better performance may come at the cost of longer inference times. This article explores different approaches that may be deployed during fine-tuning to reduce the computations needed in the SSL encoder, leading to faster inference. We adapt a number of existing techniques to common ASR settings and benchmark them, reporting performance drops and gains in inference time. Interestingly, we found that given enough downstream data, a simple downsampling of the input sequences outperforms the other methods with both low performance drops and high computational savings, reducing computations by 61.3% with a WER increase of only 0.81. Finally, we analyze the robustness of the comparison to changes in dataset conditions, revealing sensitivity to dataset size.
@inproceedings{zaiem:hal-04076307,
  title = {Fine-tuning strategies for faster inference using speech self-supervised models: a comparative study},
  author = {Zaiem, Salah and Algayres, Robin and Parcollet, Titouan and Essid, Slim and Ravanelli, Mirco},
  url = {https://hal.science/hal-04076307},
  booktitle = {ICASSP 2023 - International Conference on Acoustics, Speech, and Signal Processing},
  address = {Rhodes, Greece},
  year = {2023},
  month = jun,
  keywords = {Speech recognition ; Self-supervised learning},
  hal_id = {hal-04076307},
  hal_version = {v1}
}

2022

LATENT AND ADVERSARIAL DATA AUGMENTATION FOR SOUND EVENT DETECTION AND CLASSIFICATION

D. Perera, S. Essid, and G. Richard

In International workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), Nov 2022

Invariance-based learning is a promising approach in deep learning. Among other benefits, it can mitigate the lack of diversity of available datasets and increase the interpretability of trained models. To this end, practitioners often use a consistency cost penalizing the sensitivity of a model to a set of carefully selected data augmentations. However, there is no consensus about how these augmentations should be selected. In this paper, we study the behavior of several augmentation strategies. We consider the task of sound event detection and classification for our experiments. In particular, we show that transformations operating on the internal layers of a deep neural network are beneficial for this task.
@inproceedings{perera:hal-03782827,
  title = {{Latent and Adversarial Data Augmentation for Sound Event Detection and Classification}},
  author = {Perera, David and Essid, Slim and Richard, Ga{\"e}l},
  url = {https://hal.science/hal-03782827},
  booktitle = {{International workshop on Detection and Classification of Acoustic Scenes and Events (DCASE)}},
  address = {Nancy, France},
  year = {2022},
  month = nov,
  keywords = {sound event detection ; data augmentation ; adversarial learning},
  hal_id = {hal-03782827},
  hal_version = {v1}
}

IMPACT DE PERTURBATIONS INTERNES SUR L'ENTRAINEMENT DE RESEAUX PROFONDS POUR LA DETECTION D'EVENEMENTS SONORES

D. Perera, S. Essid, and G. Richard

In Colloque Francophone de Traitement du Signal et des Images (GRETSI) , Sep 2022

@inproceedings{perera:hal-03759651,
  title = {Impact de perturbations internes sur l'entrainement de reseaux profonds pour la detection d'evenements sonores},
  author = {Perera, David and Essid, Slim and Richard, Gael},
  url = {https://hal.telecom-paris.fr/hal-03759651},
  booktitle = {Colloque Francophone de Traitement du Signal et des Images (GRETSI)},
  address = {Nancy, France},
  year = {2022},
  month = sep,
  hal_id = {hal-03759651},
  hal_version = {v1}
}

PRETEXT TASKS SELECTION FOR MULTITASK SELF-SUPERVISED AUDIO REPRESENTATION LEARNING

S. Zaiem, T. Parcollet, S. Essid, and A. Heba

IEEE Journal of Selected Topics in Signal Processing, Sep 2022

Through solving pretext tasks, self-supervised learning leverages unlabeled data to extract useful latent representations replacing traditional input features in the downstream task. In audio/speech signal processing, a wide range of features were engineered through decades of research efforts. As it turns out, learning to predict such features (a.k.a. pseudo-labels) has proven to be a particularly relevant pretext task, leading to useful self-supervised representations which prove to be effective for downstream tasks. However, methods and common practices for combining such pretext tasks for better performance on the downstream task have not been explored and understood properly. In fact, the process relies almost exclusively on a computationally heavy experimental procedure, which becomes intractable as the number of pretext tasks increases. This paper introduces a method to select a group of pretext tasks among a set of candidates. The method we propose estimates calibrated weights for the partial losses corresponding to the considered pretext tasks during the self-supervised training process. The experiments conducted on automatic speech recognition, speaker and emotion recognition validate our approach, as the groups selected and weighted with our method perform better than classic baselines, thus facilitating the selection and combination of relevant pseudo-labels for self-supervised representation learning.
@article{9846981, author = {Zaiem, Salah and Parcollet, Titouan and Essid, Slim and Heba, Abdelwahab}, journal = {IEEE Journal of Selected Topics in Signal Processing}, title = {Pretext Tasks Selection for Multitask Self-Supervised Audio Representation Learning}, year = {2022}, volume = {16}, number = {6}, pages = {1439-1453}, doi = {10.1109/JSTSP.2022.3195430}, }

AUTOMATIC DATA AUGMENTATION SELECTION AND PARAMETRIZATION IN CONTRASTIVE SELF-SUPERVISED SPEECH REPRESENTATION LEARNING

S. Zaiem, T. Parcollet, and S. Essid

In Proc. Interspeech 2022 , Sep 2022

@inproceedings{zaiem22_interspeech,
  author = {Zaiem, Salah and Parcollet, Titouan and Essid, Slim},
  title = {Automatic Data Augmentation Selection and Parametrization in Contrastive Self-Supervised Speech Representation Learning},
  year = {2022},
  booktitle = {Proc. Interspeech 2022},
  pages = {669--673},
  doi = {10.21437/Interspeech.2022-10191},
}

LEARNING MULTI-LEVEL REPRESENTATIONS FOR HIERARCHICAL MUSIC STRUCTURE ANALYSIS

M. Buisson, B. McFee, S. Essid, and H. Crayencour

In International Society for Music Information Retrieval (ISMIR) , Dec 2022

Recent work in music structure analysis has shown the potential of deep features to highlight the underlying structure of music audio signals. Despite promising results achieved by such representations, dealing with the inherent hierarchical aspect of music structure remains a challenging problem. Because different levels of segmentation can be considered as equally valid, specifically designed representations should be optimized to improve hierarchical structure analysis. In this work, unsupervised learning of such representations using a contrastive approach operating at different timescales is explored. The proposed system is evaluated on flat and multi-level music segmentation. By leveraging both time and the hierarchical organization of music structure, we show that the obtained deep embeddings can encode meaningful patterns and improve segmentation at various levels of granularity.
@inproceedings{buisson:hal-03780032, address = {Bengaluru, India}, author = {Buisson, Morgan and McFee, Brian and Essid, Slim and Crayencour, Hélène C.}, booktitle = {International Society for Music Information Retrieval (ISMIR)}, hal_id = {hal-03780032}, hal_version = {v1}, month = dec, title = {Learning Multi-Level Representations for Hierarchical Music Structure Analysis}, url = {https://hal.archives-ouvertes.fr/hal-03780032}, year = {2022}, }

OPINIONS IN INTERACTIONS : NEW ANNOTATIONS OF THE SEMAINE DATABASE

V. Barrière, C. Clavel, and S. Essid

In LREC , Jun 2022

@inproceedings{barriere:hal-04276012,
  title = {{Opinions in Interactions : New Annotations of the SEMAINE Database}},
  author = {Barri{\`e}re, Valentin and Clavel, Chlo{\'e} and Essid, Slim},
  url = {https://hal.science/hal-04276012},
  booktitle = {{LREC}},
  address = {Marseille, France},
  year = {2022},
  month = jun,
  keywords = {Opinion ; Multimodal Machine Learning ; Interactions},
  hal_id = {hal-04276012},
  hal_version = {v1}
}

2021

DNN-BASED MASK ESTIMATION FOR DISTRIBUTED SPEECH ENHANCEMENT IN SPATIALLY UNCONSTRAINED MICROPHONE ARRAYS

N. Furnon, R. Serizel, S. Essid, and I. Illina

IEEE/ACM Transactions on Audio, Speech and Language Processing, Jun 2021

Deep neural network (DNN)-based speech enhancement algorithms in microphone arrays have now proven to be efficient solutions to speech understanding and speech recognition in noisy environments. However, in the context of ad-hoc microphone arrays, many challenges remain and raise the need for distributed processing. In this paper, we propose to extend a previously introduced distributed DNN-based time-frequency mask estimation scheme that can efficiently use spatial information in the form of so-called compressed signals, which are pre-filtered target estimations. We study the performance of this algorithm, named Tango, under realistic acoustic conditions and investigate practical aspects of its optimal application. We show that the nodes in the microphone array cooperate by taking advantage of their spatial coverage in the room. We also propose to use the compressed signals to convey not only the target estimation but also the noise estimation, in order to exploit the acoustic diversity recorded throughout the microphone array.
@article{furnon:hal-02985867, title = {DNN-based mask estimation for distributed speech enhancement in spatially unconstrained microphone arrays}, author = {Furnon, Nicolas and Serizel, Romain and Essid, Slim and Illina, Irina}, url = {https://hal.archives-ouvertes.fr/hal-02985867}, journal = {IEEE/ACM Transactions on Audio, Speech and Language Processing}, volume = {29}, pages = {2310--2323}, year = {2021}, doi = {10.1109/TASLP.2021.3092838}, hal_id = {hal-02985867}, hal_version = {v3}, }

USER-GUIDED ONE-SHOT DEEP MODEL ADAPTATION FOR MUSIC SOURCE SEPARATION

G. Cantisani, A. Ozerov, S. Essid, and G. Richard

In 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) , Oct 2021

@inproceedings{cantisani:hal-03219350,
  title = {User-guided one-shot deep model adaptation for music source separation},
  author = {Cantisani, Giorgia and Ozerov, Alexey and Essid, Slim and Richard, Gael},
  url = {https://hal.telecom-paris.fr/hal-03219350},
  booktitle = {2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
  year = {2021},
  month = oct,
  address = {New Paltz, USA},
  keywords = {Music Source Separation ; User-guided Source Separation ; One-shot Domain Adaptation},
  hal_id = {hal-03219350},
  hal_version = {v3}
}

ATTENTION-BASED DISTRIBUTED SPEECH ENHANCEMENT FOR UNCONSTRAINED MICROPHONE ARRAYS WITH VARYING NUMBER OF NODES

N. Furnon, R. Serizel, S. Essid, and I. Illina

In European Signal Processing Conference (EUSIPCO) , Aug 2021

@inproceedings{furnon:hal-03259801,
  title = {Attention-based distributed speech enhancement for unconstrained microphone arrays with varying number of nodes},
  author = {Furnon, Nicolas and Serizel, Romain and Essid, Slim and Illina, Irina},
  url = {https://hal.archives-ouvertes.fr/hal-03259801},
  booktitle = {European Signal Processing Conference (EUSIPCO)},
  address = {Dublin/ Virtual, Ireland},
  organization = {IEEE},
  year = {2021},
  month = aug,
  keywords = {Speech enhancement ; distributed processing ; attention mechanisms ; ad-hoc microphone arrays},
  hal_id = {hal-03259801},
  hal_version = {v1}
}

DISTRIBUTED SPEECH SEPARATION IN SPATIALLY UNCONSTRAINED MICROPHONE ARRAYS

N. Furnon, R. Serizel, I. Illina, and S. Essid

In ICASSP 2021 - 46th International Conference on Acoustics, Speech, and Signal Processing , Jun 2021

@inproceedings{furnon:hal-02985794,
  title = {Distributed speech separation in spatially unconstrained microphone arrays},
  author = {Furnon, Nicolas and Serizel, Romain and Illina, Irina and Essid, Slim},
  url = {https://hal.archives-ouvertes.fr/hal-02985794},
  booktitle = {ICASSP 2021 - 46th International Conference on Acoustics, Speech, and Signal Processing},
  address = {Toronto, Canada},
  year = {2021},
  month = jun,
  keywords = {Speech separation ; Microphone arrays ; Distributed processing},
  hal_id = {hal-02985794},
  hal_version = {v2}
}

NEURO-STEERED MUSIC SOURCE SEPARATION WITH EEG-BASED AUDITORY ATTENTION DECODING AND CONTRASTIVE-NMF

G. Cantisani, S. Essid, and G. Richard

In ICASSP 2021 - 46th International Conference on Acoustics, Speech, and Signal Processing , Jun 2021

We propose a novel informed music source separation paradigm, which can be referred to as neuro-steered music source separation. More precisely, the source separation process is guided by the user’s selective auditory attention decoded from his/her EEG response to the stimulus. This high-level prior information is used to select the desired instrument to isolate and to adapt the generic source separation model to the observed signal. To this aim, we leverage the fact that the attended instrument’s neural encoding is substantially stronger than that of the unattended sources left in the mixture. This "contrast" is extracted using an attention decoder and used to inform a source separation model based on non-negative matrix factorization named Contrastive-NMF. The results are promising and show that the EEG information can automatically select the desired source to enhance and improve the separation quality.
@inproceedings{cantisani:hal-02978978, title = {NEURO-STEERED MUSIC SOURCE SEPARATION WITH EEG-BASED AUDITORY ATTENTION DECODING AND CONTRASTIVE-NMF}, author = {Cantisani, Giorgia and Essid, Slim and Richard, Gael}, url = {https://hal.telecom-paris.fr/hal-02978978}, booktitle = {ICASSP 2021 - 46th International Conference on Acoustics, Speech, and Signal Processing}, address = {Toronto, Canada}, year = {2021}, month = jun, keywords = {Audio source separation ; Auditory attention decoding ; Polyphonic music ; EEG}, hal_id = {hal-02978978}, hal_version = {v4} }

CONDITIONAL INDEPENDENCE FOR PRETEXT TASK SELECTION IN SELF-SUPERVISED SPEECH REPRESENTATION LEARNING

S. Zaiem, T. Parcollet, and S. Essid

In Interspeech , Aug 2021

@inproceedings{zaiem21_interspeech,
  author = {Zaiem, Salah and Parcollet, Titouan and Essid, Slim},
  title = {Conditional Independence for Pretext Task Selection in Self-Supervised Speech Representation Learning},
  year = {2021},
  month = aug,
  booktitle = {Interspeech},
  pages = {2851--2855},
  doi = {10.21437/Interspeech.2021-1027},
}

2020

METHOD AND SYSTEM FOR BROADCASTING A MULTICHANNEL AUDIO STREAM TO TERMINALS OF SPECTATORS ATTENDING A SPORTS EVENT

R. Blouet, and S. Essid

Patent Application, Sep 2020

@article{SE:patent20,
  author = {Blouet, Raphael and Essid, Slim},
  title = {Method and System for Broadcasting a Multichannel Audio Stream to Terminals of Spectators Attending a Sports Event},
  year = {2020},
  month = sep,
  journal = {Patent Application},
  number = {US 2021/0014627 A1}
}

DNN-BASED DISTRIBUTED MULTICHANNEL MASK ESTIMATION FOR SPEECH ENHANCEMENT IN MICROPHONE ARRAYS

N. Furnon, R. Serizel, I. Illina, and S. Essid

In ICASSP 2020 - 45th International Conference on Acoustics, Speech, and Signal Processing , May 2020

@inproceedings{furnon:hal-02389159,
  title = {DNN-Based Distributed Multichannel Mask Estimation for Speech Enhancement in Microphone Arrays},
  author = {Furnon, Nicolas and Serizel, Romain and Illina, Irina and Essid, Slim},
  url = {https://hal.archives-ouvertes.fr/hal-02389159},
  booktitle = {ICASSP 2020 - 45th International Conference on Acoustics, Speech, and Signal Processing},
  address = {Barcelona, Spain},
  year = {2020},
  month = may,
  keywords = {Speech enhancement ; Distributed processing ; Microphone arrays},
  hal_id = {hal-02389159},
  hal_version = {v3}
}

2019

WEAKLY SUPERVISED REPRESENTATION LEARNING FOR AUDIO-VISUAL SCENE ANALYSIS

S. Parekh, S. Essid, A. Ozerov, N. Duong, P. Pérez, and G. Richard

IEEE/ACM Transactions on Audio, Speech, and Language Processing, May 2019

Audiovisual (AV) representation learning is an important task from the perspective of designing machines with the ability to understand complex events. To this end, we propose a novel multimodal framework that instantiates multiple instance learning. Specifically, we develop methods that identify events and localize corresponding AV cues in unconstrained videos. Importantly, this is done using weak labels where only video-level event labels are known without any information about their location in time. We show that the learnt representations are useful for performing several tasks such as event/object classification, audio event detection, audio source separation and visual object localization. An important feature of our method is its capacity to learn from unsynchronized audiovisual events. We also demonstrate our framework’s ability to separate out the audio source of interest through a novel use of nonnegative matrix factorization. State-of-the-art classification results, with an F1-score of 65.0, are achieved on DCASE 2017 smart cars challenge data with promising generalization to diverse object types such as musical instruments. Visualizations of localized visual regions and audio segments substantiate our system’s efficacy, especially when dealing with noisy situations where modality-specific cues appear asynchronously.
@article{8926380, author = {Parekh, S. and Essid, Slim and Ozerov, A. and Duong, N. Q. K. and Pérez, P. and Richard, G.}, journal = {IEEE/ACM Transactions on Audio, Speech, and Language Processing}, title = {Weakly Supervised Representation Learning for Audio-Visual Scene Analysis}, year = {2019}, volume = {28}, pages = {416--428}, }

ON-THE-FLY DETECTION OF USER ENGAGEMENT DECREASE IN SPONTANEOUS HUMAN-ROBOT INTERACTION

A. Ben Youssef, G. Varni, S. Essid, and C. Clavel

International Journal of Social Robotics, Jan 2019

@article{benyoussef:hal-02288044,
  title = {On-the-fly Detection of User Engagement Decrease in Spontaneous Human-Robot Interaction},
  author = {Ben Youssef, Atef and Varni, Giovanna and Essid, Slim and Clavel, Chloe},
  url = {https://hal.telecom-paris.fr/hal-02288044},
  journal = {International Journal of Social Robotics},
  hal_local_reference = {ABY:IJSR-2019},
  year = {2019},
  month = jan,
  keywords = {User engagement decrease ; Socially assistive robot ; HRI in public space ; Real-time detection},
  hal_id = {hal-02288044},
  hal_version = {v1}
}

A MULTIMODAL MOVIE REVIEW CORPUS FOR FINE-GRAINED OPINION MINING

A. Garcia, S. Essid, F. d'Alché-Buc, and C. Clavel

Jan 2019

@techreport{DBLP:journals/corr/abs-1902-10102,
  author = {Garcia, Alexandre and Essid, Slim and d'Alché-Buc, Florence and Clavel, Chloe},
  title = {A multimodal movie review corpus for fine-grained opinion mining},
  volume = {abs/1902.10102},
  year = {2019},
  url = {http://arxiv.org/abs/1902.10102},
  archiveprefix = {arXiv},
  eprint = {1902.10102},
  timestamp = {Fri, 24 May 2019 10:20:38 +0200},
  biburl = {https://dblp.org/rec/bib/journals/corr/abs-1902-10102},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

FROM THE TOKEN TO THE REVIEW: A HIERARCHICAL MULTIMODAL APPROACH TO OPINION MINING

A. Garcia, P. Colombo, F. d'Alché-Buc, S. Essid, and C. Clavel

In 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing , Nov 2019

@inproceedings{Garcia2019,
  author = {Garcia, Alexandre and Colombo, Pierre and d'Alché-Buc, Florence and Essid, Slim and Clavel, Chloe},
  booktitle = {2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing},
  title = {From the Token to the Review: A Hierarchical Multimodal approach to Opinion Mining},
  year = {2019},
  month = nov,
  address = {Hong Kong, China}
}

MAD-EEG: AN EEG DATASET FOR DECODING AUDITORY ATTENTION TO A TARGET INSTRUMENT IN POLYPHONIC MUSIC

G. Cantisani, G. Tregoat, S. Essid, and G. Richard

In Speech, Music and Mind (SMM19), Satellite workshop of Interspeech 2019 , Nov 2019

@inproceedings{Cantisani2019b,
  author = {Cantisani, Giorgia and Tregoat, Gabriel and Essid, Slim and Richard, Gael},
  booktitle = {Speech, Music and Mind (SMM19), Satellite workshop of Interspeech 2019},
  title = {MAD-EEG: an EEG dataset for decoding auditory attention to a target instrument in polyphonic music},
  year = {2019},
  address = {Vienna, Austria}
}

IDENTIFY, LOCATE AND SEPARATE: AUDIO-VISUAL OBJECT EXTRACTION IN LARGE VIDEO COLLECTIONS USING WEAK SUPERVISION

S. Parekh, A. Ozerov, S. Essid, N. Duong, P. Perez, and G. Richard

In 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) , Oct 2019

@inproceedings{Parekh2019,
  author = {Parekh, Sanjeel and Ozerov, Alexey and Essid, Slim and Duong, Ngoc Q. K. and Perez, Patrick and Richard, Gael},
  booktitle = {2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
  title = {Identify, Locate and Separate: Audio-visual Object Extraction in Large Video Collections Using Weak Supervision},
  year = {2019},
  month = oct,
  address = {New Paltz, USA}
}

EEG-BASED DECODING OF AUDITORY ATTENTION TO A TARGET INSTRUMENT IN POLYPHONIC MUSIC

G. Cantisani, S. Essid, and G. Richard

In 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) , Oct 2019

@inproceedings{Cantisani2019,
  author = {Cantisani, Giorgia and Essid, Slim and Richard, Gael},
  booktitle = {2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
  title = {EEG-based Decoding of Auditory Attention to a Target Instrument in Polyphonic Music},
  year = {2019},
  address = {New Paltz, USA},
  month = oct
}

SAMBASET: A DATASET OF HISTORICAL SAMBA DE ENREDO RECORDINGS FOR COMPUTATIONAL MUSIC ANALYSIS

L. Maia, M. Fuentes, L. Biscainho, M. Rocamora, and S. Essid

In The 20th International Society for Music Information Retrieval Conference , Nov 2019

@inproceedings{Maia2019,
  author = {Maia, Lucas S. and Fuentes, Magdalena and Biscainho, Luiz W. P. and Rocamora, Martin and Essid, Slim},
  booktitle = {The 20th International Society for Music Information Retrieval Conference},
  title = {SAMBASET: A Dataset of Historical Samba de Enredo Recordings for Computational Music Analysis},
  month = nov,
  address = {Delft, The Netherlands},
  year = {2019}
}

TRACKING BEATS AND MICROTIMING IN AFRO-LATIN AMERICAN MUSIC USING CONDITIONAL RANDOM FIELDS AND DEEP LEARNING

M. Fuentes, L. Maia, M. Rocamora, L. Biscainho, H. Crayencour, S. Essid, and J. Bello

In The 20th International Society for Music Information Retrieval Conference , Nov 2019

@inproceedings{Fuentes2019b,
  author = {Fuentes, Magdalena and Maia, Lucas S. and Rocamora, Martin and Biscainho, Luiz W. P. and Crayencour, Helene C. and Essid, Slim and Bello, Juan Pablo},
  booktitle = {The 20th International Society for Music Information Retrieval Conference},
  title = {Tracking Beats and Microtiming in Afro-Latin American Music Using Conditional Random Fields and Deep Learning},
  month = nov,
  address = {Delft, The Netherlands},
  year = {2019}
}

A MUSIC STRUCTURE INFORMED DOWNBEAT TRACKING SYSTEM USING SKIP-CHAIN CONDITIONAL RANDOM FIELDS AND DEEP LEARNING

M. Fuentes, B. McFee, H. Crayencour, S. Essid, and J. Bello

In IEEE International Conference on Acoustics, Speech and Signal processing , May 2019

@inproceedings{Fuentes2019,
  author = {Fuentes, Magdalena and McFee, Brian and Crayencour, Helene and Essid, Slim and Bello, Juan Pablo},
  booktitle = {IEEE International Conference on Acoustics, Speech and Signal processing},
  title = {A Music Structure Informed Downbeat Tracking System Using Skip-Chain Conditional Random Fields and Deep Learning},
  month = may,
  address = {Brighton, UK},
  year = {2019}
}

AUDIOVISUAL ANALYSIS OF MUSIC PERFORMANCES: OVERVIEW OF AN EMERGING FIELD

Z. Duan, S. Essid, C. Liem, G. Richard, and G. Sharma

IEEE Signal Processing Magazine, Jan 2019

@article{Duan2019,
  author = {Duan, Z. and Essid, S. and Liem, C. C. S. and Richard, G. and Sharma, G.},
  journal = {IEEE Signal Processing Magazine},
  title = {Audiovisual Analysis of Music Performances: Overview of an Emerging Field},
  year = {2019},
  volume = {36},
  number = {1},
  pages = {63-73},
  keywords = {audio signal processing;music;audiovisual analysis;audio modality;music recordings;audio-only media;automated music analysis;audio signals;acoustic music rendering;Music;Visualization;Task analysis;Instruments;Multiple signal classification;Signal processing;Cameras;Acoustics},
  doi = {10.1109/MSP.2018.2875511},
  issn = {1053-5888},
  month = jan
}

EARLY DETECTION OF USER ENGAGEMENT BREAKDOWN IN SPONTANEOUS HUMAN-HUMANOID INTERACTION

A. Ben Youssef, C. Clavel, and S. Essid

IEEE Transactions on Affective Computing, Jan 2019

@article{BenYoussef2019,
  author = {Ben Youssef, A and Clavel, C and Essid, S},
  doi = {10.1109/TAFFC.2019.2898399},
  issn = {1949-3045},
  journal = {IEEE Transactions on Affective Computing},
  keywords = {Robots;Electric breakdown;Feature extraction;Predi},
  pages = {1},
  title = {Early Detection of User Engagement Breakdown in Spontaneous Human-Humanoid Interaction},
  year = {2019}
}

2018

PROCEDE ET SYSTEME DE DIFFUSION D'UN FLUX AUDIO MULTICANAL A DES TERMINAUX DE SPECTATEURS ASSISTANT A UN EVENEMENT SPORTIF

R. Blouet, and S. Essid

Patent Application, Mar 2018

@article{SE:patent18,
  author = {Blouet, Raphael and Essid, Slim},
  title = {Procede et Systeme de Diffusion d'un Flux Audio Multicanal a des terminaux de spectateurs assistant a un evenement sportif},
  year = {2018},
  month = mar,
  journal = {Patent Application},
  number = {1852774}
}

MEDLEY-SOLOS-DB: A CROSS-COLLECTION DATASET FOR MUSICAL INSTRUMENT RECOGNITION

V. Lostanlen, C. Cella, R. Bittner, and S. Essid

Sep 2018

@misc{lostanlen_vincent_2018_1344103,
  author = {Lostanlen, Vincent and Cella, Carmine-Emanuele and Bittner, Rachel and Essid, Slim},
  title = {Medley-solos-DB: a cross-collection dataset for musical instrument recognition},
  month = sep,
  year = {2018},
  doi = {10.5281/zenodo.1344103},
  url = {https://doi.org/10.5281/zenodo.1344103}
}

EEG-BASED INTER-SUBJECT CORRELATION SCHEMES IN A STIMULI-SHARED FRAMEWORK: INTERPLAY WITH VALENCE AND AROUSAL

A. Hajlaoui, M. Chetouani, and S. Essid

Sep 2018

@article{Hajlaoui2018,
  archiveprefix = {arXiv},
  arxivid = {1809.08273},
  author = {Hajlaoui, Ayoub and Chetouani, Mohamed and Essid, Slim},
  eprint = {1809.08273},
  month = sep,
  title = {EEG-based Inter-Subject Correlation Schemes in a Stimuli-Shared Framework: Interplay with Valence and Arousal},
  url = {https://arxiv.org/abs/1809.08273},
  year = {2018}
}

ANALYSIS OF COMMON DESIGN CHOICES IN DEEP LEARNING SYSTEMS FOR DOWNBEAT TRACKING

M. Fuentes, B. McFee, H. Crayencour, S. Essid, and J. Bello

In Proceedings of the 19th International Society for Music Information Retrieval Conference , Sep 2018

@inproceedings{Fuentes2018,
  address = {Paris, France},
  author = {Fuentes, Magdalena and McFee, Brian and Crayencour, Helene C and Essid, Slim and Bello, Juan Pablo},
  booktitle = {Proceedings of the 19th International Society for Music Information Retrieval Conference},
  doi = {10.5281/zenodo.1492355},
  month = sep,
  pages = {106--112},
  publisher = {ISMIR},
  title = {Analysis of Common Design Choices in Deep Learning Systems for Downbeat Tracking},
  url = {https://doi.org/10.5281/zenodo.1492355},
  year = {2018}
}

MAIN MELODY ESTIMATION WITH SOURCE-FILTER NMF AND CRNN

D. Basaran, S. Essid, and G. Peeters

In Proceedings of the 19th International Society for Music Information Retrieval Conference , Sep 2018

@inproceedings{Basaran2018,
  address = {Paris, France},
  author = {Basaran, Dogac and Essid, Slim and Peeters, Geoffroy},
  booktitle = {Proceedings of the 19th International Society for Music Information Retrieval Conference},
  doi = {10.5281/zenodo.1492349},
  month = sep,
  pages = {82--89},
  publisher = {ISMIR},
  title = {Main Melody Estimation with Source-Filter NMF and CRNN},
  url = {https://doi.org/10.5281/zenodo.1492349},
  year = {2018}
}

A ROBUST AUDIO CLASSIFICATION SYSTEM FOR DETECTING PULMONARY EDEMA

K. Hong, S. Essid, W. Ser, and D. Foo

Biomedical Signal Processing and Control, Sep 2018

In this paper we present a robust audio classification system to efficiently detect pulmonary edema. The system uses a feature learning technique based on non-negative matrix factorization (NMF), and the learned features are then classified with logistic regression. A study was done to compare feature engineering approaches with feature selection techniques against NMF. Different NMF schemes were investigated and also compared with Principal Component Analysis. NMF achieved a 95% F1 score, which was superior to feature engineering techniques, whose scores ranged from 83% to 93%. Background noise collected from hospitals and speech from a speech corpus database were used to simulate noisy data. The system was then tested using noisy data. The best NMF scheme scored 74%, while other feature engineering techniques scored lower, from 66% to 71%. NMF was also used as a signal enhancement tool. It improved the F1 score to 77%. Lastly, only inhalations from breath sounds were considered and this further improved classification results to 86%. The proposed robust classification system using NMF thus proved to be an effective method for audio-based detection of pulmonary edema. If implemented in real-time, the proposed system can be used as a screening tool.
@article{Hong2018, author = {Hong, K. J. and Essid, S. and Ser, W. and Foo, D. C.G.}, doi = {10.1016/j.bspc.2018.07.004}, issn = {1746-8108}, journal = {Biomedical Signal Processing and Control}, keywords = {Biomedical signal processing,Feature learning,Non-negative matrix factorization,Pulmonary edema,Robust testing}, title = {A robust audio classification system for detecting pulmonary edema}, year = {2018} }

MULTI-TASK FEATURE LEARNING FOR EEG-BASED EMOTION RECOGNITION USING GROUP NONNEGATIVE MATRIX FACTORIZATION

A. Hajlaoui, M. Chetouani, and S. Essid

In The European Signal Processing Conference (EUSIPCO) , Sep 2018

@inproceedings{Hajlaoui2019,
  address = {Rome, Italy},
  author = {Hajlaoui, Ayoub and Chetouani, Mohamed and Essid, Slim},
  booktitle = {The European Signal Processing Conference (EUSIPCO)},
  title = {Multi-task Feature Learning for EEG-based Emotion Recognition Using Group Nonnegative Matrix Factorization},
  month = sep,
  year = {2018}
}

STRUCTURED OUTPUT LEARNING WITH ABSTENTION: APPLICATION TO ACCURATE OPINION PREDICTION

A. Garcia, S. Essid, C. Clavel, and F. d'Alché-Buc

In International Conference on Machine Learning (ICML) , Jul 2018

Motivated by Supervised Opinion Analysis, we propose a novel framework devoted to Structured Output Learning with Abstention (SOLA). The structure prediction model is able to abstain from predicting some labels in the structured output at a cost chosen by the user in a flexible way. For that purpose, we decompose the problem into the learning of a pair of predictors, one devoted to structured abstention and the other, to structured output prediction. To compare fully labeled training data with predictions potentially containing abstentions, we define a wide class of asymmetric abstention-aware losses. Learning is achieved by surrogate regression in an appropriate feature space while prediction with abstention is performed by solving a new pre-image problem. Thus, SOLA extends recent ideas about Structured Output Prediction via surrogate problems and calibration theory and enjoys statistical guarantees on the resulting excess risk. Instantiated on a hierarchical abstention-aware loss, SOLA is shown to be relevant for fine-grained opinion mining and gives state-of-the-art results on this task. Moreover, the abstention-aware representations can be used to competitively predict user-review ratings based on a sentence-level opinion predictor.
@inproceedings{Garcia2018b, author = {Garcia, Alexandre and Essid, Slim and Clavel, Chloe and d'Alché-Buc, Florence}, month = jul, title = {Structured Output Learning with Abstention: Application to Accurate Opinion Prediction}, booktitle = {International Conference on Machine Learning (ICML)}, year = {2018}, address = {Stockholm, Sweden} }

WEAKLY SUPERVISED REPRESENTATION LEARNING FOR UNSYNCHRONIZED AUDIO-VISUAL EVENTS

S. Parekh, S. Essid, A. Ozerov, Q. Duong, P. Perez, and G. Richard

In CVPR Workshop on Sight and Sound (WSS) , Jun 2018

@inproceedings{Parekh2018b,
  author = {Parekh, Sanjeel and Essid, Slim and Ozerov, Alexey and Duong, Quang-Khanh-Ngoc and Perez, Patrick and Richard, Gael},
  booktitle = {CVPR Workshop on Sight and Sound (WSS)},
  month = jun,
  title = {Weakly Supervised Representation Learning for Unsynchronized Audio-Visual Events},
  year = {2018},
  address = {Salt Lake City, USA},
}

WEAKLY SUPERVISED REPRESENTATION LEARNING FOR UNSYNCHRONIZED AUDIO-VISUAL EVENTS

S. Parekh, S. Essid, A. Ozerov, Q. Duong, P. Perez, and G. Richard

Apr 2018

Audio-visual representation learning is an important task from the perspective of designing machines with the ability to understand complex events. To this end, we propose a novel multimodal framework that instantiates multiple instance learning. We show that the learnt representations are useful for classifying events and localizing their characteristic audio-visual elements. The system is trained using only video-level event labels without any timing information. An important feature of our method is its capacity to learn from unsynchronized audio-visual events. We achieve state-of-the-art results on a large-scale dataset of weakly-labeled audio event videos. Visualizations of localized visual regions and audio segments substantiate our system's efficacy, especially when dealing with noisy situations where modality-specific cues appear asynchronously.

@techreport{Parekh2018,
  archiveprefix = {arXiv},
  arxivid = {1804.07345},
  number = {arXiv:1804.07345},
  author = {Parekh, Sanjeel and Essid, Slim and Ozerov, Alexey and Duong, Quang-Khanh-Ngoc and Perez, Patrick and Richard, Gael},
  eprint = {1804.07345},
  month = apr,
  title = {Weakly Supervised Representation Learning for Unsynchronized Audio-Visual Events},
  url = {http://arxiv.org/abs/1804.07345},
  year = {2018}
}

STRUCTURED OUTPUT LEARNING WITH ABSTENTION: APPLICATION TO ACCURATE OPINION PREDICTION

A. Garcia, S. Essid, C. Clavel, and F. d'Alche-Buc

Mar 2018

Abs Bib

Motivated by Supervised Opinion Analysis, we propose a novel framework devoted to Structured Output Learning with Abstention (SOLA). The structure prediction model is able to abstain from predicting some labels in the structured output at a cost chosen by the user in a flexible way. For that purpose, we decompose the problem into the learning of a pair of predictors, one devoted to structured abstention and the other, to structured output prediction. To compare fully labeled training data with predictions potentially containing abstentions, we define a wide class of asymmetric abstention-aware losses. Learning is achieved by surrogate regression in an appropriate feature space while prediction with abstention is performed by solving a new pre-image problem. Thus, SOLA extends recent ideas about Structured Output Prediction via surrogate problems and calibration theory and enjoys statistical guarantees on the resulting excess risk. Instantiated on a hierarchical abstention-aware loss, SOLA is shown to be relevant for fine-grained opinion mining and gives state-of-the-art results on this task. Moreover, the abstention-aware representations can be used to competitively predict user-review ratings based on a sentence-level opinion predictor.

@techreport{Garcia2018,
  archiveprefix = {arXiv},
  arxivid = {1803.08355},
  number = {arXiv:1803.08355},
  author = {Garcia, Alexandre and Essid, Slim and Clavel, Chloe and d'Alche-Buc, Florence},
  eprint = {1803.08355},
  month = mar,
  title = {Structured Output Learning with Abstention: Application to Accurate Opinion Prediction},
  url = {http://arxiv.org/abs/1803.08355},
  year = {2018}
}

AN ENSEMBLE LEARNING APPROACH TO DETECT EPILEPTIC SEIZURES FROM LONG INTRACRANIAL EEG RECORDINGS

J. Schiratti, J. Le Douget, M. Le Van Quyen, S. Essid, and A. Gramfort

In International Conference on Acoustics, Speech, and Signal Processing (ICASSP) , Apr 2018

Bib PDF

@inproceedings{schiratti:hal-01724272,
  title = {An Ensemble Learning Approach to Detect Epileptic Seizures from Long Intracranial EEG Recordings},
  author = {Schiratti, Jean-Baptiste and Le Douget, Jean-Eudes and Le Van Quyen, Michel and Essid, Slim and Gramfort, Alexandre},
  url = {https://hal.archives-ouvertes.fr/hal-01724272},
  booktitle = {International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
  address = {Calgary, Canada},
  year = {2018},
  month = apr,
  keywords = {Intracranial EEG ; Epilepsy and seizures ; Machine learning},
}

ATTITUDE CLASSIFICATION IN ADJACENCY PAIRS OF A HUMAN-AGENT INTERACTION WITH HIDDEN CONDITIONAL RANDOM FIELDS

V. Barriere, C. Clavel, and S. Essid

In International Conference on Acoustics, Speech, and Signal Processing (ICASSP) , Apr 2018

Bib

@inproceedings{BARRIERE-18,
  author = {Barriere, Valentin and Clavel, Chloe and Essid, Slim},
  title = {Attitude Classification in Adjacency Pairs of a Human-Agent Interaction with Hidden Conditional Random Fields},
  booktitle = {International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
  address = {Calgary, Canada},
  year = {2018},
  month = apr,
  annote = {category=inproceedings language=en audience=2 state=toappear dept=ids group=s2a id=17744}
}

METHOD FOR AUDIO-VISUAL EVENTS CLASSIFICATION AND LOCALIZATION, AND CORRESPONDING APPARATUS, COMPUTER READABLE PROGRAM, PRODUCT AND COMPUTER READABLE STORAGE MEDIUM

S. Parekh, S. Essid, A. Ozerov, Q. Duong, P. Perez, and G. Richard

Patent Application, Apr 2018

Bib

@article{SP:brevet18,
  author = {Parekh, Sanjeel and Essid, Slim and Ozerov, Alexey and Duong, Quang-Khanh-Ngoc and Perez, Patrick and Richard, Gael},
  title = {Method for audio-visual events classification and localization, and corresponding apparatus, computer readable program, product and computer readable storage medium},
  year = {2018},
  month = apr,
  journal = {Patent Application},
  number = {180049},
  annote = {category=brevet language=en state=published dept=ids group=s2a id=18015}
}

2017

MATRIX CO-FACTORISATION AND APPLICATIONS TO MUSIC ANALYSIS

S. Essid

In Machine Learning for Music Discovery Workshop, International Conference on Machine Learning (ICML) 2017 , Aug 2017

Bib PDF

@inproceedings{SE:ICML17,
  author = {Essid, Slim},
  title = {Matrix Co-Factorisation and Applications to Music Analysis},
  booktitle = {Machine Learning for Music Discovery Workshop, International Conference on Machine Learning (ICML) 2017},
  address = {Sydney, Australia},
  year = {2017},
  month = aug,
  annote = {category=invite language=en audience=1 state=published dept=ids group=s2a id=18054},
}

NONNEGATIVE FEATURE LEARNING METHODS FOR ACOUSTIC SCENE CLASSIFICATION

V. Bisot, R. Serizel, S. Essid, and G. Richard

In DCASE 2017 - Workshop on Detection and Classification of Acoustic Scenes and Events , Nov 2017

Bib PDF

@inproceedings{bisot:hal-01636627,
  author = {Bisot, Victor and Serizel, Romain and Essid, Slim and Richard, Gael},
  title = {Nonnegative Feature Learning Methods for Acoustic Scene Classification},
  booktitle = {DCASE 2017 - Workshop on Detection and Classification of Acoustic Scenes and Events},
  address = {Munich, Germany},
  year = {2017},
  month = nov,
  keywords = {Feature learning;Nonnegative Matrix Factorization;Deep Neural Networks},
  annote = {category=inproceedings language=en audience=2 state=published dept=ids group=s2a documentURL=http://biblio.telecom-paristech.fr/cgi-bin/download.cgi?id=17481 id=17481},
}

METHOD FOR PROCESSING AN INPUT AUDIO SIGNAL AND CORRESPONDING ELECTRONIC DEVICE

S. Parekh, S. Essid, A. Ozerov, Q. Duong, P. Perez, and G. Richard

Patent Application, Nov 2017

Bib

@article{SP:brevet17a,
  author = {Parekh, Sanjeel and Essid, Slim and Ozerov, Alexey and Duong, Quang-Khanh-Ngoc and Perez, Patrick and Richard, Gael},
  title = {Method for Processing an input audio signal and corresponding electronic device},
  year = {2017},
  journal = {Patent Application},
  number = {170054},
  annote = {category=article language=en state=published dept=ids group=s2a id=18014}
}

COMPUTATIONAL ANALYSIS OF SOUND SCENES AND EVENTS

R. Serizel, V. Bisot, S. Essid, and G. Richard

Nov 2017

Bib

@inbook{Serizel2017,
  author = {Serizel, Romain and Bisot, Victor and Essid, Slim and Richard, Gael},
  chapter = {Acoustic Features for Environmental Sound Analysis},
  title = {Computational Analysis of Sound Scenes and Events},
  year = {2017},
  editor = {Virtanen, Tuomas and Ellis, Dan and Plumbley, Mark},
  publisher = {Springer International Publishing AG}
}

COMPUTATIONAL ANALYSIS OF SOUND SCENES AND EVENTS

S. Essid, S. Parekh, Q. Duong, A. Ozerov, and R. Serizel

Nov 2017

Bib

@inbook{Essid2017,
  author = {Essid, Slim and Parekh, Sanjeel and Duong, Quang-Khanh-Ngoc and Ozerov, Alexey and Serizel, Romain},
  chapter = {Multiview Approaches to Event Detection and Scene Analysis},
  title = {Computational Analysis of Sound Scenes and Events},
  year = {2017},
  editor = {Virtanen, Tuomas and Ellis, Dan and Plumbley, Mark},
  publisher = {Springer International Publishing AG}
}

UE-HRI: A NEW DATASET FOR THE STUDY OF USER ENGAGEMENT IN SPONTANEOUS HUMAN-ROBOT INTERACTIONS

A. Ben Youssef, C. Clavel, S. Essid, M. Bilac, M. Chamoux, and A. Lim

In ACM International Conference on Multimodal Interaction , Nov 2017

Bib PDF

@inproceedings{Benyoussef2017,
  author = {Ben Youssef, Atef and Clavel, Chloe and Essid, Slim and Bilac, Miriam and Chamoux, Marine and Lim, Angelica},
  booktitle = {ACM International Conference on Multimodal Interaction},
  title = {UE-HRI: A New Dataset for the Study of User Engagement in Spontaneous Human-Robot Interactions},
  year = {2017},
  month = nov,
  address = {Glasgow, Scotland},
}

LEVERAGING DEEP NEURAL NETWORKS WITH NONNEGATIVE REPRESENTATIONS FOR IMPROVED ENVIRONMENTAL SOUND CLASSIFICATION

V. Bisot, R. Serizel, S. Essid, and G. Richard

In IEEE International Workshop on Machine Learning for Signal Processing (MLSP) , Sep 2017

Bib PDF

@inproceedings{Bisot2017c,
  author = {Bisot, Victor and Serizel, Romain and Essid, Slim and Richard, Gael},
  booktitle = {IEEE International Workshop on Machine Learning for Signal Processing (MLSP)},
  title = {Leveraging Deep Neural Networks with Nonnegative Representations for Improved Environmental Sound Classification},
  year = {2017},
  month = sep,
  address = {Tokyo, Japan},
}

GUIDING AUDIO SOURCE SEPARATION BY VIDEO OBJECT INFORMATION

S. Parekh, S. Essid, A. Ozerov, Q. Duong, P. Perez, and G. Richard

In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) , Oct 2017

Bib PDF

@inproceedings{Parekh2017b,
  address = {New Paltz, New York, U.S.A},
  author = {Parekh, Sanjeel and Essid, Slim and Ozerov, Alexey and Duong, Quang-Khanh-Ngoc and Perez, Patrick and Richard, Gael},
  booktitle = {IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
  month = oct,
  title = {Guiding Audio Source Separation by Video Object Information},
  year = {2017}
}

EMOEEG: A NEW MULTIMODAL DATASET FOR DYNAMIC EEG-BASED EMOTION RECOGNITION WITH AUDIOVISUAL ELICITATION

A. Conneau, A. Hajlaoui, M. Chetouani, and S. Essid

In The European Signal Processing Conference (EUSIPCO) , Oct 2017

Bib PDF

@inproceedings{Conneau2017,
  address = {Kos island, Greece},
  author = {Conneau, Anne-Claire and Hajlaoui, Ayoub and Chetouani, Mohamed and Essid, Slim},
  booktitle = {The European Signal Processing Conference (EUSIPCO)},
  title = {EMOEEG: a New Multimodal Dataset for Dynamic EEG-based Emotion Recognition with Audiovisual Elicitation},
  year = {2017}
}

OPINION DYNAMICS MODELING FOR MOVIE REVIEW TRANSCRIPTS

V. Barriere, C. Clavel, and S. Essid

In Interspeech , Oct 2017

Bib PDF

@inproceedings{Barriere2017,
  address = {Stockholm, Sweden},
  author = {Barriere, Valentin and Clavel, Chloe and Essid, Slim},
  booktitle = {Interspeech},
  title = {Opinion Dynamics Modeling for Movie Review Transcripts},
  year = {2017},
}

OVERLAPPING SOUND EVENT DETECTION WITH SUPERVISED NONNEGATIVE MATRIX FACTORIZATION

V. Bisot, S. Essid, and G. Richard

In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , Oct 2017

Bib PDF

@inproceedings{Bisot2017b,
  address = {New Orleans, USA},
  author = {Bisot, Victor and Essid, Slim and Richard, Gael},
  booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title = {Overlapping Sound Event Detection with Supervised Nonnegative Matrix Factorization},
  year = {2017},
}

FEATURE LEARNING WITH MATRIX FACTORIZATION APPLIED TO ACOUSTIC SCENE CLASSIFICATION

V. Bisot, R. Serizel, S. Essid, and G. Richard

IEEE Transactions on Audio, Speech, and Language Processing (TASLP), Oct 2017

Bib PDF

@article{Bisot2017,
  author = {Bisot, Victor and Serizel, Romain and Essid, Slim and Richard, Gael},
  journal = {IEEE Transactions on Audio, Speech, and Language Processing (TASLP)},
  title = {Feature Learning with Matrix Factorization Applied to Acoustic Scene Classification},
  year = {2017},
}

SUPERVISED GROUP NONNEGATIVE MATRIX FACTORISATION WITH SIMILARITY CONSTRAINTS AND APPLICATIONS TO SPEAKER IDENTIFICATION

R. Serizel, V. Bisot, S. Essid, and G. Richard

In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , Oct 2017

Bib PDF

@inproceedings{Serizel2018,
  address = {New Orleans, USA},
  author = {Serizel, Romain and Bisot, Victor and Essid, Slim and Richard, Gael},
  booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title = {Supervised Group Nonnegative Matrix Factorisation with Similarity Constraints and Applications to Speaker Identification},
  year = {2017},
}

MOTION INFORMED AUDIO SOURCE SEPARATION

S. Parekh, S. Essid, A. Ozerov, Q. Duong, P. Perez, and G. Richard

In IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) , Oct 2017

Bib PDF

@inproceedings{Parekh2017,
  address = {New Orleans, USA},
  author = {Parekh, Sanjeel and Essid, Slim and Ozerov, Alexey and Duong, Quang-Khanh-Ngoc and Perez, Patrick and Richard, Gael},
  booktitle = {IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP)},
  title = {Motion Informed Audio Source Separation},
  year = {2017},
}

2016

DISPOSITIF A CASQUE AUDIO PERFECTIONNE

S. Essid, and R. Blouet

Patent Application, Nov 2016

Bib PDF

@article{SE:patent16,
  author = {Essid, Slim and Blouet, Raphael},
  title = {Dispositif a Casque Audio Perfectionne},
  year = {2016},
  month = nov,
  journal = {Patent Application},
  number = {1661324}
}

SUPERVISED NONNEGATIVE MATRIX FACTORIZATION FOR ACOUSTIC SCENE CLASSIFICATION

V. Bisot, R. Serizel, S. Essid, and G. Richard

In IEEE international evaluation campaign on detection and classification of acoustic scenes and events (DCASE 2016) , Sep 2016

Bib PDF

@inproceedings{Bisot2016b,
  author = {Bisot, Victor and Serizel, Romain and Essid, Slim and Richard, Gael},
  keywords = {Acoustic Scene Classification ; Feature learning ;},
  booktitle = {IEEE international evaluation campaign on detection and classification of acoustic scenes and events (DCASE 2016)},
  month = sep,
  title = {Supervised nonnegative matrix factorization for acoustic scene classification},
  year = {2016},
}

DOWNBEAT DETECTION WITH CONDITIONAL RANDOM FIELDS AND DEEP LEARNED FEATURES

S. Durand, and S. Essid

In The 17th International Society for Music Information Retrieval Conference (ISMIR) , Aug 2016

Bib PDF

@inproceedings{SD:ISMIR-16,
  title = {Downbeat Detection with Conditional Random Fields and Deep Learned Features},
  author = {Durand, Simon and Essid, Slim},
  booktitle = {The 17th International Society for Music Information Retrieval Conference (ISMIR)},
  month = aug,
  year = {2016},
  address = {New York City, USA},
}

MINI-BATCH STOCHASTIC APPROACHES FOR ACCELERATED MULTIPLICATIVE UPDATES IN NONNEGATIVE MATRIX FACTORISATION WITH BETA-DIVERGENCE

R. Serizel, S. Essid, and G. Richard

In IEEE International Workshop on Machine Learning for Signal Processing (MLSP) , Sep 2016

Bib PDF

@inproceedings{RS:MLSP-2016,
  author = {Serizel, Romain and Essid, Slim and Richard, Gael},
  title = {Mini-batch stochastic approaches for accelerated multiplicative updates in nonnegative matrix factorisation with beta-divergence},
  booktitle = {IEEE International Workshop on Machine Learning for Signal Processing (MLSP)},
  year = {2016},
  month = sep,
  keywords = {Nonnegative matrix factorisation, GPGPU, multiplicative rules, online learning},
  annote = {category=inproceedings language=en audience=2 state=submitted dept=tsi group=aao documentURL=http://biblio.telecom-paristech.fr/cgi-bin/download.cgi?id=16155 id=16155},
}

MACHINE LISTENING TECHNIQUES AS A COMPLEMENT TO VIDEO IMAGE ANALYSIS IN FORENSICS

R. Serizel, V. Bisot, S. Essid, and G. Richard

In The International Conference on Image Processing (ICIP) , Oct 2016

Bib PDF

@inproceedings{RS:ICIP-2016,
  author = {Serizel, Romain and Bisot, Victor and Essid, Slim and Richard, Gael},
  title = {Machine listening techniques as a complement to video image analysis in forensics},
  booktitle = {The International Conference on Image Processing (ICIP)},
  year = {2016},
  month = oct,
  annote = {category=inproceedings language=en audience=2 state=toappear dept=tsi group=aao documentURL=http://biblio.telecom-paristech.fr/cgi-bin/download.cgi?id=16154 id=16154},
}

ACOUSTIC SCENE CLASSIFICATION WITH MATRIX FACTORIZATION FOR UNSUPERVISED FEATURE LEARNING

V. Bisot, R. Serizel, S. Essid, and G. Richard

In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , Mar 2016

Bib PDF

@inproceedings{Bisot2016,
  author = {Bisot, Victor and Serizel, Romain and Essid, Slim and Richard, Gael},
  title = {Acoustic scene classification with matrix factorization for unsupervised feature learning},
  booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  address = {Shanghai, China},
  year = {2016},
  month = mar,
  keywords = {Acoustic scene classification, unsupervised feature learning, matrix factorization},
  annote = {category=inproceedings language=en audience=2 state=toappear dept=tsi group=aao id=15948},
}

GROUP NONNEGATIVE MATRIX FACTORISATION WITH SPEAKER AND SESSION VARIABILITY COMPENSATION FOR SPEAKER IDENTIFICATION

R. Serizel, S. Essid, and G. Richard

In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , Mar 2016

Bib PDF

@inproceedings{Serizel2016a,
  author = {Serizel, Romain and Essid, Slim and Richard, Gael},
  title = {Group nonnegative matrix factorisation with speaker and session variability compensation for speaker identification},
  booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  address = {Shanghai, China},
  year = {2016},
  month = mar,
  keywords = {Nonnegative matrix factorisation, spectrogram factorisation, feature learning, speaker variability, speaker identification},
  annote = {category=inproceedings language=en audience=2 state=toappear dept=tsi group=aao documentURL=http://biblio.telecom-paristech.fr/cgi-bin/download.cgi?id=15957 id=15957},
}

2015

CONTRIBUTIONS IN MACHINE LEARNING FOR MULTIMODAL DATA ANALYSIS: METHODS, ALGORITHMS AND SYSTEMS FOR TEMPORALLY STRUCTURED DATA

S. Essid

Université Pierre et Marie Curie , Sep 2015

Habilitation Thesis

Bib

@phdthesis{SE:hdr,
  author = {Essid, Slim},
  title = {Contributions in Machine Learning for Multimodal Data Analysis: Methods, Algorithms and Systems for Temporally Structured Data},
  school = {Université Pierre et Marie Curie},
  year = {2015},
  note = {Habilitation Thesis},
  month = sep
}

TPT-DANCE&ACTIONS : UN CORPUS MULTIMODAL D’ACTIVITES HUMAINES

A. Masurelle, A. Sekkat, S. Essid, and G. Richard

Revue Traitement du Signal, Sep 2015

Bib PDF

@article{Masurelle2015,
  title = {TPT-Dance\&Actions : un corpus multimodal d’activites humaines},
  number = {4},
  journal = {Revue Traitement du Signal},
  author = {Masurelle, Aymeric and Sekkat, A. Rida and Essid, Slim and Richard, Gael},
  year = {2015},
}

MELODY EXTRACTION BY CONTOUR CLASSIFICATION

R. Bittner, J. Salamon, S. Essid, and J. Bello

In International Conference on Music Information Retrieval (ISMIR) , Sep 2015

Bib PDF

@inproceedings{Bittner2015,
  author = {Bittner, Rachel and Salamon, Justin and Essid, Slim and Bello, J. P.},
  booktitle = {International Conference on Music Information Retrieval (ISMIR)},
  title = {Melody Extraction by Contour Classification},
  address = {Malaga, Spain},
  year = {2015},
}

HOG AND SUBBAND POWER DISTRIBUTION IMAGE FEATURES FOR ACOUSTIC SCENE CLASSIFICATION

V. Bisot, S. Essid, and G. Richard

In European Signal Processing Conference (EUSIPCO) , Sep 2015

Bib PDF

@inproceedings{Bisot2015,
  author = {Bisot, Victor and Essid, Slim and Richard, Gael},
  booktitle = {European Signal Processing Conference (EUSIPCO)},
  title = {HOG and subband power distribution image features for acoustic scene classification},
  address = {Nice, France},
  year = {2015},
}

A CONDITIONAL RANDOM FIELD SYSTEM FOR BEAT TRACKING

T. Fillon, C. Joder, S. Durand, and S. Essid

In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , Apr 2015

Bib PDF

@inproceedings{TF:ICASSP-15,
  author = {Fillon, Thomas and Joder, Cyril and Durand, Simon and Essid, Slim},
  title = {A Conditional Random Field System for Beat Tracking},
  booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  address = {Brisbane, Australia},
  year = {2015},
  month = apr,
  annote = {category=inproceedings language=en audience=2 state=toappear dept=tsi group=aao id=15366}
}

2014

SOFT NONNEGATIVE MATRIX CO-FACTORIZATION

N. Seichepine, S. Essid, C. Fevotte, and O. Cappe

IEEE Transactions on Signal Processing, Apr 2014

Bib PDF

@article{Seichepine_Essid_Fevotte_Cappe_2014,
  title = {Soft nonnegative matrix co-factorization},
  volume = {PP},
  url = {http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6908018},
  doi = {10.1109/TSP.2014.2360141},
  abstractnote = {This work introduces a new framework for nonnegative matrix factorization (NMF) in multisensor or multimodal data configurations, where taking into account the mutual dependence that exists between the related parallel streams of data is expected to improve performance. In contrast with previous works that focused on co-factorization methods, where some factors are shared by the different modalities, we propose a soft co-factorization scheme which accounts for possible local discrepancies across modalities or channels. This objective is formalized as an optimization problem where concurrent factorizations are jointly performed while being tied by a coupling term that penalizes differences between the related factor matrices associated with different modalities. We provide majorization-minimization (MM) algorithms for three common measures of fit, the squared Euclidean norm, the Kullback-Leibler divergence and the Itakura-Saito divergence, and two possible coupling variants, using either the l1 or the squared Euclidean norm of differences. The approach is shown to achieve promising performance in two audio-related tasks: multimodal speaker diarization using audiovisual data and audio source separation using stereo data.},
  number = {99},
  journal = {IEEE Transactions on Signal Processing},
  author = {Seichepine, Nicolas and Essid, Slim and Fevotte, Cedric and Cappe, Olivier},
  year = {2014},
}

PIECEWISE CONSTANT NONNEGATIVE MATRIX FACTORIZATION

N. Seichepine, S. Essid, C. Fevotte, and O. Cappe

In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) , May 2014

Bib PDF

@inproceedings{NS:ICASSP-14,
  author = {Seichepine, Nicolas and Essid, Slim and Fevotte, Cedric and Cappe, Olivier},
  title = {Piecewise constant Nonnegative matrix factorization},
  booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
  address = {Florence, Italy},
  year = {2014},
  month = may,
}

ASSESSMENT OF NEW SPECTRAL FEATURES FOR EEG-BASED EMOTION RECOGNITION

A. Conneau, and S. Essid

In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) , May 2014

Bib PDF

@inproceedings{AC:ICASSP-14,
  author = {Conneau, Anne-Claire and Essid, Slim},
  title = {Assessment of new spectral features for EEG-based emotion recognition},
  booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
  address = {Florence, Italy},
  year = {2014},
  month = may,
}

GESTURE RECOGNITION USING A NMF-BASED REPRESENTATION OF MOTION-TRACES EXTRACTED FROM DEPTH SILHOUETTES

A. Masurelle, S. Essid, and G. Richard

In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) , May 2014

Bib PDF

@inproceedings{AM:ICASSP-14,
  author = {Masurelle, Aymeric and Essid, Slim and Richard, Gael},
  title = {Gesture recognition using a NMF-based representation of motion-traces extracted from depth silhouettes},
  booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
  address = {Florence, Italy},
  year = {2014},
  month = may,
}

2013

CO-FACTORISATION DOUCE EN MATRICES NON-NEGATIVES. APPLICATION AU REGROUPEMENT MULTIMODAL DE LOCUTEURS

N. Seichepine, S. Essid, C. Fevotte, and O. Cappe

In GRETSI , Sep 2013

Bib

@inproceedings{gretsi-13,
  author = {Seichepine, Nicolas and Essid, Slim and Fevotte, Cedric and Cappe, Olivier},
  title = {Co-factorisation douce en matrices non-negatives. Application au regroupement multimodal de locuteurs},
  booktitle = {GRETSI},
  address = {Brest, France},
  year = {2013},
  month = sep,
  keywords = {NMF, multimodal},
  annote = {category=inproceedings language=fr audience=1 state=toappear dept=tsi group=aao,sta id=14269}
}

NONNEGATIVE TENSOR FACTORIZATION FOR SINGLE-CHANNEL EEG ARTIFACT REJECTION

C. Damon, A. Liutkus, A. Gramfort, and S. Essid

In IEEE International Workshop on Machine Learning for Signal Processing , Sep 2013

Bib PDF

@inproceedings{CD:MLSP-13,
  author = {Damon, Cecilia and Liutkus, Antoine and Gramfort, Alexandre and Essid, Slim},
  title = {Nonnegative Tensor Factorization for Single-Channel EEG Artifact Rejection},
  booktitle = {IEEE International Workshop on Machine Learning for Signal Processing},
  address = {Southampton, UK},
  year = {2013},
  month = sep,
  keywords = {EEG, NTF, NMF},
}

EXPLORING NEW FEATURES FOR MUSIC CLASSIFICATION

R. Foucard, S. Essid, G. Richard, and M. Lagrange

In International Workshop on Image and Audio Analysis for Multimedia Interactive Services (WIAMIS) , Jul 2013

Bib PDF

@inproceedings{RF:Wiamis2013,
  author = {Foucard, Remi and Essid, Slim and Richard, Gael and Lagrange, Mathieu},
  title = {Exploring new features for music classification},
  booktitle = {International Workshop on Image and Audio Analysis for Multimedia Interactive Services (WIAMIS)},
  address = {Paris, France},
  year = {2013},
  month = jul,
}

MULTIMODAL CLASSIFICATION OF DANCE MOVEMENTS USING BODY JOINT TRAJECTORIES AND STEP SOUNDS

A. Masurelle, S. Essid, and G. Richard

In International Workshop on Image and Audio Analysis for Multimedia Interactive Services (WIAMIS) , Jul 2013

Bib PDF

@inproceedings{AM:WIAMIS-13,
  author = {Masurelle, Aymeric and Essid, Slim and Richard, Gael},
  title = {Multimodal classification of dance movements using body joint trajectories and step sounds},
  booktitle = {International Workshop on Image and Audio Analysis for Multimedia Interactive Services (WIAMIS)},
  address = {Paris, France},
  year = {2013},
  month = jul,
}

PROBABILISTIC DANCE PERFORMANCE ALIGNMENT BY FUSION OF MULTIMODAL FEATURES

A. Dremeau, and S. Essid

In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) , May 2013

Bib PDF

@inproceedings{Dremeau2013a,
  author = {Dremeau, Angelique and Essid, Slim},
  title = {Probabilistic dance performance alignment by fusion of multimodal features},
  booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
  address = {Vancouver, Canada},
  year = {2013},
  month = may,
  annote = {category=inproceedings language=en audience=2 state=toappear dept=tsi group=aao id=13252},
}

SOFT NONNEGATIVE MATRIX CO-FACTORIZATION WITH APPLICATION TO MULTIMODAL SPEAKER DIARIZATION

N. Seichepine, S. Essid, C. Fevotte, and O. Cappe

In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) , May 2013

Bib PDF

@inproceedings{ICASSP-13-SEICHEPINE,
  author = {Seichepine, Nicolas and Essid, Slim and Fevotte, Cedric and Cappe, Olivier},
  title = {Soft Nonnegative Matrix Co-factorization with Application to Multimodal Speaker Diarization},
  booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
  address = {Vancouver, Canada},
  year = {2013},
  month = may,
  keywords = {Nonnegative matrix factorization, cofactorization, multimodality, speaker diarization},
  annote = {category=inproceedings language=en audience=2 state=toappear dept=tsi group=aao,sta id=13313},
}

NONNEGATIVE MATRIX FACTORIZATION FOR SINGLE-CHANNEL EEG ARTIFACT REJECTION

C. Damon, A. Liutkus, A. Gramfort, and S. Essid

In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) , May 2013

Bib PDF

@inproceedings{CD:ICASSP-13,
  author = {Damon, Cecilia and Liutkus, Antoine and Gramfort, Alexandre and Essid, Slim},
  title = {Nonnegative Matrix Factorization for Single-Channel EEG Artifact Rejection},
  booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
  address = {Vancouver, Canada},
  year = {2013},
  month = may,
  annote = {category=inproceedings language=en audience=2 state=toappear dept=tsi group=aao id=13264},
}

A MULTIMODAL APPROACH TO SPEAKER DIARIZATION ON TV TALK-SHOWS

F. Vallet, S. Essid, and J. Carrive

IEEE Transactions on Multimedia, May 2013

Abs Bib PDF

In this article, we propose solutions to the problem of speaker diarization of TV talk-shows, a problem for which adapted multimodal approaches, relying on streams of data other than audio alone, remain largely underexploited. Hence we propose an original system that leverages prior knowledge on the structure of this type of content, especially the visual information relating to the active speakers, for an improved diarization performance. The architecture of this system can be decomposed into two main stages. First, a reliable training set is created, in an unsupervised fashion, for each participant of the TV program being processed. This data is assembled by the association of visual and audio descriptors carefully selected in a clustering cascade. Then, Support Vector Machines are used for the classification of the speech data (of a given TV program). The performance of this new architecture is assessed on two French talk-show collections: Le Grand Échiquier and On n'a pas tout dit. The results show that our new system outperforms state-of-the-art methods, thus evidencing the effectiveness of kernel-based methods, as well as visual cues, in multimodal approaches to speaker diarization of challenging contents such as TV talk-shows.

@article{FV:TM-12,
  author = {Vallet, F. and Essid, Slim and Carrive, J.},
  journal = {IEEE Transactions on Multimedia},
  title = {A Multimodal Approach to Speaker Diarization on TV Talk-Shows},
  year = {2013},
  volume = {15},
  number = {3},
  pages = {509-520},
  keywords = {speaker recognition;support vector machines;French talk show collection;TV program;TV talk shows;audio descriptors;clustering cascade;diarization performance;kernel based method;multimodal approach;original system;reliable training set;speaker diarization;speech data;support vector machines;visual descriptors;Cameras;Databases;Microphones;NIST;Speech;TV;Visualization;Fusion;SVM classification;joint audiovisual processing;multimodality;speaker diarization;talk-show;unsupervised learning},
  doi = {10.1109/TMM.2012.2233724},
  issn = {1520-9210},
}

LEARNING OPTIMAL FEATURES FOR POLYPHONIC AUDIO-TO-SCORE ALIGNMENT

C. Joder, S. Essid, and G. Richard

IEEE Transactions on Audio, Speech, and Language Processing, May 2013

Bib PDF

@article{CJ:TASLP-13,
  author = {Joder, Cyril and Essid, Slim and Richard, Gael},
  journal = {IEEE Transactions on Audio, Speech, and Language Processing},
  title = {Learning Optimal Features for Polyphonic Audio-to-Score Alignment},
  year = {2013},
  volume = {21},
  number = {10},
  pages = {2118-2128},
  keywords = {audio signal processing;maximum likelihood estimation;signal representation;CQT-based representation;audio observations;conditional random fields model;discriminative framework;feature functions design;heuristic mappings;learning optimal features;linear transformation;maximum likelihood criterion;musical recording;polyphonic audio-to-score alignment;polyphonic music;spectrogram;symbolic representation;symmetric Kull-back-Leibler divergence;template construction;template vectors;temporal constraints;Music information retrieval;audio-to-score alignment;conditional random fields;discriminative learning},
  doi = {10.1109/TASL.2013.2266794},
  issn = {1558-7916},
}

SMOOTH NONNEGATIVE MATRIX FACTORIZATION FOR UNSUPERVISED AUDIOVISUAL DOCUMENT STRUCTURING

S. Essid and C. Fevotte

IEEE Transactions on Multimedia, May 2013

Bib PDF

@article{SE:TMM-12,
  author = {Essid, Slim and Fevotte, Cedric},
  journal = {IEEE Transactions on Multimedia},
  title = {Smooth Nonnegative Matrix Factorization for Unsupervised Audiovisual Document Structuring},
  year = {2013},
  volume = {15},
  number = {2},
  pages = {415-425},
  keywords = {audio signal processing;document handling;hidden Markov models;matrix decomposition;minimisation;unsupervised learning;video databases;video signal processing;Kullback-Leibler divergence;NMF algorithm;audio modality;audio speaker diarization;cost function;hidden Markov model;histogram-of-count;latent structuring pattern;majorization-minimization technique;person-oriented video structuring task;political debate video database;smooth nonnegative matrix factorization;temporal smoothness constraint;unsupervised audiovisual document structuring;visual modality;Data models;Feature extraction;Histograms;Indexing;Telecommunications;Visualization;Vocabulary;Bag of features;content structuring;indexing;machine learning;matrix factorization;unsupervised classification;videos},
  doi = {10.1109/TMM.2012.2228474},
  issn = {1520-9210},
}

2012

ANALYSIS OF DANCE MOVEMENTS USING GAUSSIAN PROCESSES

A. Liutkus, A. Dremeau, D. Alexiadis, S. Essid, and P. Daras

In ACM Multimedia , Nov 2012

Bib

@inproceedings{AL:ACM-12,
  author = {Liutkus, Antoine and Dremeau, Angelique and Alexiadis, D. and Essid, Slim and Daras, Petros},
  title = {Analysis of dance movements using Gaussian processes},
  booktitle = {ACM Multimedia},
  address = {Nara, Japan},
  year = {2012},
  month = nov,
  url = {http://hal.inria.fr/hal-00718791}
}

DECOMPOSING THE VIDEO EDITING STRUCTURE OF A TALK-SHOW USING NONNEGATIVE MATRIX FACTORIZATION

S. Essid and C. Fevotte

In International Conference on Image Processing (ICIP) , Oct 2012

Bib

@inproceedings{SE:ICIP-12,
  author = {Essid, Slim and Fevotte, Cedric},
  title = {Decomposing the Video Editing Structure of a Talk-show using Nonnegative Matrix Factorization},
  booktitle = {International Conference on Image Processing (ICIP)},
  address = {Orlando, FL, USA},
  year = {2012},
  month = oct,
  annote = {category=inproceedings language=en audience=2 state=toappear dept=tsi group=aao id=12528}
}

MULTIMODAL MUSIC PROCESSING

S. Essid and G. Richard

Oct 2012

Bib PDF

@inbook{SE:DFU-12,
  author = {Essid, Slim and Richard, Gael},
  chapter = {Fusion of Multimodal Information in Music Content Analysis},
  title = {Multimodal Music Processing},
  pages = {37--52},
  series = {Dagstuhl Follow-Ups},
  isbn = {978-3-939897-37-8},
  issn = {1868-8977},
  year = {2012},
  volume = {3},
  editor = {Muller, Meinard and Goto, Masataka and Schedl, Markus},
  publisher = {Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik},
  address = {Dagstuhl, Germany},
  url = {http://drops.dagstuhl.de/opus/volltexte/2012/3465},
  urn = {urn:nbn:de:0030-drops-34652},
  doi = {10.4230/DFU.Vol3.11041.37},
  annote = {Keywords: Multimodal music processing, music signals indexing and transcription, information fusion, audio, video},
}

A MULTI-MODAL DANCE CORPUS FOR RESEARCH INTO INTERACTION BETWEEN HUMANS IN VIRTUAL ENVIRONMENTS

S. Essid, X. Lin, M. Gowing, G. Kordelas, A. Aksay, P. Kelly, T. Fillon, Q. Zhang, A. Dielmann, V. Kitanovski, R. Tournemenne, A. Masurelle, E. Izquierdo, N. O'Connor, P. Daras, and G. Richard

Journal on Multimodal User Interfaces: Special issue on multimodal corpora, Oct 2012

Bib PDF

@article{SE:ICMI-12,
  author = {Essid, Slim and Lin, X. and Gowing, M. and Kordelas, G. and Aksay, A. and Kelly, P. and Fillon, Thomas and Zhang, Q. and Dielmann, A. and Kitanovski, V. and Tournemenne, R. and Masurelle, Aymeric and Izquierdo, E. and O'Connor, N. E. and Daras, Petros and Richard, Gael},
  title = {A multi-modal dance corpus for research into interaction between humans in virtual environments},
  journal = {Journal on Multimodal User Interfaces: Special issue on multimodal corpora},
  year = {2012},
}

AN ADVANCED VIRTUAL DANCE PERFORMANCE EVALUATOR

S. Essid, D. Alexiadis, R. Tournemenne, M. Gowing, P. Kelly, D. Monaghan, P. Daras, A. Dremeau, and N. O'Connor

In IEEE International Conference on Acoustics, Speech and Signal Processing , Mar 2012

Bib PDF

@inproceedings{SE:ICASSP-12b,
  author = {Essid, Slim and Alexiadis, D. and Tournemenne, R. and Gowing, M. and Kelly, P. and Monaghan, D. and Daras, Petros and Dremeau, Angelique and O'Connor, N. E.},
  title = {An Advanced Virtual Dance Performance Evaluator},
  booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing},
  address = {Kyoto, Japan},
  year = {2012},
  month = mar,
  annote = {category=inproceedings language=en audience=2 state=toappear project=audiosig dept=tsi group=aao id=12271},
}

A SINGLE-CLASS SVM BASED ALGORITHM FOR COMPUTING AN IDENTIFIABLE NMF

S. Essid

In IEEE International Conference on Acoustics, Speech and Signal Processing , Mar 2012

Bib PDF

@inproceedings{SE:ICASSP-12,
  author = {Essid, Slim},
  title = {A Single-class {SVM} Based Algorithm For Computing An Identifiable {NMF}},
  booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing},
  address = {Kyoto, Japan},
  year = {2012},
  month = mar,
  annote = {category=inproceedings language=en audience=2 state=toappear project=audiosig dept=tsi group=aao id=12270},
}

A REGRESSIVE BOOSTING APPROACH TO AUTOMATIC AUDIO TAGGING BASED ON SOFT ANNOTATOR FUSION

R. Foucard, S. Essid, M. Lagrange, and G. Richard

In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , Mar 2012

Bib PDF

@inproceedings{RF:ICASSP-12,
  author = {Foucard, Remi and Essid, Slim and Lagrange, Mathieu and Richard, Gael},
  title = {A Regressive Boosting Approach To Automatic Audio Tagging Based On Soft Annotator Fusion},
  booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  address = {Kyoto, Japan},
  year = {2012},
  month = mar,
  annote = {category=inproceedings language=en audience=2 state=toappear project=audiosig dept=tsi group=aao id=12269},
}

2011

A MULTIMODAL DANCE CORPUS FOR RESEARCH INTO REAL-TIME INTERACTION BETWEEN HUMANS IN ONLINE VIRTUAL ENVIRONMENTS

S. Essid, X. Lin, M. Gowing, G. Kordelas, A. Aksay, P. Kelly, T. Fillon, Q. Zhang, A. Dielmann, V. Kitanovski, R. Tournemenne, N. O'Connor, P. Daras, and G. Richard

In ICMI Workshop On Multimodal Corpora For Machine Learning , Nov 2011

Bib PDF

@inproceedings{SE:ICMI-11,
  author = {Essid, Slim and Lin, X. and Gowing, M. and Kordelas, G. and Aksay, A. and Kelly, P. and Fillon, Thomas and Zhang, Q. and Dielmann, A. and Kitanovski, V. and Tournemenne, R. and O'Connor, N. E. and Daras, Petros and Richard, Gael},
  title = {A multimodal dance corpus for research into real-time interaction between humans in online virtual environments},
  booktitle = {ICMI Workshop On Multimodal Corpora For Machine Learning},
  address = {Alicante, Spain},
  year = {2011},
  month = nov,
  annote = {category=inproceedings language=en audience=2 state=published project=audiosig dept=tsi group=aao id=12272},
}

AN AUDIO-DRIVEN VIRTUAL DANCE-TEACHING ASSISTANT

S. Essid, Y. Grenier, M. Maazaoui, G. Richard, and R. Tournemenne

In ACM Multimedia , Nov 2011

Bib PDF

@inproceedings{SE:ACM-MM-GC-2011,
  author = {Essid, Slim and Grenier, Yves and Maazaoui, Mounira and Richard, Gael and Tournemenne, R.},
  title = {An audio-driven virtual dance-teaching assistant},
  booktitle = {ACM Multimedia},
  address = {Scottsdale, Arizona, USA},
  year = {2011},
  month = nov,
  annote = {category=inproceedings language=fr audience=1 state=toappear project= dept=tsi group=aao id=11561}
}

ENHANCED VISUALISATION OF DANCE PERFORMANCE FROM AUTOMATICALLY SYNCHRONISED MULTIMODAL RECORDINGS

M. Gowing, P. Kelly, N. O'Connor, E. Izquierdo, V. Kitanovski, X. Lin, Q. Zhang, C. Concolato, S. Essid, J. Le Feuvre, and R. Tournemenne

In ACM Multimedia , Nov 2011

Bib PDF

@inproceedings{CC:ACM-MM-GC-2011,
  author = {Gowing, M. and Kelly, P. and O'Connor, N. E. and Izquierdo, E. and Kitanovski, V. and Lin, X. and Zhang, Q. and Concolato, C. and Essid, Slim and Le Feuvre, J. and Tournemenne, R.},
  title = {Enhanced Visualisation of Dance Performance from Automatically Synchronised Multimodal Recordings},
  booktitle = {ACM Multimedia},
  address = {Scottsdale, Arizona, USA},
  year = {2011},
  month = nov,
  annote = {category=inproceedings language=en audience=2 state=toappear project=m2 dept=tsi group=aao,mm id=11552}
}

AN INTERACTIVE SYSTEM FOR ELECTRO-ACOUSTIC MUSIC ANALYSIS

S. Gulluni, S. Essid, O. Buisson, and G. Richard

In International Conference on Music Information Retrieval (ISMIR) , Oct 2011

Bib PDF

@inproceedings{SG:ISMIR-11,
  author = {Gulluni, S. and Essid, Slim and Buisson, O. and Richard, Gael},
  title = {An interactive system for electro-acoustic music analysis},
  booktitle = {International Conference on Music Information Retrieval (ISMIR)},
  year = {2011},
  address = {Miami, U.S.A},
  month = oct,
}

MULTI-SCALE TEMPORAL FUSION BY BOOSTING FOR MUSIC CLASSIFICATION

R. Foucard, S. Essid, M. Lagrange, and G. Richard

In International Conference on Music Information Retrieval (ISMIR) , Oct 2011

Bib PDF

@inproceedings{RF:ISMIR-11,
  author = {Foucard, Remi and Essid, Slim and Lagrange, Mathieu and Richard, Gael},
  title = {Multi-scale temporal fusion by boosting for music classification},
  booktitle = {International Conference on Music Information Retrieval (ISMIR)},
  year = {2011},
  address = {Miami, U.S.A},
  month = oct,
}

OPTIMIZING THE MAPPING FROM A SYMBOLIC TO AN AUDIO REPRESENTATION FOR MUSIC-TO-SCORE ALIGNMENT

C. Joder, S. Essid, and G. Richard

In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) , Oct 2011

Bib PDF

@inproceedings{CJ:WASPAA-11,
  author = {Joder, Cyril and Essid, Slim and Richard, Gael},
  title = {Optimizing the Mapping from a Symbolic to an Audio Representation for Music-to-Score Alignment},
  booktitle = {IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
  year = {2011},
  address = {New Paltz, New York, U.S.A},
  month = oct,
}

NONNEGATIVE MATRIX FACTORIZATION FOR UNSUPERVISED AUDIOVISUAL DOCUMENT STRUCTURING

S. Essid and C. Fevotte

Oct 2011

Bib PDF

@techreport{SE:TREP-11,
  author = {Essid, Slim and Fevotte, Cedric},
  title = {Nonnegative matrix factorization for unsupervised audiovisual document structuring},
  year = {2011},
  institution = {HAL},
  number = {hal-00605886},
  url = {http://hal.archives-ouvertes.fr/hal-00605886/en/}
}

SEMANTIQUE ET MULTIMODALITE EN ANALYSE DE L'INFORMATION

G. Adda, G. Chollet, S. Essid, T. Fillon, M. Garnier-Rizet, C. Hory, and L. Beltaifa-Zouari

Oct 2011

Bib

@inbook{GA:IFMGBOOK,
  chapter = {Traitement des modalites "audio" et "parole"},
  title = {Semantique et multimodalite en analyse de l'information},
  publisher = {Hermes/Lavoisier},
  year = {2011},
  editor = {Campedel, Marine and Hoogstel, Pierre},
  author = {Adda, G. and Chollet, G. and Essid, Slim and Fillon, Thomas and Garnier-Rizet, M. and Hory, C. and Beltaifa-Zouari, L.},
  owner = {essid},
  timestamp = {2011.02.28}
}

TV CONTENT ANALYSIS: TECHNIQUES AND APPLICATIONS

F. Vallet, S. Essid, J. Carrive, and G. Richard

Oct 2011

Bib

@inbook{FV:TVBOOK-11,
  author = {Vallet, F. and Essid, Slim and Carrive, J. and Richard, G.},
  title = {TV Content Analysis: Techniques and Applications},
  editor = {Kompatsiaris, Y. and Merialdo, B. and Lian, S.},
  publisher = {CRC Press, Taylor \& Francis LLC},
  chapter = {High-level TV talk show structuring centered on speakers interventions},
  year = {2011}
}

INTERACTIVE CLASSIFICATION OF SOUND OBJECTS FOR POLYPHONIC ELECTRO-ACOUSTIC MUSIC ANNOTATION

S. Gulluni, S. Essid, O. Buisson, and G. Richard

In AES 42nd International Conference , Jul 2011

Bib PDF

@inproceedings{SG:AES-11,
  author = {Gulluni, S. and Essid, Slim and Buisson, O. and Richard, Gael},
  title = {Interactive Classification of Sound Objects for Polyphonic Electro-Acoustic Music Annotation},
  booktitle = {AES 42nd International Conference},
  year = {2011},
  month = jul,
  address = {Ilmenau, Germany}
}

MULTIMEDIA SEMANTICS: METADATA, ANALYSIS AND INTERACTION

R. Benmokhtar, B. Huet, G. Richard, T. Declerck, and S. Essid

Jul 2011

Bib

@inbook{RB:MMABOOK-11,
  chapter = {Feature Extraction for Multimedia Analysis},
  title = {Multimedia Semantics: Metadata, Analysis and Interaction},
  publisher = {Wiley},
  year = {2011},
  editor = {Troncy, R. and Huet, B. and Schenk, S.},
  author = {Benmokhtar, R. and Huet, B. and Richard, Gael and Declerck, T. and Essid, Slim},
  owner = {essid},
  timestamp = {2010.12.20}
}

MULTIMEDIA SEMANTICS: METADATA, ANALYSIS AND INTERACTION

S. Essid, M. Campedel, G. Richard, T. Piatrik, R. Benmokhtar, and B. Huet

Jul 2011

Bib

@inbook{SE:MMABOOK-11,
  chapter = {Machine Learning Techniques for Multimedia Analysis},
  title = {Multimedia Semantics: Metadata, Analysis and Interaction},
  publisher = {Wiley},
  year = {2011},
  editor = {Troncy, R. and Huet, B. and Schenk, S.},
  author = {Essid, Slim and Campedel, M. and Richard, Gael and Piatrik, T. and Benmokhtar, R. and Huet, B.},
  owner = {essid},
  timestamp = {2010.12.20}
}

A CONDITIONAL RANDOM FIELD FRAMEWORK FOR ROBUST AND SCALABLE AUDIO-TO-SCORE MATCHING

C. Joder, S. Essid, and G. Richard

IEEE Transactions on Audio, Speech and Language Processing, Nov 2011

Bib PDF

@article{CJ:TSALP-2011,
  author = {Joder, Cyril and Essid, Slim and Richard, Gael},
  title = {A Conditional Random Field Framework for Robust and Scalable Audio-to-Score
  	Matching},
  journal = {IEEE Transactions on Audio, Speech and Language Processing},
  volume = {19},
  pages = {2385--2397},
  number = {8},
  month = nov,
  year = {2011},
  owner = {essid},
  timestamp = {2011.02.10}
}

HIDDEN DISCRETE TEMPO MODEL: A TEMPO-AWARE TIMING MODEL FOR AUDIO-TO-SCORE ALIGNMENT

C. Joder, S. Essid, and G. Richard

In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , May 2011

Bib PDF

@inproceedings{CJ:ICASSP-11,
  author = {Joder, Cyril and Essid, Slim and Richard, Gael},
  title = {Hidden Discrete Tempo Model: a Tempo-aware Timing Model for Audio-to-Score
  	Alignment},
  booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing
  	(ICASSP)},
  year = {2011},
  address = {Prague, Czech Republic},
  month = may,
  owner = {essid},
  timestamp = {2011.01.28}
}

2010

A MULTIMODAL APPROACH TO INITIALISATION FOR TOP-DOWN SPEAKER DIARIZATION OF TELEVISION SHOWS

S. Bozonnet, F. Vallet, N. Evans, S. Essid, J. Carrive, and G. Richard

In European Signal Processing Conference (EUSIPCO) , Aug 2010

Bib PDF

@inproceedings{SB:Eusipco10,
  author = {Bozonnet, S. and Vallet, F. and Evans, N. and Essid, Slim and Carrive, J. and Richard, Gael},
  title = {A multimodal approach to initialisation for top-down speaker diarization
  	of television shows},
  booktitle = {European Signal Processing Conference (EUSIPCO)},
  year = {2010},
  address = {Aalborg, Denmark},
  month = aug,
  annote = {category=inproceedings state=toappear project=audiosig dept=tsi group=aao
  	id=10394},
  file = {:D\:\\documents\\myPage\\papers\\SB_EUSIPCO-10.pdf:PDF},
  owner = {essid},
  timestamp = {2010.06.24}
}

A CONDITIONAL RANDOM FIELD VIEWPOINT OF SYMBOLIC AUDIO-TO-SCORE MATCHING

C. Joder, S. Essid, and G. Richard

In ACM Multimedia 2010 , Oct 2010

Bib

@inproceedings{CJ:ACM-2010,
  author = {Joder, Cyril and Essid, Slim and Richard, Gael},
  title = {A Conditional Random Field Viewpoint of Symbolic Audio-to-Score Matching},
  booktitle = {ACM Multimedia 2010},
  year = {2010},
  address = {Florence, Italy},
  month = oct,
  owner = {essid},
  timestamp = {2010.07.06}
}

APPROCHE HIÉRARCHIQUE POUR UN ALIGNEMENT MUSIQUE-SUR-PARTITION EFFICACE

C. Joder, S. Essid, and G. Richard

In Compression et Représentation des Signaux Audiovisuels (CORESA) , Oct 2010

Received Young Researcher Award!

Bib PDF

@inproceedings{CJ:CORESA-2010,
  author = {Joder, Cyril and Essid, Slim and Richard, Gael},
  title = {Approche hiérarchique pour un alignement musique-sur-partition efficace},
  booktitle = {Compression et Représentation des Signaux Audiovisuels (CORESA)},
  year = {2010},
  address = {Lyon, France},
  month = oct,
  note = {Received Young Researcher Award!},
  annote = {category=inproceedings state=toappear project=audiosig dept=tsi group=aao
  	id=10418},
  file = {:D\:\\documents\\myPage\\papers\\CJ_CORESA-10.pdf:PDF},
  owner = {essid},
  timestamp = {2010.06.24}
}

A COMPARATIVE STUDY OF TONAL ACOUSTIC FEATURES FOR A SYMBOLIC LEVEL MUSIC-TO-SCORE ALIGNMENT

C. Joder, S. Essid, and G. Richard

In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , Mar 2010

Bib PDF

@inproceedings{CJ:ICASSP-2010,
  author = {Joder, Cyril and Essid, Slim and Richard, Gael},
  title = {A Comparative Study of tonal acoustic Features for a Symbolic Level
  	Music-to-Score Alignment},
  booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing
  	(ICASSP)},
  year = {2010},
  address = {Dallas, TX, US},
  month = mar,
  annote = {category=inproceedings state=toappear project=audiosig dept=tsi group=aao
  	id=10174},
  file = {CJ_ICASSP-2010.pdf:./CJ_ICASSP-10.pdf:PDF},
  owner = {essid},
  timestamp = {2010.03.10}
}

AN IMPROVED HIERARCHICAL APPROACH FOR MUSIC-TO-SYMBOLIC SCORE ALIGNMENT

C. Joder, S. Essid, and G. Richard

In International Conference on Music Information Retrieval (ISMIR) , Aug 2010

Bib PDF

@inproceedings{CJ:ISMIR-2010,
  author = {Joder, Cyril and Essid, Slim and Richard, Gael},
  title = {An Improved Hierarchical Approach for Music-to-Symbolic Score Alignment},
  booktitle = {International Conference on Music Information Retrieval (ISMIR)},
  year = {2010},
  address = {Utrecht, The Netherlands},
  month = aug,
  annote = {category=inproceedings state=toappear project=audiosig dept=tsi group=aao
  	id=10383},
  file = {:D\:\\documents\\myPage\\papers\\CJ_ISMIR-10.pdf:PDF},
  owner = {essid},
  timestamp = {2010.06.24}
}

YAAFE, AN EASY TO USE AND EFFICIENT AUDIO FEATURE EXTRACTION SOFTWARE

B. Mathieu, S. Essid, T. Fillon, J. Prado, and G. Richard

In International Conference on Music Information Retrieval (ISMIR) , Aug 2010

Bib PDF

@inproceedings{BM:ISMIR10,
  author = {Mathieu, B. and Essid, Slim and Fillon, Thomas and Prado, J. and Richard, Gael},
  title = {{YAAFE}, an Easy to Use and Efficient Audio Feature Extraction Software},
  booktitle = {International Conference on Music Information Retrieval (ISMIR)},
  year = {2010},
  address = {Utrecht, The Netherlands},
  month = aug,
  annote = {category=inproceedings state=toappear project=audiosig dept=tsi group=aao
  	id=10395},
  file = {:D\:\\documents\\myPage\\papers\\BM_ISMIR-10.pdf:PDF},
  owner = {essid},
  timestamp = {2010.06.24}
}

DESCRIPTEURS VISUELS ROBUSTES POUR L'IDENTIFICATION DE LOCUTEURS DANS DES EMISSIONS TELEVISEES DE TALK-SHOWS

F. Vallet, S. Essid, J. Carrive, and G. Richard

In Compression et Représentation des Signaux Audiovisuels (CORESA) , Oct 2010

Bib PDF

@inproceedings{FV:CORESA-10,
  author = {Vallet, F. and Essid, Slim and Carrive, J. and Richard, Gael},
  title = {Descripteurs visuels robustes pour l'identification de locuteurs dans des emissions televisees de talk-shows},
  booktitle = {Compression et Représentation des Signaux Audiovisuels (CORESA)},
  year = {2010},
  address = {Lyon, France},
  month = oct,
  annote = {category=inproceedings state=toappear project=audiosig dept=tsi group=aao
  	id=10393},
  file = {:D\:\\documents\\myPage\\papers\\FV_CORESA-10.pdf:PDF},
  owner = {essid},
  timestamp = {2010.06.24}
}

ROBUST VISUAL FEATURES FOR THE MULTIMODAL IDENTIFICATION OF UNREGISTERED SPEAKERS IN TV TALK-SHOWS

F. Vallet, S. Essid, J. Carrive, and G. Richard

In IEEE International Conference on Image Processing (ICIP) , Oct 2010

Bib PDF

@inproceedings{FV:ICIP10,
  author = {Vallet, F. and Essid, Slim and Carrive, J. and Richard, Gael},
  title = {Robust Visual Features for the Multimodal Identification of Unregistered Speakers in {TV} Talk-Shows},
  booktitle = {IEEE International Conference on Image Processing (ICIP)},
  year = {2010},
  month = oct,
  annote = {category=inproceedings state=toappear project=audiosig dept=tsi group=aao
  	id=10393},
  file = {:./FV_ICIP-10.pdf:PDF},
  owner = {essid},
  timestamp = {2010.06.24}
}

2009

INTERACTIVE SEGMENTATION OF ELECTRO-ACOUSTIC MUSIC

S. Gulluni, S. Essid, O. Buisson, E. Favreau, and G. Richard

In 2nd International Workshop on Machine Learning and Music (MML - ECML - PKDD) , Sep 2009

Bib PDF

@inproceedings{SG:MML-09,
  author = {Gulluni, S. and Essid, Slim and Buisson, O. and Favreau, E. and Richard, Gael},
  title = {Interactive Segmentation of Electro-Acoustic Music},
  booktitle = {2nd International Workshop on Machine Learning and Music (MML - ECML
  	- PKDD)},
  year = {2009},
  address = {Bled, Slovenia},
  month = sep,
  annote = {category=inproceedings state=published project=audiosig dept=tsi
  	group=aao id=10303},
  file = {SG_MML-09.pdf:./SG_MML-09.pdf:PDF},
  owner = {essid},
  timestamp = {2010.03.10}
}

ETUDE DES DESCRIPTEURS ACOUSTIQUES POUR L'ALIGNEMENT TEMPOREL AUDIO-SUR-PARTITION MUSICALE

C. Joder, S. Essid, and G. Richard

In GRETSI , Sep 2009

Bib PDF

@inproceedings{CJ:GRETSI-09,
  author = {Joder, Cyril and Essid, Slim and Richard, Gael},
  title = {Etude des descripteurs acoustiques pour l'alignement temporel audio-sur-partition musicale},
  booktitle = {GRETSI},
  year = {2009},
  address = {Dijon},
  month = sep,
  annote = {category=inproceedings state=published project=audiosig dept=tsi
  	group=aao id=9946},
  file = {CJ_GRETSI-09.pdf:./CJ_GRETSI-09.pdf:PDF},
  owner = {essid},
  timestamp = {2010.03.10}
}

TEMPORAL INTEGRATION FOR AUDIO CLASSIFICATION WITH APPLICATION TO MUSICAL INSTRUMENT CLASSIFICATION

C. Joder, S. Essid, and G. Richard

IEEE Transactions on Audio, Speech and Language Processing, Jan 2009

Bib PDF

@article{CJ:TASLP-08,
  author = {Joder, Cyril and Essid, Slim and Richard, Gael},
  title = {Temporal Integration for Audio Classification with Application to
  	Musical Instrument Classification},
  journal = {IEEE Transactions on Audio, Speech and Language Processing},
  year = {2009},
  volume = {17},
  pages = {174-186},
  number = {1},
  month = jan,
  annote = {category=article state=published doi=10.1109/TASL.2008.2007613 project=audiosig
  	dept=tsi group=aao documentURL=http://www.tsi.enst.fr/publications/enst/article-2009-8585.pdf
  	id=8585},
  file = {:./CJ_TASLP-09.pdf:PDF},
  url = {http://www.tsi.enst.fr/publications/enst/article-2009-8585.pdf}
}

INCORPORATING PRIOR KNOWLEDGE ON THE DIGITAL MEDIA CREATION PROCESS INTO AUDIO CLASSIFIERS

M. Lardeur, S. Essid, G. Richard, M. Haller, and T. Sikora

In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , Apr 2009

Bib PDF

@inproceedings{MaxL:ICASSP09,
  author = {Lardeur, M. and Essid, Slim and Richard, Gael and Haller, M. and Sikora, T.},
  title = {Incorporating prior knowledge on the digital media creation process
  	into audio classifiers},
  booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing
  	(ICASSP)},
  year = {2009},
  address = {Taipei, Taiwan},
  month = apr,
  annote = {category=inproceedings state=published project=audiosig dept=tsi
  	group=aao id=9336},
  file = {MaxL_ICASSP09.pdf:./ML_ICASSP-09.pdf:PDF},
}

RECONNAISSANCE DES INSTRUMENTS DANS LA MUSIQUE POLYPHONIQUE PAR DÉCOMPOSITION NMF ET CLASSIFICATION SVM

A. Ozerov, S. Essid, and M. Charbit

Apr 2009

Bib PDF

@techreport{AO:TREP-09,
  author = {Ozerov, Alexey and Essid, Slim and Charbit, M.},
  title = {Reconnaissance des instruments dans la musique polyphonique par décomposition {NMF} et classification {SVM}},
  institution = {TELECOM ParisTech},
  year = {2009},
  number = {2009D014},
  file = {AO_TREP-09.pdf:./AO_TREP-09.pdf:PDF},
  owner = {essid},
  timestamp = {2009.08.11}
}

2008

A COLLABORATIVE APPROACH TO AUTOMATIC RUSHES VIDEO SUMMARIZATION

W. Bailer, E. Dumont, S. Essid, and B. Mérialdo

In IEEE ICIP Workshop on Multimedia Information Retrieval: New Trends and Challenges , Oct 2008

Bib PDF

@inproceedings{SE:ICIP-08,
  author = {Bailer, W. and Dumont, E. and Essid, Slim and Mérialdo, B.},
  title = {A Collaborative Approach to Automatic Rushes Video Summarization},
  booktitle = {IEEE ICIP Workshop on Multimedia Information Retrieval: New Trends
  	and Challenges},
  year = {2008},
  month = oct,
  annote = {category=inproceedings state=published project=audiosig dept=tsi
  	group=aao id=8679},
  file = {SE_ICIP-08.pdf:./SE_ICIP-08.pdf:PDF},
}

A COLLABORATIVE APPROACH TO VIDEO SUMMARIZATION

E. Dumont, B. Merialdo, S. Essid, W. Bailer, D. Byrne, H. Bredin, N. O'Connor, G. Jones, M. Haller, A. Krutz, T. Sikora, and T. Piatrik

In 3rd International Conference on Semantic and Digital Media Technologies (SAMT) , Dec 2008

Bib PDF

@inproceedings{SE:SAMT-08,
  author = {Dumont, E. and Merialdo, B. and Essid, Slim and Bailer, W. and Byrne, D. and Bredin, H. and O'Connor, N. E. and Jones, G. J. F. and Haller, M. and Krutz, A. and Sikora, T. and Piatrik, T.},
  title = {A collaborative approach to video summarization},
  booktitle = {3rd International Conference on Semantic and Digital Media Technologies
  	(SAMT)},
  year = {2008},
  address = {Koblenz, Germany},
  month = dec,
  annote = {category=inproceedings state=published project=audiosig dept=tsi
  	group=aao id=8680},
  file = {SE_SAMT-08.pdf:./SE_SAMT-08.pdf:PDF},
}

RUSHES VIDEO SUMMARIZATION USING A COLLABORATIVE APPROACH

E. Dumont, B. Merialdo, S. Essid, W. Bailer, H. Rehatschek, D. Byrne, H. Bredin, N. O'Connor, G. Jones, A. Smeaton, M. Haller, A. Krutz, T. Sikora, and T. Piatrik

In TRECVID 2008, ACM International Conference on Multimedia Information Retrieval 2008 , Nov 2008

Bib PDF

@inproceedings{SE:TRECVID-08,
  author = {Dumont, E. and Merialdo, B. and Essid, Slim and Bailer, W. and Rehatschek, H. and Byrne, D. and Bredin, H. and O'Connor, N. E. and Jones, G. J. F. and Smeaton, A. F. and Haller, M. and Krutz, A. and Sikora, T. and Piatrik, T.},
  title = {Rushes video summarization using a collaborative approach},
  booktitle = {TRECVID 2008, ACM International Conference on Multimedia Information
  	Retrieval 2008},
  year = {2008},
  address = {Vancouver, BC, Canada},
  month = nov,
  annote = {category=inproceedings state=published project=audiosig dept=tsi
  	group=aao id=8681},
  file = {SE_TRECVID-08.pdf:./SE_ACM-08.pdf:PDF},
}

ALIGNMENT KERNELS FOR AUDIO CLASSIFICATION WITH APPLICATION TO MUSIC INSTRUMENT RECOGNITION

C. Joder, S. Essid, and G. Richard

In European Signal Processing Conference (EUSIPCO) , Aug 2008

Bib PDF

@inproceedings{CJ-EUSIPCO-2008,
  author = {Joder, Cyril and Essid, Slim and Richard, Gael},
  title = {Alignment Kernels for Audio Classification with Application to Music
  	Instrument Recognition},
  booktitle = {European Signal Processing Conference (EUSIPCO)},
  year = {2008},
  address = {Lausanne, Switzerland},
  month = aug,
  annote = {category=inproceedings state=published project=audiosig dept=tsi
  	group=aao documentURL=http://www.tsi.enst.fr/publications/enst/inproceedings-2008-8528.pdf
  	id=8528},
  file = {:./CJ_EUSIPCO-08.pdf:PDF},
  url = {http://www.tsi.enst.fr/publications/enst/inproceedings-2008-8528.pdf}
}

ON THE ROBUSTNESS OF AUDIO FEATURES FOR MUSICAL INSTRUMENT CLASSIFICATION

S. Wegener, M. Haller, J. Burred, T. Sikora, S. Essid, and G. Richard

In European Signal Processing Conference (EUSIPCO) , Sep 2008

Bib PDF

@inproceedings{SW:EUSIPCO08,
  author = {Wegener, S. and Haller, M. and Burred, J.-J. and Sikora, T. and Essid, Slim and Richard, Gael},
  title = {On the Robustness of Audio Features for Musical Instrument Classification},
  booktitle = {European Signal Processing Conference (EUSIPCO)},
  year = {2008},
  address = {Lausanne, Switzerland},
  month = sep,
  annote = {category=inproceedings state=published project=audiosig dept=tsi
  	group=aao id=8616},
  file = {SW_EUSIPCO08.pdf:./SW_EUSIPCO08.pdf:PDF},
}

2007

ON THE CORRELATION OF AUTOMATIC AUDIO AND VISUAL SEGMENTATIONS OF MUSIC VIDEOS

O. Gillet, S. Essid, and G. Richard

IEEE Transactions on Circuits and Systems for Video Technology, Mar 2007

Bib PDF

@article{OG:CVST-07,
  author = {Gillet, O. and Essid, Slim and Richard, Gael},
  title = {On the Correlation of Automatic Audio and Visual Segmentations of
  	Music Videos},
  journal = {IEEE Transactions on Circuits and Systems for Video Technology},
  year = {2007},
  month = mar,
  annote = {category=article state=published project=audiosig dept=tsi group=aao
  	id=6864},
  file = {OG_CVST-07.pdf:./OG_TCSVT-07.pdf:PDF},
}

TOWARDS POLYPHONIC MUSICAL INSTRUMENT RECOGNITION

G. Richard, P. Leveau, L. Daudet, S. Essid, and B. David

In International Congress on Acoustics (ICA) , Sep 2007

Bib PDF

@inproceedings{ICA:07,
  author = {Richard, Gael and Leveau, P. and Daudet, L. and Essid, Slim and David, Bertrand},
  title = {Towards polyphonic musical instrument recognition},
  booktitle = {International Congress on Acoustics (ICA)},
  year = {2007},
  address = {Madrid},
  month = sep,
  annote = {category=inproceedings state=published project=audiosig dept=tsi
  	group=aao id=7332},
  file = {ICA_07.pdf:./GR_ICA-07.pdf:PDF},
}

COMBINED SUPERVISED AND UNSUPERVISED APPROACHES FOR AUTOMATIC SEGMENTATION OF RADIOPHONIC AUDIO STREAMS

G. Richard, M. Ramona, and S. Essid

In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , Apr 2007

Bib PDF

@inproceedings{GR:Icassp-07,
  author = {Richard, Gael and Ramona, M. and Essid, Slim},
  title = {Combined supervised and unsupervised approaches for automatic segmentation
  	of radiophonic audio streams},
  booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing
  	(ICASSP)},
  year = {2007},
  address = {Honolulu, Hawaii},
  month = apr,
  annote = {category=inproceedings state=published project=audiosig dept=tsi
  	group=aao id=6862},
  file = {GR_Icassp-07.pdf:./GR_ICASSP-07.pdf:PDF},
}

K-SPACE AT TRECVID 2007

P. Wilkins, T. Adamek, D. Byrne, G. Jones, H. Lee, G. Keenan, K. McGuinness, N. O'Connor, A. Smeaton, A. Amin, Z. Obrenovic, R. Benmokhtar, E. Galmar, B. Huet, S. Essid, R. Landais, F. Vallet, G. Papadopoulos, S. Vrochidis, V. Mezaris, I. Kompatsiaris, E. Spyrou, Y. Avrithis, R. Morzinger, P. Schallauer, W. Bailer, T. Piatrik, K. Chandramouli, E. Izquierdo, M. Haller, L. Goldmann, A. Samour, A. Cobet, T. Sikora, and P. Praks

In TRECVID 2007 , Nov 2007

Bib PDF

@inproceedings{SE:TRECVID-07,
  author = {Wilkins, P. and Adamek, T. and Byrne, D. and Jones, G. J. F. and Lee, H. and Keenan, G. and McGuinness, K. and O'Connor, N. E. and Smeaton, A. F. and Amin, A. and Obrenovic, Z. and Benmokhtar, R. and Galmar, E. and Huet, B. and Essid, Slim and Landais, R. and Vallet, F. and Papadopoulos, G. T. and Vrochidis, S. and Mezaris, V. and Kompatsiaris, I. and Spyrou, E. and Avrithis, Y. and Morzinger, R. and Schallauer, P. and Bailer, W. and Piatrik, T. and Chandramouli, K. and Izquierdo, E. and Haller, M. and Goldmann, L. and Samour, A. and Cobet, A. and Sikora, T. and Praks, P.},
  title = {K-Space at TRECVid 2007},
  booktitle = {TRECVID 2007},
  year = {2007},
  month = nov,
  annote = {category=inproceedings state=published project=audiosig dept=tsi
  	group=aao documentURL=http://www.cdvp.dcu.ie/Papers/kspace-tv2007.pdf
  	id=7755},
  file = {:./SE_TRECVID-07.pdf:PDF},
  url = {http://www.cdvp.dcu.ie/Papers/kspace-tv2007.pdf}
}

2006

INSTRUMENT RECOGNITION IN POLYPHONIC MUSIC BASED ON AUTOMATIC TAXONOMIES

S. Essid, G. Richard, and B. David

IEEE Transactions on Audio, Speech, and Language Processing, Jan 2006

@article{SE:COD-06,
  author = {Essid, Slim and Richard, Gael and David, Bertrand},
  title = {Instrument recognition in polyphonic music based on automatic taxonomies},
  journal = {IEEE Transactions on Audio, Speech, and Language Processing},
  year = {2006},
  volume = {14},
  pages = {68-80},
  number = {1},
  month = jan,
  annote = {category=article state=published doi=10.1109/TSA.2005.860351 project=audiosig
  	dept=tsi group=aao id=5909},
  file = {SE_COD-06.pdf:./SE_TSALP-06b.pdf:PDF},
}

MUSICAL INSTRUMENT RECOGNITION BY PAIRWISE CLASSIFICATION STRATEGIES

S. Essid, G. Richard, and B. David

IEEE Transactions on Audio, Speech, and Language Processing, Jul 2006

@article{SE:COD-06-2,
  author = {Essid, Slim and Richard, Gael and David, Bertrand},
  title = {Musical instrument recognition by pairwise classification strategies},
  journal = {IEEE Transactions on Audio, Speech, and Language Processing},
  year = {2006},
  volume = {14},
  pages = {1401-1412},
  number = {4},
  month = jul,
  annote = {category=article state=published doi=10.1109/TSA.2005.860842 project=audiosig
  	dept=tsi group=aao id=5910},
  file = {SE_COD-06-2.pdf:./SE_TSALP-06a.pdf:PDF},
}

HIERARCHICAL CLASSIFICATION OF MUSICAL INSTRUMENTS ON SOLO RECORDINGS

S. Essid, G. Richard, and B. David

In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2006

@inproceedings{SE:ICASSP-06,
  author = {Essid, Slim and Richard, Gael and David, Bertrand},
  title = {Hierarchical Classification of Musical Instruments on Solo Recordings},
  booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing
  	(ICASSP)},
  year = {2006},
  address = {Toulouse, France},
  month = may,
  annote = {category=inproceedings state=published project=audiosig dept=tsi
  	group=aao id=6672},
  file = {SE_ICASSP-06.pdf:./SE_ICASSP-06.pdf:PDF},
}

2005

CLASSIFICATION AUTOMATIQUE DES SIGNAUX AUDIO-FRÉQUENCES: RECONNAISSANCE DES INSTRUMENTS DE MUSIQUE

S. Essid

Université Pierre et Marie Curie, Dec 2005

@phdthesis{SE:these,
  author = {Essid, Slim},
  title = {Classification automatique des signaux audio-fréquences:
  	reconnaissance des instruments de musique},
  school = {Université Pierre et Marie Curie},
  year = {2005},
  month = dec,
  annote = {category=phdthesis state=published project=audiosig dept=tsi group=aao
  	id=7638},
  file = {SE_these.pdf:./SE_PhD-05.pdf:PDF},
}

INSTRUMENT RECOGNITION IN POLYPHONIC MUSIC

S. Essid, G. Richard, and B. David

In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Mar 2005

@inproceedings{SE-ICASSP-05,
  author = {Essid, Slim and Richard, Gael and David, Bertrand},
  title = {Instrument recognition in polyphonic music},
  booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing
  	(ICASSP)},
  year = {2005},
  address = {Philadelphia, US},
  month = mar,
  annote = {category=inproceedings state=published project=audiosig dept=tsi
  	group=aao id=5067},
  file = {SE-ICASSP-05.pdf:./SE_ICASSP-05.pdf:PDF},
}

ON THE USEFULNESS OF DIFFERENTIATED TRANSIENT/STEADY-STATE PROCESSING IN MACHINE RECOGNITION OF MUSICAL INSTRUMENTS

P. Leveau, S. Essid, G. Richard, L. Daudet, and B. David

In AES convention, May 2005

@inproceedings{PL-AES-05,
  author = {Leveau, P. and Essid, Slim and Richard, Gael and Daudet, L. and David, Bertrand},
  title = {On the usefulness of differentiated transient/steady-state processing
  	in machine recognition of musical instruments},
  booktitle = {AES convention},
  year = {2005},
  address = {Barcelona, Spain},
  month = may,
  annote = {category=inproceedings state=published project=audiosig dept=tsi
  	group=aao id=5071},
  file = {PL-AES-05.pdf:./SE_AES-05.pdf:PDF},
}

2004

EFFICIENT MUSICAL INSTRUMENT RECOGNITION ON SOLO PERFORMANCES USING BASIC FEATURES

S. Essid, G. Richard, and B. David

In AES 25th conference, Jun 2004

@inproceedings{SE:AES-04,
  author = {Essid, Slim and Richard, Gael and David, Bertrand},
  title = {Efficient musical instrument recognition on solo performances using
  	basic features},
  booktitle = {AES 25th conference},
  year = {2004},
  address = {London, UK},
  month = jun,
  annote = {category=inproceedings state=published project=audiosig dept=tsi
  	group=aao,cod id=3608},
  file = {SE_AES-04.pdf:./SE_AES-04.pdf:PDF},
}

MUSICAL INSTRUMENT RECOGNITION ON SOLO PERFORMANCES

S. Essid, G. Richard, and B. David

In European Signal Processing Conference (EUSIPCO), Sep 2004

@inproceedings{SE:Eusipco-04,
  author = {Essid, Slim and Richard, Gael and David, Bertrand},
  title = {Musical instrument recognition on solo performances},
  booktitle = {European Signal Processing Conference (EUSIPCO)},
  year = {2004},
  address = {Vienna, Austria},
  month = sep,
  annote = {category=inproceedings state=published project=audiosig dept=tsi
  	group=aao,cod documentURL=http://www.enst.fr/~grichard/publications.htm
  	id=4893},
  file = {:D\:\\documents\\myPage\\papers\\SE_EUSIPCO-04.pdf:PDF},
  url = {http://www.enst.fr/~grichard/publications.htm}
}

MUSICAL INSTRUMENT RECOGNITION BASED ON CLASS PAIRWISE FEATURE SELECTION

S. Essid, G. Richard, and B. David

In International Conference on Music Information Retrieval (ISMIR), Oct 2004

@inproceedings{SE:ISMIR-04,
  author = {Essid, Slim and Richard, Gael and David, Bertrand},
  title = {Musical instrument recognition based on class pairwise feature selection},
  booktitle = {International Conference on Music Information Retrieval (ISMIR)},
  year = {2004},
  address = {Barcelona, Spain},
  month = oct,
  annote = {category=inproceedings state=published project=audiosig dept=tsi
  	group=aao,cod documentURL=http://www.enst.fr/~grichard/publications.htm
  	id=4892},
  file = {:./SE_ISMIR-04.pdf:PDF},
  url = {http://www.enst.fr/~grichard/publications.htm}
}

2003

MODÈLES SINUSOÏDAUX ÉTENDUS POUR LE CODAGE AUDIO

R. Boyer, S. Essid, K. Abed-Meraim, and N. Moreau

In Dix-neuvième colloque sur le Traitement du Signal et des Images, Sep 2003

@inproceedings{SE:GRETSI-03,
  author = {Boyer, R. and Essid, Slim and Abed-Meraim, K. and Moreau, N.},
  title = {Modèles sinusoïdaux étendus pour le codage
  	audio},
  booktitle = {Dix-neuvième colloque sur le Traitement du Signal et des
  	Images},
  year = {2003},
  address = {Paris, France},
  month = sep,
  annote = {category=inproceedings state=published project= dept=tsi group=aao,cod,tsac
  	id=3453},
  file = {SE_GRETSI-03.pdf:./RB_GRETSI-03.pdf:PDF},
}

2002

TRANSIENT MODELING WITH A FREQUENCY-TRANSFORM SUBSPACE ALGORITHM AND TRANSIENT + SINUSOIDAL SCHEME

R. Boyer and S. Essid

In 14th IEEE Int. Conf. on Digital Signal Proc., Jul 2002

We present an efficient modeling method for audio signals with a strong transient character. It is shown that the parametric non-stationary exponentially damped sinusoids (EDS) model permits good performance for time-domain modeling of quasi-stationary signals or "weak" transients. However, a decay in modeling performance is observed when dealing with highly non-stationary signals, as found in a variety of musical sounds (various percussion instruments, castanets, triangle, ...). The idea is then to process the signal in a well-chosen frequency-transform domain in which the transient temporal characteristics are better modeled by EDS. As a result, better representations of the transient signal class are obtained, with no pre-echo artifacts (energy before the attack) and a very good reproduction of the signal onset dynamics. Finally, an original "transient+sinusoidal" modeling scheme is proposed.
@inproceedings{RB:DSP-02, author = {Boyer, R. and Essid, Slim}, title = {Transient modeling with a Frequency-Transform Subspace Algorithm and Transient + Sinusoidal scheme}, booktitle = {14th IEEE Int. Conf. on Digital Signal Proc.}, year = {2002}, address = {Santorini (Greece)}, month = jul, annote = {category=inproceedings state=published project= dept=tsi group=cod id=2014}, file = {RB_DSP-02.pdf:./RB_DSP-02.pdf:PDF}, }
DYNAMIC TEMPORAL SEGMENTATION IN PARAMETRIC NON-STATIONARY MODELING FOR PERCUSSIVE MUSICAL SIGNALS

R. Boyer, S. Essid, and N. Moreau

In IEEE Int. Conf. on Multimedia and Expo (ICME), Aug 2002

An audio signal parametric modeling scheme is proposed that achieves higher performance in representing strong sound transients. The exponentially damped sinusoids (EDS) model is considered in association with a high-resolution parameter estimation approach. Such a technique is well adapted to almost every audio signal but is unfortunately not efficient when dealing with signals presenting strong temporal variations, such as percussive music signals: it causes pre-echo artifacts and weak onset dynamic reproduction, which are detrimental to listening. A system based on the EDS model has been developed, with a transient detector and dynamic time segmentation and modeling, that makes it possible to overcome such artifacts.
@inproceedings{RB:ICM-02, author = {Boyer, R. and Essid, Slim and Moreau, N.}, title = {Dynamic temporal segmentation in parametric non-stationary modeling for percussive musical signals}, booktitle = {IEEE Int. Conf. on Multimedia and Expo (ICME)}, year = {2002}, address = {Lausanne, Switzerland}, month = aug, annote = {category=inproceedings state=published project= dept=tsi group=cod id=2013}, file = {RB_ICM-02.pdf:./RB_ICM-02.pdf:PDF}, }
NON-STATIONARY SIGNAL PARAMETRIC MODELING TECHNIQUES WITH AN APPLICATION TO LOW BITRATE AUDIO CODING

R. Boyer, S. Essid, and N. Moreau

In 6th IEEE Int. Conf. Signal Processing, Aug 2002

Low bit rate audio coding often relies on Fourier representation despite its limitations for transient signal modeling. This study proposes alternative decompositions and expansion strategies that lead to more accurate modeling. Two classes of methods are considered, subspace decomposition methods and atomic decomposition methods, and their performances are compiled to propose an audio modeling scheme amenable to low bit rate coding.
@inproceedings{RB:ICS-02, author = {Boyer, R. and Essid, Slim and Moreau, N.}, title = {Non-stationary signal parametric modeling techniques with an application to low bitrate audio coding}, booktitle = {6th IEEE Int. Conf. Signal Processing}, year = {2002}, address = {Beijing, China}, month = aug, annote = {category=inproceedings state=published project= dept=tsi group=cod id=2012}, file = {RB_ICS-02.pdf:./RB_ICS-02.pdf:PDF}, }

CODEUR AUDIO PARAMÉTRIQUE BAS DÉBIT BASÉ SUR UN MODÈLE "SINUSOÏDES AMORTIES EXPONENTIELLEMENT + TRANSITOIRES + BRUIT"

S. Essid

Ecole Nationale Supérieure des Télécommunications (ENST), Oct 2002

@mastersthesis{SE:master-02,
  author = {Essid, Slim},
  title = {Codeur audio paramétrique bas débit basé
  	sur un modèle "Sinusoïdes Amorties Exponentiellement
  	+ Transitoires + Bruit"},
  school = {Ecole Nationale Supérieure des Télécommunications
  	(ENST)},
  year = {2002},
  month = oct,
  file = {dea.pdf:./SE_MastTh-02.pdf:PDF},
  owner = {essid},
  timestamp = {2009.08.11}
}

2001

EXPLORATION DE TECHNIQUES MODERNES DE MODÉLISATION ADAPTÉES À DU CODAGE AUDIO BAS-DÉBIT

R. Boyer, S. Essid, and N. Moreau

In 7èmes Journées d'Etudes et d'Echanges : Compression et Représentation des Signaux Audiovisuels (CORESA), Oct 2001

@inproceedings{Boyer_01,
  author = {Boyer, R. and Essid, Slim and Moreau, N.},
  title = {Exploration de techniques modernes de modélisation adaptées
  	à du codage audio bas-débit},
  booktitle = {7èmes Journées d'Etudes et d'Echanges : Compression
  	et Représentation des Signaux Audiovisuels (CORESA)},
  year = {2001},
  address = {Dijon, France},
  month = oct,
  annote = {category=inproceedings state=published project= dept=tsi group=cod
  	id=1971},
  file = {RB_CORESA-01.pdf:./RB_CORESA-01.pdf:PDF},
}