Hi there,

My name is “Mohamed Anwar”, but you can call me “Anwar”. I’m on the verge of starting my PhD journey, so your well-wishes are greatly appreciated. My primary research interest lies in the Speech and Sound Processing domain, with a specific focus on audio-visual speech processing. Motivated by the bimodal perception of speech in humans, I’m driven to contribute to the development of the next generation of speech models capable of leveraging visual signals, such as facial expressions and body language, to expand their capabilities.

In 2023, I was very fortunate to be part of the AI residency program at Meta, where I was advised by Changhan Wang and Dr. Bowen Shi. I earned my African Master’s in Machine Intelligence (AMMI) degree in 2022. My master’s thesis, advised by Dr. Julia Kreutzer and Melvin Johnson, was on code-switching in machine translation. I also interned at Naver Labs Europe under the supervision of Prof. Laurent Besacier and Inyoung Kim. Prior to my academic pursuits, I was an R&D software engineer at IST Networks, where I was one of the main contributors to Nūn, an Arabic Text-to-Speech system, and botter, a chatbot with transferable skills.

Here is a selection of my research:


2024

  1. A Comprehensive Analysis of Human-centric Audio-Visual Learning in Speech: A Survey

    Coming soon (February 2024)!! …


  2. XLAVS-R: Cross-Lingual Audio-Visual Speech Representation from Efficient Modality Injection

    Coming soon (January 2024)!! …


2023

  1. MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation

    We introduce MuAViC, a multilingual audio-visual corpus for robust speech recognition and robust speech-to-text translation, providing 1200 hours of audio-visual speech in 9 languages. It is fully transcribed and covers 6 English-to-X as well as 6 X-to-English translation directions. To the best of our knowledge, this is the first open benchmark for audio-visual speech-to-text translation and the largest open benchmark for multilingual audio-visual speech recognition. Our baseline results show that MuAViC is effective for building noise-robust speech recognition and translation models. We make the corpus available at https://github.com/facebookresearch/muavic. …


2022

  1. The Effect of Alignment Objectives on Code-Switching Translation

    One capability machine translation models still lack is translating code-switched content, a need that has grown with the rise of social media and user-generated content. In this paper, we propose a way of training a single machine translation model that translates monolingual sentences from one language to another, as well as code-switched sentences into either language. This model can be considered a bilingual model in the human sense. To make better use of parallel data, we generated synthetic code-switched (CSW) data and applied an alignment loss on the encoder to align representations across languages (a toy sketch of both ideas appears at the end of this page). Using the WMT14 English-French (En-Fr) dataset, the trained model strongly outperforms bidirectional baselines on code-switched translation while maintaining quality for non-code-switched (monolingual) data. …


  2. True Bilingual Neural Machine Translation

    Bilingual machine translation permits training a single model that translates monolingual sentences from one language to another. However, a model is not truly bilingual unless it can translate back and forth in both language directions it was trained on, along with translating code-switched sentences into either language. We propose a true bilingual model trained on the WMT14 English-French (En-Fr) dataset. To make better use of parallel data, we generated synthetic code-switched (CSW) data and applied an alignment loss on the encoder to align representations across languages (sketched below). Our model strongly outperforms bilingual baselines on CSW translation while maintaining quality for non-code-switched data. …
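
Both 2022 papers rely on the same two ingredients: synthetic code-switched data generated from parallel sentence pairs, and an alignment loss on the encoder that pulls the two languages’ representations together. The sketch below is a toy illustration of those ideas, not the papers’ code: the proportional span mapping standing in for a word aligner, the cosine-distance loss, and every function name here are simplifying assumptions.

```python
# A minimal, illustrative sketch (assumed, not the papers' actual code) of:
#   (1) synthetic code-switched (CSW) data generation from a parallel pair,
#   (2) an encoder alignment loss over parallel sentence representations.
import random

import torch
import torch.nn.functional as F


def make_synthetic_csw(src_tokens, tgt_tokens, swap_ratio=0.3):
    """Replace a random contiguous span of the source sentence with the
    corresponding target-side span. The proportional index mapping below
    naively assumes a monotonic alignment; a real system would use a
    learned word aligner instead."""
    n = len(src_tokens)
    span = max(1, int(n * swap_ratio))
    start = random.randrange(0, n - span + 1)
    t_start = int(start / n * len(tgt_tokens))
    t_end = int((start + span) / n * len(tgt_tokens))
    return src_tokens[:start] + tgt_tokens[t_start:t_end] + src_tokens[start + span:]


def encoder_alignment_loss(src_states, tgt_states):
    """Cosine-distance loss between mean-pooled encoder states of a parallel
    sentence pair, encouraging language-agnostic encoder representations.
    src_states, tgt_states: (seq_len, hidden_dim) tensors."""
    src_vec = src_states.mean(dim=0)
    tgt_vec = tgt_states.mean(dim=0)
    return 1.0 - F.cosine_similarity(src_vec, tgt_vec, dim=0)


if __name__ == "__main__":
    en = "the cat sat on the mat".split()
    fr = "le chat est assis sur le tapis".split()
    print(" ".join(make_synthetic_csw(en, fr)))  # e.g. "the cat est on the mat"

    # Random tensors stand in for a real Transformer encoder's outputs.
    loss = encoder_alignment_loss(torch.randn(6, 512), torch.randn(7, 512))
    print(f"alignment loss: {loss.item():.3f}")
```

In a real training setup, the alignment term would typically be added to the usual cross-entropy translation loss with a small weight, nudging the encoder toward language-agnostic representations without degrading translation quality.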