Masterplan to PhD

Research Focus

My PhD research focuses on Human-Robot Interaction (HRI), specifically on improving Human-Robot Conversations (HRC) through visual models. The goal is to make Visual Voice Activity Detection (VVAD) and Active Speaker Detection (ASD) robust and applicable in HRI scenarios. This involves addressing several dimensions:

  • Transitioning from single-person (VVAD) to multi-person (ASD) interactions.
  • Exploring input sizes, ranging from:
    • Lip Features -> Face Features -> Lip Images -> Face Images -> Full Images.
  • Producing outputs such as:
    • Speaking/Not Speaking Labels and
    • Bounding Boxes for active speakers.
  • Optionally: exploring multimodal models for Audio-Visual Voice Activity Detection and Active Speaker Detection.
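
The input and output dimensions listed above can be made concrete with a small interface sketch. This is only an illustration under my own assumptions: the names InputLevel, SpeakerDetection, and active_speakers are hypothetical and do not refer to any existing implementation, and the pixel-coordinate convention for bounding boxes is an assumed choice.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class InputLevel(Enum):
    """Input representations from the list above, ordered from smallest to largest."""
    LIP_FEATURES = 1
    FACE_FEATURES = 2
    LIP_IMAGES = 3
    FACE_IMAGES = 4
    FULL_IMAGES = 5


@dataclass(frozen=True)
class SpeakerDetection:
    """Result for one person in one frame (hypothetical structure)."""
    speaking: bool  # Speaking / Not Speaking label
    # Assumed convention: (x_min, y_min, x_max, y_max) in pixels.
    # None when the input is already cropped to a single person (VVAD case).
    bounding_box: Optional[Tuple[int, int, int, int]] = None


def active_speakers(detections: List[SpeakerDetection]) -> List[SpeakerDetection]:
    """Keep only the detections labeled as speaking (multi-person ASD case)."""
    return [d for d in detections if d.speaking]


# Example: two people in a full image, one of them speaking.
frame = [
    SpeakerDetection(speaking=True, bounding_box=(10, 20, 110, 140)),
    SpeakerDetection(speaking=False, bounding_box=(200, 30, 290, 150)),
]
speakers = active_speakers(frame)
```

With this shape, single-person VVAD is the special case of one detection without a bounding box, while multi-person ASD on full images returns one detection per visible person.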

Research Pillars

My PhD research is structured around three main pillars:

1. Data

This pillar focuses on the VVAD-LRS3 dataset, which I have created, and the associated pipelines for working with the data. These pipelines ensure efficient data preprocessing, augmentation, and management to support robust model training.

2. Models

This pillar involves the development and training of models using the VVAD-LRS3 and other datasets. It also includes the algorithms for pre- and post-processing, as well as the implementation efforts to make these models accessible. This includes:

  • Optimized training pipelines.
  • Scalable and efficient model architectures.
  • Tools and libraries to facilitate easy integration.

3. Application

This pillar focuses on the use cases where the models are applied. It includes conducting experiments to validate the models in real-world scenarios and exploring their utility in various Human-Robot Interaction (HRI) contexts.

Research Process

Motivation

I believe it is crucial for humans to interact with robots in a natural way in the future. Currently, Human-Robot Conversations are highly one-dimensional, often resembling simple chatbots. However, the physical form, sensors, and actuators of robots allow for the integration of additional modalities, which can significantly enhance the interaction experience.

Goals

The primary goal of my PhD is to make VVAD and Active Speaker Detection not only robust but also practical for robotics applications. This includes:

  1. Developing models that are efficient and scalable:
    • Creating smaller versions of the models that can run on less powerful hardware, not only on high-performance GPUs.
  2. Providing accessible implementations:
    • Publishing the final implementation as a Python library.
    • Offering a ROS 2 node to make the results easily available for robotic systems.
    • Packaging the solution as a Docker container for out-of-the-box usability.

By achieving these goals, I aim to bridge the gap between cutting-edge research and practical applications in robotics, enabling more natural and multimodal human-robot interactions.

Publication Plan

Published Papers

  1. The VVAD-LRS3 Dataset for Visual Voice Activity Detection
    A. Lubitz, M. Valdenegro-Toro, F. Kirchner
    Received the Best Student Paper Award
    Published: 2021
    Presented at: 7th International Conference on Human Computer Interaction Theory and Applications

  2. A Bayesian Approach to Context-based Recognition of Human Intention for Context-Adaptive Robot Assistance in Space Missions
    A. Lubitz, O. Arriaga, T. Hassan, N. Hoyer, E.A. Kirchner
    Published: 2022
    Presented at: SpaceCHI 2.0: Human-Computer Interaction for Space

  3. CoBaIR: A Python Library for Context-Based Intention Recognition in Human-Robot-Interaction
    A. Lubitz, L. Gutzeit, F. Kirchner
    Published: 2023
    Presented at: 32nd IEEE International Conference on Robot and Human Interactive Communication

  4. Improving Human-Robot Communication in Noisy Environments with Visual Voice Activity Detection
    A. Gopikrishnan, A. Auer, L. Gutzeit
    Published: 2025
    Presented at: International Conference on Computer-Human Interaction Research and Applications

Planned Papers

  5. Review Paper: State of the Art of VVAD in HRI

  6. New Models for VVAD

  7. Comparison of VVAD Datasets

  8. Implementing Vision-Only ASD

  9. Experiments with Vision-Only ASD

  10. A Benchmark for VVAD in HRI

  11. A Wizard of Oz experiment for Gaze Backchannels

Possible Side Papers (Could Be Theses)

  1. Social Data Analyzer

  2. HelloRic Project Presentation Paper

  3. CoBaIR Evaluation

Possible Ideas for a Post-Doc

  1. Speech Prediction

  2. Mind Wandering Detection

  3. Video to Voice Filters