Depression Detection by Text-Audio Fusion

Description In various disciplines, unstructured information about the same object of investigation can be obtained through one or more different channels and stored in separate digital formats. We use the term "modality" for each such form, for instance, the audio or text modality.
In the area of deep learning, specific forms of neural networks, such as convolutional neural networks for images, were created for a number of modalities in order to fully leverage their potential. Nevertheless, a single modality rarely provides complete knowledge of the phenomenon of interest. For example, in many medical studies with extensive data collection, such as those on depression, one of the most common psychological conditions, several modalities are available for investigation. It is currently common to fuse multiple modalities either at a very early stage, as pre-trained representations, or at a late stage of the prediction process. However, many studies indicate that more meaningful representations can be learned, and deeper insights gained, through a more integral, attention-based fusion.
The aim of this study is to analyse text and audio depression data jointly and to find meaningful patterns.
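To make the contrast with early and late fusion concrete, the following is a minimal NumPy sketch of one possible cross-modal attention fusion step: each text-embedding step attends over all audio-feature frames, and the resulting audio summary is concatenated to the text representation. All function and variable names here are illustrative assumptions, not part of any specific system mentioned above (such as Deep Spectrum or auDeep).

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(text, audio):
    """Fuse two modalities: each text step attends over all audio frames.

    text:  (T_text, d)  sequence of word embeddings
    audio: (T_audio, d) sequence of audio feature frames
    Returns a (T_text, 2*d) fused sequence.
    """
    d = text.shape[-1]
    scores = text @ audio.T / np.sqrt(d)   # (T_text, T_audio) similarities
    weights = softmax(scores, axis=-1)     # attention over audio frames
    attended = weights @ audio             # audio summary per text step
    return np.concatenate([text, attended], axis=-1)

rng = np.random.default_rng(0)
text_seq = rng.normal(size=(5, 16))    # e.g. 5 word-embedding steps
audio_seq = rng.normal(size=(20, 16))  # e.g. 20 audio-feature frames
fused = cross_modal_attention(text_seq, audio_seq)
print(fused.shape)  # (5, 32)
```

In a full model, the fused sequence would typically be fed into a recurrent network for the downstream depression prediction, and the dot-product scores would be replaced by learned projections.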
Task In this thesis, the student(s) will analyse depression-related characteristics of spoken language (transcribed text and audio recordings) and implement a robust method to fuse multiple modalities and extract useful information.
Utilises Word Embeddings, Deep Spectrum System, auDeep, Attention Neural Networks, Recurrent Neural Networks.
Requirements Preliminary knowledge of Machine Learning and Natural Language Processing; good programming skills (e.g. Python, C++).
Languages English or German.
Supervisor Shahin Amiriparian, M. Sc. ( & Lukas Stappen, M. Sc. (