Abstract: Recently, various audio-visual speech recognition (AVSR) systems have been developed using multimodal learning techniques. One key issue is that most of them are based on 2D audio-visual ...