"Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation" accepted to EMNLP 2025!