OpenAI doesn't publish many details about their models these days, but they do note that the "Advanced voice" mode in ChatGPT operates on audio input directly:
> Advanced voice uses natively multimodal models, such as GPT-4o, which means that it directly “hears” and generates audio, providing for more natural, real-time conversations that pick up on non-verbal cues, such as the speed you’re talking, and can respond with emotion.
From https://help.openai.com/en/articles/8400625-voice-mode-faq
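The ChatGPT app's Advanced voice mode itself isn't something you can call programmatically, but the same natively audio-capable GPT-4o models are exposed through OpenAI's API. Here's a minimal sketch of sending audio directly to one of those models via the Chat Completions API, assuming the `gpt-4o-audio-preview` model and a local `question.wav` file; treat the model name and file as placeholders:

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Base64-encode a local audio clip so it can be sent inline
with open("question.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    # Ask for both a text and an audio response
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What can you hear in this clip?"},
                {
                    "type": "input_audio",
                    "input_audio": {"data": audio_b64, "format": "wav"},
                },
            ],
        }
    ],
)

# The model "hears" the audio itself; the response includes a
# transcript alongside base64-encoded audio output
print(completion.choices[0].message.audio.transcript)
```

The point of the example is that no separate speech-to-text step happens on your side: the raw audio goes straight to the model, which is what lets it pick up on non-verbal cues like speaking speed that a transcript would throw away.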