OpenAI doesn't publish many details about their models these days, but they do note that the "Advanced voice" mode in ChatGPT operates on audio input directly:
> Advanced voice uses natively multimodal models, such as GPT-4o, which means that it directly “hears” and generates audio, providing for more natural, real-time conversations that pick up on non-verbal cues, such as the speed you’re talking, and can respond with emotion.
From https://help.openai.com/en/articles/8400625-voice-mode-faq
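The ChatGPT app's Advanced voice mode itself isn't something you can call programmatically, but the same natively audio-capable GPT-4o models are exposed through OpenAI's API. Here's a minimal sketch of sending audio directly to one of those models via the Chat Completions API, assuming the `gpt-4o-audio-preview` model and a local `question.wav` file; treat the model name and file as placeholders:

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Base64-encode a local audio clip so it can be sent inline
with open("question.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    # Ask for both a text and an audio response
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What can you hear in this clip?"},
                {
                    "type": "input_audio",
                    "input_audio": {"data": audio_b64, "format": "wav"},
                },
            ],
        }
    ],
)

# The model "hears" the audio itself; the response includes a
# transcript alongside base64-encoded audio output
print(completion.choices[0].message.audio.transcript)
```

The point of the example is that no separate speech-to-text step happens on your side: the raw audio goes straight to the model, which is what lets it pick up on non-verbal cues like speaking speed that a transcript would throw away.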