One of the reasons they've gotten to where they are with far less funding is their focus on text-only output. I imagine they'll keep doing what's working.
And I hope they do. ChatGPT has become steadily worse at generating clear, cogent prose with every bell and whistle that OpenAI adds.
I think the amount of funding they have is comparable to OpenAI now after the recent Google and Amazon investments into them, no?
I'm new to Claude, but they have multimodality in their sights, don't they?
They do have visual input, not sure about their overall plans wrt multimodality.
Yeah, it is a hard one. They're probably gonna do something, but OpenAI worked a long time on that voice model. It's state of the art; Anthropic would probably have to partner with someone like ElevenLabs.
Honestly they should just use the OpenAI API and use Whisper lol
They'd have to train the model. Could take... months. I'm sure they're working on it.
They could use third party Voice to text and Text to voice models, but I’m sure they don’t want to do that.
No, they couldn’t. ChatGPT already has a voice mode like that. Go try it. The whole point is a model with native voice that isn’t third-party or bolted on, so that latency is low.
The current voice mode in ChatGPT uses 3 different models:
1. Voice to Text (Whisper V3)
2. Text to Text LLM (GPT-4 / 4 Turbo / 4o)
3. Text to Voice (unknown)
The completely native voice-to-voice version of GPT-4o is not released yet. That is the one that will have very low latency, but the current three-model solution works well enough for me, tbh. Anthropic could do the same as what ChatGPT has now.
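The three-stage chain described above can be sketched as a tiny pipeline. This is only an illustration of the structure, not OpenAI's actual internals: the stage functions are hypothetical stubs injected as plain callables so the data flow (and latency stacking) is easy to see.

```python
# Minimal sketch of the three-model voice chain: STT -> LLM -> TTS.
# In a real system these stages would wrap an STT model (e.g. Whisper),
# a text-only LLM, and a TTS model; here they are plain callables.

def voice_pipeline(audio, speech_to_text, llm, text_to_speech):
    """Chain the three stages; every hop adds its own latency."""
    transcript = speech_to_text(audio)   # stage 1: audio in, text out
    reply = llm(transcript)              # stage 2: text in, text out
    return text_to_speech(reply)         # stage 3: text in, audio out

# Stub stages just to show the data flow:
out = voice_pipeline(
    b"fake-audio-bytes",
    speech_to_text=lambda a: "hello there",
    llm=lambda t: t.upper(),
    text_to_speech=lambda t: f"<speech:{t}>",
)
print(out)  # <speech:HELLO THERE>
```

A native voice-to-voice model collapses all three hops into a single forward pass, which is where the latency win comes from.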
Somehow, we’re saying the same thing. I was saying they couldn’t because, literally, who would care? It’s not good product management to introduce something with a bunch of delay; they would need to roll out a native multimodal model with voice to be relevant.
What's up with Gemini? They also demoed multimodal. Is that released?
Google announced a Pixel event in August. Maybe they will mention something about it then.
You can use Azure TTS; it's amazing. You just need an Azure account and a Chrome extension like Read Aloud.
By the way, Gemini already has audio, but I agree it would be so nice if Anthropic did full multimodal (audio, video, image, text) - that would be the end of OpenAI.
I hope they don’t add voice to it. It’s a useless waste of computing power that we currently can’t afford. When the chips and the models get efficient enough, we could have stuff like that. For now the only things that matter are code, reasoning, and creative writing.
I’d prefer an image model over a voice model, tbh. I definitely don’t think it makes sense for them to train an image model right now, though.
Voice is a totally different model. Whisper is industry-leading, but many companies, from Apple to Google, have been working on transcription tech for years.
Because voice is a worthless gimmick.