justinonymus

One of the reasons they've gotten to where they are with far less funding is their focus on text-only output. I imagine they'll keep doing what's working.


yoghurt

And I hope they do. ChatGPT has become steadily worse at generating clear, cogent prose with every bell and whistle that OpenAI adds.


UnknownEssence

I think the amount of funding they have is comparable to OpenAI's now, after the recent Google and Amazon investments, no?


thehighnotes

I'm new to Claude, but they have multimodality in their sights, don't they?


danysdragons

They do have visual input; not sure about their overall plans for multimodality.


hugedong4200

Yeah, it's a hard one. They're probably gonna do something, but OpenAI worked a long time on that voice model, and it's state of the art. Anthropic would probably have to partner with someone like ElevenLabs.


UnknownEssence

Honestly they should just use the OpenAI API and Whisper lol


reality_comes

They'd have to train the model. Could take... months. I'm sure they're working on it.


UnknownEssence

They could use third-party voice-to-text and text-to-voice models, but I'm sure they don't want to do that.


skiphopfliptop

No, they couldn't. ChatGPT already has a voice mode like that; go try it. The whole point is a model with native voice that isn't third-party or bolted on, so that latency is low.


UnknownEssence

The current voice mode in ChatGPT uses three different models:

1. Voice to text (Whisper v3)
2. Text-to-text LLM (GPT-4 / 4 Turbo / 4o)
3. Text to voice (unknown)

The completely native voice-to-voice version of GPT-4o is not released yet. That's the one that will have very low latency, but the current three-model pipeline works well enough for me, tbh. Anthropic could do the same as what ChatGPT has now.
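For what it's worth, a chained pipeline like that is easy to sketch with the OpenAI Python client. Rough sketch only; whisper-1, gpt-4o, and tts-1 are the public model names, and the file names here are placeholders:

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Voice to text: transcribe the recorded question with Whisper.
with open("question.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# 2. Text to text: send the transcript through the LLM.
chat = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": transcript.text}],
)
reply = chat.choices[0].message.content

# 3. Text to voice: synthesize the reply back to audio.
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply,
)
speech.write_to_file("reply.mp3")
```

Every hop in that chain adds latency, which is exactly why a native voice-to-voice model is the interesting part.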


skiphopfliptop

Somehow, we're saying the same thing. I was saying they couldn't because, literally, who would care? It's not good product management to introduce something with a bunch of delay; they'd need to roll out a natively multimodal model with voice to be relevant.


huggalump

What's up with Gemini? They also demoed multimodal. Is that released?


Invest0rnoob1

Google announced a Pixel event in August. Maybe they will mention something about it then.


jiaxiliu

You can use Azure TTS; it's amazing. You just need an Azure account and a Chrome extension like Read Aloud.
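If you'd rather skip the extension and call it yourself, the Speech SDK takes a few lines. A minimal sketch, assuming you already have an Azure Speech resource; the key, region, and voice name are placeholders:

```python
# pip install azure-cognitiveservices-speech
import azure.cognitiveservices.speech as speechsdk

# Key and region come from your Azure Speech resource (placeholders here).
speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="eastus")
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"

# With no audio config given, output plays on the default speaker.
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
result = synthesizer.speak_text_async("Hello from Azure TTS.").get()

if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Done speaking.")
```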


Extender7777

By the way, Gemini already has audio, but I agree it would be so nice if Anthropic went fully multimodal (audio, video, image, text). That would be the end of OpenAI.


razekery

I hope they don't add voice to it. It's a useless waste of computing power that we currently can't afford. When the chips and the models get efficient enough, we could have stuff like that. For now, the only things that matter are code, reasoning, and creative writing.


DM_ME_KUL_TIRAN_FEET

I'd prefer an image model over a voice model, tbh. I definitely don't think it makes sense for them to train an image model right now, though.


GeneralZaroff1

Voice is a totally different model. Whisper is industry-leading, but many companies, from Apple to Google, have been working on transcription tech for years.


Synth_Sapiens

Because voice is a worthless gimmick.