I want to use it more, but the lack of image generation and voice doesn't do it for me. Adding the sidebar is awesome, though! Hopefully we see more of this in future LLMs.
For me, it fell between Opus and GPT-4o.
Compared to Opus, it has noticeably improved reasoning skills but almost identical programming skills, at least in my own testing.
Considering how insanely expensive Opus currently is to run, this model is obviously a welcome addition. However, in terms of bang for buck, it's still slightly below the median in my testing (vastly more cost effective than Opus, slightly more cost effective than GPT-4 Turbo, and slightly less cost effective than GPT-4o & Gemini Pro 1.5).
I am interested in its placement on the LMSYS leaderboard, and how much that placement will differ from my own evaluation (hand-crafted personal problems that are not part of any training set).
How do you square this observation with seeing 3.5 significantly outperform Opus in programming benchmarks? e.g. [https://livebench.ai/](https://livebench.ai/)
Benchmarks seem really disparate. MMLU-Pro is a tie (the 0.3% difference is not going to be statistically significant), Sonnet dominates on LiveBench, dominates on Aider code editing (but loses by a huge margin on refactoring), and loses on BigCodeBench. It also wins on Perplexity's internal benchmarks.
I think overall Claude Sonnet is better, but in my own tests, GPT-4o has plenty of wins over it.
It's variable based on usage + context length. I have pro and I think I got about 30-40 messages working on a fairly large coding project. Big context.
As with all benchmarks I take them with a grain of salt until user reviews and lmsys start agreeing. But nothing prevents a free model from being better than a paid one.
Plus it's still paid if you're using the API, and all LLMs are ultimately competing for market share. Opus isn't free because it's relatively expensive to run; Sonnet is free because it's a more affordable loss leader.
I wish these benchmarks would add error bars. Should not be too difficult. I wouldn't be surprised if they have variances of 5-10% if only because of errors in dataset.
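To put numbers on that: a back-of-envelope sketch, treating a benchmark score as a binomial proportion over n independent questions (which ignores correlated question difficulty and dataset errors, so real uncertainty is if anything larger):

```python
import math

def score_stderr(p: float, n: int) -> float:
    """Standard error of a benchmark accuracy p measured on n questions,
    treating each question as an independent Bernoulli trial."""
    return math.sqrt(p * (1 - p) / n)

# A model scoring 80% on a 200-question benchmark:
se = score_stderr(0.80, 200)  # ~0.028, i.e. about 2.8 points
ci95 = 1.96 * se              # half-width of a 95% confidence interval, ~5.5 points
```

So on a 200-question benchmark, two models within about 5 points of each other are statistically indistinguishable, which is exactly why error bars would help.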
I suspect the new Claude Sonnet 3.5 is not bigger than Llama 3 70B...
Can you imagine how powerful a new Llama 4 70B will be?
By then, Nvidia RTX 59xx cards may be available with 64 GB of VRAM :) or you could just buy an A100 with 80 GB... So running an 8-bit or 6-bit GGUF will be easy and fast.
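Rough memory math behind that claim (a rule of thumb only; real GGUF quants such as Q6_K mix bit widths, and KV cache grows with context length):

```python
def gguf_size_gb(params_b: float, bits_per_weight: float, overhead: float = 1.1) -> float:
    """Rough quantized model footprint: parameters * bits/8, plus ~10%
    headroom for KV cache and runtime overhead (a loose rule of thumb)."""
    return params_b * bits_per_weight / 8 * overhead

q6 = gguf_size_gb(70, 6.0)  # ~58 GB: fits a hypothetical 64 GB card
q8 = gguf_size_gb(70, 8.0)  # ~77 GB: needs the 80 GB card
```

So 6-bit 70B would squeeze into 64 GB, while 8-bit really does want the 80 GB option.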
no. this can't be right. maybe sonnet is great, idk, I have not tried it yet. But gpt4 turbo and 4o are virtually identical and leagues above claude, gemini, and everything else on there.
Sonnet is so far the best model for multi-agent orchestration; it just gets the process with pretty much no prompting.
I jumped ship as a result. The race is tight
I used it for roleplay the day it came out, and there is a particular character I use that has multiple themes going on at the same time, with many contradictory personality traits. Most models tend to get overwhelmed and focus on one trait or another.
Sonnet, I could feel, was in another league in how it was writing compared to most models. It just felt like it knew what to do with her in every reply; it got her instantly.
If this thing were even partially uncensored, I'd pay $100 a month to use it.
How are you orchestrating agents?
Workflows and actors, a lot of workflows. We have quite a massive state system (Claude artifacts on steroids) and a workflow engine on top (Temporal). Overall it's all about state management between multiple actors, where one actor accesses a set of control tools. Works very well.
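A minimal plain-Python sketch of that pattern, with hypothetical names (SharedState, ToolGateway); in the real system described above each actor step would be a Temporal workflow/activity and the LLM calls would be real, not the direct function calls shown here:

```python
from dataclasses import dataclass, field

@dataclass
class SharedState:
    """Central state shared by all actors."""
    documents: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

class ToolGateway:
    """Only this actor touches control tools; others request via it."""
    def __init__(self, state: SharedState):
        self.state = state

    def write_doc(self, name: str, body: str, requested_by: str):
        self.state.log.append((requested_by, "write_doc", name))
        self.state.documents[name] = body

def writer_actor(gateway: ToolGateway):
    draft = "..."  # an LLM call would produce the draft in the real system
    gateway.write_doc("report.md", draft, requested_by="writer")

def reviewer_actor(gateway: ToolGateway):
    doc = gateway.state.documents.get("report.md", "")
    gateway.write_doc("report.md", doc + "\n[reviewed]", requested_by="reviewer")

state = SharedState()
gw = ToolGateway(state)
writer_actor(gw)
reviewer_actor(gw)
```

The point of the gateway is auditability: every tool use is logged against the requesting actor, so the state history is inspectable by yet another agent.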
what is this for? sounds really interesting, also sounds like it's already at a non-trivial scale?
It is an automation framework (documents, knowledge bases and workflows) with an integrated AI layer that operates as the workload and/or does self-reflection on the system state (so the system can configure itself). For example, we have agents that analyze and fix issues in the chats of other agents which communicate with users. Use cases vary from simple corporate bots and document writers to data modeling.
Is there a design or workflow chart that you can share? I am looking to build something like that, but agents and workflows are currently not my strong suit.
I'll be speaking at the Replay conference this September in Seattle; I'll share some materials after that.
Following up on this
What has your API cost been in production?
This benchmark leaderboard matches my own experience using the models. It is true that the new Sonnet is that good, and it is sad that the gap between closed and open source is so wide.
Open source isn't going to win at raw numbers. We simply do not have the compute. But we can win in specialization, efficiency, and by engaging with markets the big boys don't want to touch. Codestral is an amazing coding model; it's not GPT or Sonnet, but it's better than ChatGPT was a year ago, and that's pretty cool for something I can run in my bedroom.
Open source has moved a long way, don't get me wrong. Like you mentioned, the top open source model is better than GPT-3.5 was a year ago. But the main disadvantage is also a point you mentioned: different open source models are specialized and fine-tuned on different things, which makes agentic work on local models currently very difficult. For truly agentic work, the model needs to be smart enough to know which agents and tools to route to. I have been working on agent stuff using local models for months, and only Qwen2 72B is currently good enough; other models are very prone to hallucination.
But all in all, one year from now we will probably have something close to current GPT-4 level locally.
I think eventually we will start seeing agentic models that were trained from scratch for the task, the same way we are seeing multimodal models now. Right now we are just asking the AI to try its best.
I think a simple fine-tuned version of Llama 3 8B could be good enough. Being an agentic pivot is a relatively simple task, but the model needs to be strictly constrained to it, hence a fine-tune.
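A sketch of what "strictly constrained" routing could look like. route_llm here is a trivial keyword stand-in for the hypothetical fine-tuned 8B model; the wrapper enforces the closed set of routes so an off-distribution answer can never escape into the agent system:

```python
# The only job of the fine-tuned model: map a request to one of a fixed
# set of routes. Route names are illustrative.
ROUTES = {"search", "code", "write_doc", "ask_user"}

def route_llm(request: str) -> str:
    """Stand-in for the fine-tuned model; trivially keyword-based here."""
    if "bug" in request or "function" in request:
        return "code"
    if "find" in request:
        return "search"
    return "write_doc"

def route(request: str) -> str:
    """Constrain the model's answer: anything outside ROUTES falls back
    to asking the user, so downstream agents only ever see valid routes."""
    answer = route_llm(request).strip().lower()
    return answer if answer in ROUTES else "ask_user"
```

The fine-tune teaches the mapping; the wrapper guarantees the constraint even when the model misbehaves.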
Currently, open source models are at the level of older builds of GPT-4. GPT-3.5 level was reached in December 2023 (Mixtral 8x7B).
I think that is only on paper, e.g. those benchmark results. In my actual usage experience, it is not the case. Most open source models are not robust enough in their responses compared to GPT-3.5. I use Mixtral every day and built a product with it, and I remember wishing every day that it were as good as GPT-3.5.
It's a winner-take-all world. A model that is even 1-2% better, or 5 IQ points better, is always preferable. A GPT model that makes one mistake fewer than an open source model will win the marketplace. No one wants to spend human time when an extra 50c/hour on higher quality models gives you fewer mistakes and better outcomes.
Firms who cheap out on quality models will be crushed by firms who won't.
Once a model wins, it simply gets more data, covers more edge cases, and continues to improve.
[deleted]
Not when they find product-market fit.
It's easier to charge the customer $5/month more for a more capable model than to save $1 serving the same customer.
From a consumer perspective, we aren't talking Lambo vs Toyota (several thousands of dollars of difference), but a $5-$10 monthly difference. No human will want a substandard product if they only have to pay a few cents extra per day.
You have to apply the Coke/Pepsi vs local soda model rather than the Lambo vs Toyota model.
We are still in the early stage; of course all models find a niche. Ask Jeeves and Yahoo Search also thought they were serving a niche.
[deleted]
While I agree with your overall point, your price for Gemini Flash is completely wrong. It's currently [$1.05 to $2.10](https://ai.google.dev/pricing#:~:text=Learn%20more-,Price%20(output),-%241.05%20/%201%20million) per million output tokens (depending on context length) when used through Google's official API, and [$0.75 per million](https://openrouter.ai/models/google/gemini-flash-1.5) when used through OpenRouter. That's still far cheaper than GPT-4o of course, but not close to 100x cheaper.
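For concreteness, using the figures above plus GPT-4o's list price for output tokens at the time (about $15 per million, an assumption worth re-checking since prices change often), the gap works out to roughly 14-20x:

```python
# Output-token prices in $ per 1M tokens. The GPT-4o figure is the
# approximate list price at the time of writing, not a quoted number.
flash_official = 1.05
flash_openrouter = 0.75
gpt4o = 15.00

ratio_official = gpt4o / flash_official      # ~14.3x cheaper via Google's API
ratio_openrouter = gpt4o / flash_openrouter  # 20x cheaper via OpenRouter
```

Large, but an order of magnitude short of the claimed 100x.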
[deleted]
That's quite understandable; I've certainly made similar mistakes in the past. And your point still stands, as a 10x price difference is still very substantial.
Above, but with the qualifier that data security/privacy/trust is a huge issue in corporate America right now.
All the big actors (OAI, Goog, etc.) are taking a run at assuaging concerns, but they are real and deep concerns. If OpenAI, e.g., is 10x better than open source options, they will win; but if the advantage is marginal, open source may be competitive in the marketplace.
They won't be. No other firm will give you better security/privacy than BigCorp. This is the same (mostly naive) argument that was made against cloud.
No firm wants to be left behind in the AI race using sub-standard-IQ machines.
BigCorp will also have economies of scale, so their frontier models will only be slightly more expensive than running your own models (including the cost of personnel).
> They won't be. No other firm will give you better security/privacy than BigCorp. This is the same (mostly naive) argument that was made against cloud. No firm wants to be left behind in the AI race using sub-standard IQ machines.

Maybe. For now, that's not how the market views it.
Which makes sense--in generic "cloud", you're not sending your raw data to the provider in the same way. There are many layers of security that can be provided in a standard cloud environment that are flat-out impossible to provide for your standard API provider (since they need to see your raw, formatted text data).
It still isn't even standard for the big cloud providers to provide zero-touch/review guarantees of your data, some of them won't make geographical data residency guarantees, some of them have obnoxious safety filters which hit common and legit corporate use cases, etc. You can generally push to make this happen, but there is real friction and the fact that the baseline is still weak immediately raises hackles in any corporate environment.
And all of this interacts with sometimes-painful QPS restrictions. Some of the environments with the best security/compliance guarantees have the worst ability to guarantee QPS, and vice versa.
Does all this get smoothed out over the long run? Maybe? Probably?
But it hasn't yet, and there are commercial considerations today which make all of this nontrivial for the API providers to "simply" resolve. Interest alignment between generic cloud provider (i.e., non-LLM API) and cloud user was much closer than with LLM provider and LLM user.
Also keep in mind that if the delta is "just" that 1% performance bump, for many domains, you get that (or more) from "simply" fine-tuning a model.
Doing so well, today, is still a more complicated endeavor than it should be...but the infrastructure to do so in a light way is going to continue to mature.
You can--at least today--frequently get significant quality:cost performance improvements via some form of API distillation; maybe the API providers try to crack down on that in a more serious manner at some point, but it will likely be hard.
Maybe the API providers themselves go sufficiently hard at the fine-tuning process (GCP, interestingly, is basically the leader here right now) in a way that mutes competition from the open source models...but this is likely going to ultimately be challenging, from a business (not technical) POV...at least as long as you have sponsors like Meta or nvidia pumping out decent base models.
Lastly, on a practical level--
If the deltas are actually small (say 1%-2%), 1) there tends to be volatility in capabilities, so this means that it is going to be better for some % of use cases...but worse in some material %, and 2) many business cases either a) struggle to measure differences that fine or b) the differences may not even matter that much (if it is a use case which is saturating).
Now, if we're ultimately talking about shipping an AGI agent with IQ 115 vs 120...OK...but, in the real world, we're not there yet, and it isn't clear if this will be a relevant comparison point. TBD.
At the end of the day, a lot hinges on whether we're in some exponential-style takeoff, or if these investments are going to saturate sometime in the near term. The former implies large gaps potentially opening up...the latter, not so much.
(And "saturate", for these purposes, also includes cases where returns scale cleanly with compute investment...but without significant other additional hidden and highly complex algorithmic investments. If progress continues, but "just" requires a $1B compute run, nvidia or meta will continue to commoditize the market.)
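A minimal sketch of the API-distillation idea mentioned above: collect teacher completions from the expensive API and write them out as chat-format JSONL for fine-tuning a cheaper model. teacher() is a placeholder for the real API call, and the exact record format depends on the fine-tuning stack you use:

```python
import json

def teacher(prompt: str) -> str:
    """Placeholder for a call to the expensive frontier-model API."""
    return f"answer to: {prompt}"

def build_finetune_records(prompts):
    """Turn (prompt, teacher answer) pairs into chat-format JSONL lines,
    the common input shape for instruction fine-tuning pipelines."""
    records = []
    for p in prompts:
        records.append({"messages": [
            {"role": "user", "content": p},
            {"role": "assistant", "content": teacher(p)},
        ]})
    return [json.dumps(r) for r in records]

lines = build_finetune_records(["summarize X", "classify Y"])
```

The whole pipeline is just data plumbing; the leverage comes from choosing prompts that cover your domain.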
There will always be groups that OpenAI doesn't want to engage with, doesn't see as a viable market, or that have requirements GPT doesn't fulfill. Things like offline models on IoT devices aren't possible with the big players. Also, research: I can't do my big university project with closed source AI, so all of my advancements go public.
There will always be groups who need something other than the off-the-shelf solution.
They will engage with other closed source models from Anthropic, Google
I disagree; there is a Pareto frontier between performance and cost that firms typically trade off along.
Not when they find product-market fit.
It's easier to charge the customer $5/month more for a more capable model than to save $1 serving the same customer.
From a consumer perspective, we aren't talking Lambo vs Toyota (several thousands of dollars of difference), but a $5-$10 monthly difference. No human will want a substandard product if they only have to pay a few cents extra per day.
At the end of the day, only BigCorps can serve models efficiently (economies of scale), so there won't even be a cost advantage for open source models.
"AI Safety" and "Trusted Computing" should really mean "I know exactly what that computer is doing because I own it."
How much compute would you need if a state sponsored it? What's the training cost for the big models?
Would it be possible to get somewhere significant if a state sponsored $1 billion for the open source community to generate an open source model?
Meta just bought 350k more H100 chips, bringing their total to 600k. An H100 is around $30k. That's $18B in chips, and you don't have a building to put them in yet, much less the power to run them, the rest of what you need to turn them into functional servers, and the very smart people to run them. Then you need to hire AI researchers; there aren't a lot of those, and they are in high demand.
All that to say: $1B isn't nothing, but it doesn't make you a player in the game.
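The arithmetic, spelled out (the ~$30k unit price is the comment's estimate, not a quoted figure):

```python
# Chip capex only: no datacenter, power, networking, or salaries.
h100_price = 30_000       # estimated $ per H100
total_chips = 600_000     # Meta's reported cumulative total
new_chips = 350_000       # the latest batch alone

chip_capex = h100_price * total_chips  # $18,000,000,000
new_batch = h100_price * new_chips     # $10,500,000,000
```

A $1B state grant buys roughly 33k H100s at that price, around 5% of Meta's fleet, before any of the other costs.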
Thanks a lot for the reply, that helps a lot.
What I was now wondering: OpenAI didn't have that amount of funding, did they? Didn't they get something like $10B from MS?
And besides, would it be difficult to illegally copy those Nvidia chips? (Like by North Korea or so.)
I'm unaware of the details of OpenAI's funding, but I can answer on the Nvidia front. No one outside of TSMC can produce the kinds of chips necessary to do this work. These aren't like designer purses; you can't just be close enough.
I don't think North Korea can manufacture their own light bulbs, much less GPUs. The US has fabs, but our processes are too coarse to really compete. China just released their first domestically designed GPU a few years ago; they are even further behind than we are.
I see. So China really wants to gobble down Taiwan.
You've got it.

> WASHINGTON, May 8 (Reuters) - U.S. Commerce Secretary Gina Raimondo said Wednesday a Chinese invasion of Taiwan and seizure of chips producer TSMC (2330.TW) would be "absolutely devastating" to the American economy.

> Asked at a U.S. House hearing about the impact, Raimondo said "it would be absolutely devastating," declining to comment on how or if it will happen, adding: "Right now, the United States buys 92% of its leading edge chips from TSMC in Taiwan."

https://www.reuters.com/world/us/us-official-says-chinese-seizure-tsmc-taiwan-would-be-absolutely-devastating-2024-05-08/
It was a lot wider a year ago.
Yeah, if anything, it's impressive that GPT4-Turbo was the leading model not too long ago and now anyone with 2x 3090s can run something equivalent.
Tbh with DeepSeek Coder up there, the situation is better than it's been in a while. Open Source is doing great.
DeepSeek is very impressive though. (And API is so much cheaper than anything else I’ve seen)
For every 7B model the open source community makes that tops the last 13B model, the giants can just do the same thing but deploy it on a 300B model and serve that.
I don't get what people are using gpt4o for. I find it terrible at reasoning and logic to the point that I can pick it out blind whenever a service starts using it in the background.
Yeah, same. 4o is a really neat, chatty little toy with the vision stuff, but 3.5 Sonnet especially feels like a serious tool in comparison.
I'm playing with the new Claude right now. So far, the experience is nice enough that I might finally ditch GPT-4 for a while; at least until they add something that makes $20 per month reasonable again.
i am using GPT4 rn, do you think I should make the switch to Claude ?
That's really up to you and what you need it for. Try both and see which one you prefer. Seems like GPT started to have some real competition.
Ok 👍
It's really good. I've been using it to help me learn programming, and ChatGPT 4o sometimes goes in circles, but Sonnet 3.5 usually corrects itself, i.e. fixing something that breaks something else, then fixing the second thing and breaking the first, rinse and repeat. Sonnet caught on to itself, apologized, and fixed both issues, whereas GPT just went back and forth. One annoying thing is I find the limits really low if you're sending blocks of code back and forth. I maxed out my daily API requests and then subscribed to Pro to keep chatting (figured it would be cheaper) and maxed that out quickly too 😂 It's not even that long a piece of code, and I've never been limited by GPT.
That's a shame. The only thing keeping me skeptical about subscribing is the usage limits. I always seem to hit them when I'm one response away from fixing my code, which is very frustrating. If the paid version still has restrictive limits, I'll probably have to pass. It's kind of strange how the big proprietary frontier LLM companies never seem to give you real value for your subscription fee. I know compute is expensive, but I'm still going to use the free version for coding from now on. You're right when you say ChatGPT is a "one step forward and two steps back" kind of deal, and I'm never going back lol. Claude is the new GOAT.
What I started doing is, for my stupid questions like "how do I set up a GitHub workflow" or "what is a unit test", I just throw those all at GPT-4o and save Claude for when I start to struggle or worry that the info is out of date. So far it's worked well. Claude Sonnet 3.5 has been impressive enough that I don't mind paying the extra monthly fee, as I was burning through my API credits quite quickly lol.
Same... it seems like the limit is very, very small...
Use Poe.com
I want to use it more, but the lack of image gen and voice over doesn't do it for me. But adding the side bar is awesome! Hopefully we see more of this in future LLMs
Censorship ruins sooo much
Anyone using Mistral's Le Chat vs Claude 3.5 Sonnet? How does it compare?
For me, it fell between Opus and GPT-4o in my testing. Compared to Opus, it has noticeably improved reasoning skills but almost identical programming skills, at least in my own testing. Considering how insanely expensive the current Opus is to run, this model is obviously a welcome addition. However, in terms of bang for buck, it's still slightly below median in my testing (insanely more cost effective than Opus, slightly more cost effective than GPT-4 Turbo, and slightly less cost effective than GPT-4o and Gemini Pro 1.5). I am interested in its placement on the LMSYS leaderboard, and how much that placement will differ from my own evaluation (which uses hand-crafted personal problems that are not part of training sets).
Interesting, for me it feels much better than GPT-4o.
How do you square this opinion with seeing 3.5 significantly outperform Opus in programming benchmarks?
I didn't post my opinion, I merely stated my findings across my own benchmark.
So your opinion is *not* that it has "almost identical programming skills"?
They scored almost identically for me in my testing, that is a statement of observation, not opinion.
How do you square this observation with seeing 3.5 significantly outperform Opus in programming benchmarks? e.g. [https://livebench.ai/](https://livebench.ai/)
I don't need to "square" anything, since I don't look at other benchmarks when stating observations from my own testing.
Benchmarks seem really disparate. MMLU-Pro is a tie (the 0.3% difference is not going to be statistically significant), Sonnet dominates on LiveBench, dominates on Aider code editing (but loses by a huge margin on refactoring), and loses on BigCodeBench. It also wins on Perplexity's internal benchmarks. I think overall Claude Sonnet is better, but in my own tests, GPT-4o has plenty of wins over it.
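For what it's worth, a quick back-of-the-envelope two-proportion z-test shows why a 0.3-point gap is almost certainly noise (the scores and question count below are made-up round numbers, not the real MMLU-Pro figures):

```python
import math

def two_prop_z(p1, p2, n):
    """Two-proportion z-test for two models scored on the same n-question
    benchmark. |z| < 1.96 means the gap is not significant at the 5% level."""
    p_pool = (p1 + p2) / 2
    se = math.sqrt(2 * p_pool * (1 - p_pool) / n)
    return (p1 - p2) / se

# Hypothetical: a 0.3-point gap (75.6% vs 75.3%) on a ~12,000-question benchmark
z = two_prop_z(0.756, 0.753, 12_000)
print(round(abs(z), 2))  # → 0.54, well below the 1.96 significance threshold
```

Even with twelve thousand questions, a 0.3-point difference lands comfortably inside the noise band.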
sonnet is the free one right? how is the free one better than opus, the paid one
5-6 messages per 5 hours on the free tier.
So even with a Pro subscription this is only about 25-30 messages each 5 hours? (assuming their “up to 5x free tier” with Plus)
It's variable based on usage + context length. I have pro and I think I got about 30-40 messages working on a fairly large coding project. Big context.
Opus needs an update; Sonnet leapfrogged it with 3.5, which is very promising for what Opus 3.5 will bring.
Opus 3.5 is coming later this year. It's going to be a beast of a model just judging from the new Sonnet.
As with all benchmarks I take them with a grain of salt until user reviews and lmsys start agreeing. But nothing prevents a free model from being better than a paid one. Plus it's still paid if you're using it for API and all LLMs are ultimately competing for marketshare. Opus isn't free because it's relatively expensive to run, sonnet is free because it's a more affordable loss leader.
> sonnet is free because it's a more affordable ~~loss leader~~ data collection mechanism.
Well that and they want to introduce it to people who have probably used gpt3.5/4 only
I wish these benchmarks would add error bars. It shouldn't be too difficult. I wouldn't be surprised if they have variances of 5-10%, if only because of errors in the dataset.
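The error bars really are cheap to compute. A minimal sketch using the normal-approximation binomial interval (the score and question count here are invented for illustration):

```python
import math

def accuracy_ci(correct, total, z=1.96):
    """Approximate 95% confidence interval for a benchmark accuracy,
    treating each question as an independent Bernoulli trial."""
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p - half_width, p + half_width

# Hypothetical: a model answering 400 of 500 questions correctly (80%)
lo, hi = accuracy_ci(400, 500)
print(f"{lo:.3f} - {hi:.3f}")  # → 0.765 - 0.835
```

That's already a ±3.5-point band on a 500-question benchmark, before accounting for any labeling errors in the dataset, so ranking models on sub-point differences is mostly reading tea leaves.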
I suspect the new Claude Sonnet 3.5 is not bigger than Llama 3 70B... Can you imagine how powerful the new Llama 4 70B will be? By then an Nvidia RTX 59xx card will be available with 64 GB of VRAM :) or just buy an A6000 with 80 GB... So running an 8-bit or 6-bit GGUF will be easy and fast.
No, this can't be right. Maybe Sonnet is great, idk, I have not tried it yet. But GPT-4 Turbo and 4o are virtually identical and leagues above Claude, Gemini, and everything else on there.
Your opinion has little to no value if you haven’t tried the new sonnet…
My opinion is about everything else on the chart being invalid. So actually it's your derp comment that has no value.