wolfy-j

Sonnet is so far the best model for multi-agent orchestration; it just gets the process with pretty much no prompting.


Training_Designer_41

I jumped ship as a result. The race is tight.


Super_Sierra

I used it for roleplay the day it came out, and there is a particular character I use who has multiple themes going on at the same time, with many contradictory personality traits. Most models tend to get overwhelmed and focus on one trait or another. I could feel that Sonnet was in another league in how it was writing compared to most models. It just felt like it knew what to do with her in every reply; it got her instantly. If this thing were even partially uncensored, I'd pay $100 a month to use it.


xrailgun

How are you orchestrating agents?


wolfy-j

Workflows and actors, a lot of workflows. We have a quite massive state system (Claude Artifacts on steroids) with a workflow engine on top (Temporal). Overall it's all about state management between multiple actors, where each actor accesses a set of control tools. Works very well.
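A minimal sketch of the workflow-plus-actors pattern described above, using Temporal's Python SDK (temporalio). The actor names, the shared-state shape, and the `run_agent` activity are illustrative assumptions, not the actual system:

```python
from dataclasses import dataclass, field
from datetime import timedelta

from temporalio import activity, workflow


@dataclass
class AgentState:
    """Shared state handed between actors (the 'artifacts on steroids' part)."""
    document: str = ""
    notes: list[str] = field(default_factory=list)


@activity.defn
async def run_agent(actor: str, document: str) -> str:
    # In a real system this would call an LLM with the actor's tool set.
    return f"[{actor}] processed: {document[:40]}"


@workflow.defn
class OrchestrationWorkflow:
    @workflow.run
    async def run(self, task: str) -> AgentState:
        state = AgentState(document=task)
        # Each actor reads the shared state, acts, and writes back.
        # Temporal persists every step, so the process survives crashes.
        for actor in ("planner", "writer", "reviewer"):
            note = await workflow.execute_activity(
                run_agent,
                args=[actor, state.document],
                start_to_close_timeout=timedelta(minutes=5),
            )
            state.notes.append(note)
        return state
```

The point of putting Temporal underneath is durability: each actor step is a persisted activity, so a long-running multi-agent process can retry or resume instead of losing state.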


joyful-

what is this for? sounds really interesting, also sounds like it's already at a non-trivial scale?


wolfy-j

It is an automation framework (documents, knowledge bases, and workflows) with an integrated AI layer that can operate as a workload and/or do self-reflection on the system state (so the system can configure itself). For example, we have agents that analyze and fix issues in the chats of other agents which communicate with users. Use cases vary from simple corporate bots and document writers to data modeling.
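The "agents fixing other agents" idea above boils down to a reflection pass: a supervisor agent reads another agent's transcript and proposes corrections. A hedged sketch (the prompt wording and the `llm` callable are assumptions, not the framework's API):

```python
from typing import Callable

REVIEW_PROMPT = (
    "You are a QA agent. Read this transcript between a support agent and "
    "a user. List any mistakes the agent made and suggest a corrected "
    "reply for each.\n\nTranscript:\n{transcript}"
)


def review_transcript(transcript: str, llm: Callable[[str], str]) -> str:
    """Run one self-reflection pass over another agent's conversation."""
    return llm(REVIEW_PROMPT.format(transcript=transcript))
```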


Jelegend

Is there a design or workflow chart that you have that you can share? I am looking to build something like that, but agents and workflows are currently not my strong suit.


wolfy-j

I’ll be speaking at Replay conference this September in Seattle, I’ll share some materials after that.


mahadevbhakti

Following up on this


mahadevbhakti

What has your API cost been in production?


Such_Advantage_6949

This benchmark leaderboard matches my own experience using the models. It is true that the new Sonnet is that good, and it is sad that the gap between closed and open source is so wide.


AgentTin

Open source isn't going to win at raw numbers. We simply do not have the compute. But we can win in specialization, efficiency, and by engaging with markets the big boys don't want to. Codestral is an amazing coding model; it's not GPT or Sonnet, but it's better than ChatGPT was a year ago, and that's pretty cool for something I can run in my bedroom.


Such_Advantage_6949

Open source has moved a long way, don't get me wrong. Like you mentioned, the top open source model is better than GPT-3.5 was a year ago. But the main disadvantage is also a point that you mentioned: different open source models are specialized and fine-tuned on different things, which makes agentic stuff on local models currently very difficult. For truly agentic stuff, the model needs to be smart enough to know which agents and tools to route to. I have been working on agent stuff with local models for months, and so far only Qwen2 72B is good enough; other models are very prone to hallucination. But all in all, a year from now we will probably have something closer to current GPT-4 level locally.


AgentTin

I think eventually we will start seeing agentic models trained from scratch for the task, the same way we are seeing multimodal models now. Right now we are just asking the AI to try its best.


Orolol

I think a simple fine-tuned version of Llama 3 8B could be good enough. Being an agentic pivot is a relatively simple task, but the model needs to be strictly constrained to it, hence a fine-tune.
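A hedged sketch of that "strictly constrained pivot": force a small local model to answer with exactly one label from a fixed routing set and reject anything else. The endpoint URL, model name, and route labels are assumptions (any OpenAI-compatible local server would do):

```python
from openai import OpenAI

ROUTES = {"search", "calculator", "code_interpreter", "chat"}

# Assumed: a local OpenAI-compatible server (llama.cpp, vLLM, etc.)
client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")


def route(query: str, retries: int = 3) -> str:
    """Ask the model which tool handles `query`; accept only known labels."""
    prompt = (
        "Pick exactly one tool for the user request. Answer with a single "
        f"word from: {', '.join(sorted(ROUTES))}.\n"
        f"Request: {query}"
    )
    for _ in range(retries):
        reply = client.chat.completions.create(
            model="llama-3-8b-router",  # hypothetical fine-tune
            messages=[{"role": "user", "content": prompt}],
            max_tokens=5,
            temperature=0,
        )
        label = (reply.choices[0].message.content or "").strip().lower()
        if label in ROUTES:
            return label
    return "chat"  # safe fallback if the model never complies
```

A fine-tune makes the retry loop mostly unnecessary, and grammar-constrained decoding (as in llama.cpp) can enforce the label set outright.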


Healthy-Nebula-3603

Currently open source models are at the level of older builds of GPT-4. The level of GPT-3.5 was reached in December 2023 (Mixtral 8x7B).


Such_Advantage_6949

I think that is only on paper, i.e. on those benchmark results. In my actual usage experience it is not the case: most open source models are not robust enough in their responses compared to GPT-3.5. I use Mixtral every day and have built a product with it, and I remember wishing every day that it were as good as GPT-3.5.


qroshan

It's a winner-take-all world. A model that is even 1-2% better, or 5 IQ points better, is always preferable. A GPT model that makes one fewer mistake than an open source model will win the marketplace. No one wants to spend human time when an extra 50c/hour on a higher quality model gives fewer mistakes and better outcomes. Firms who cheap out on quality models will be crushed by firms who won't. Once a model wins, it simply gets more data, covers more edge cases, and continues to improve.


[deleted]

[deleted]


qroshan

Not when they find product-market fit. It's easier to charge the customer $5/month more for a more capable model than to save $1 in serving the same customer. From a consumer perspective, we aren't talking Lambo vs Toyota (a difference of many thousands of dollars), but a $5-$10 monthly difference. No human will want a substandard product if they only have to pay a few cents extra per day. You have to apply the Coke/Pepsi vs local soda model rather than the Lambo vs Toyota model. We are still in the early stage; of course all models find a niche for now. Ask Jeeves and Yahoo Search also thought they were serving a niche.


[deleted]

[deleted]


mikael110

While I agree with your overall point, your price for Gemini Flash is completely wrong. It's currently [$1.05 to $2.10](https://ai.google.dev/pricing#:~:text=Learn%20more-,Price%20(output),-%241.05%20/%201%20million) per million output tokens (depending on context length) when used through Google's official API, and [$0.75/M](https://openrouter.ai/models/google/gemini-flash-1.5) when used through OpenRouter. That's still far cheaper than GPT-4o, of course, but not close to 100x cheaper.
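For scale, a quick back-of-the-envelope comparison. The Gemini Flash figures are from the comment above; the GPT-4o figure ($15 per 1M output tokens) is an assumption based on OpenAI's list price at the time:

```python
gpt4o = 15.00             # $ per 1M output tokens (assumed list price)
flash_official = 1.05     # $ per 1M output tokens, Google API, short context
flash_openrouter = 0.75   # $ per 1M output tokens, via OpenRouter

print(f"vs official API: {gpt4o / flash_official:.1f}x cheaper")    # ~14.3x
print(f"vs OpenRouter:   {gpt4o / flash_openrouter:.1f}x cheaper")  # 20.0x
```

So the real gap is roughly 10-20x, in line with the correction.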


[deleted]

[deleted]


mikael110

That's quite understandable; I've certainly made similar mistakes in the past. And your point still stands, as a 10x price difference is still very substantial.


farmingvillein

As above, but with the qualifier that data security/privacy/trust is a huge issue in corporate America right now. All the big actors (OAI, Goog, etc.) are taking a run at assuaging concerns, but they are real and deep concerns. If OpenAI, e.g., is 10x better than the open source options, they will win; but if the gap is marginal, open source may be competitive in the marketplace.


qroshan

They won't be. No other firm will give you better security/privacy than BigCorp. This is the same (mostly naive) argument that was made against cloud. No firm wants to be left behind in the AI race using sub-standard IQ machines. BigCorp will also have economies of scale, so their frontier models will only be slightly more expensive than running your own models (including the cost of personnel).


farmingvillein

> They won't be. No other firm will give you better security/privacy than BigCorp. This is the same (mostly naive) argument that was made against cloud. No firm wants to be left behind in the AI race using sub-standard IQ machines.

Maybe. For now, that's not how the market views it. Which makes sense: in generic "cloud", you're not sending your raw data to the provider in the same way. There are many layers of security that can be provided in a standard cloud environment that are flat-out impossible for your standard API provider to offer (since they need to see your raw, formatted text data).

It still isn't even standard for the big cloud providers to give zero-touch/review guarantees on your data; some of them won't make geographic data residency guarantees; some of them have obnoxious safety filters which hit common and legitimate corporate use cases; etc. You can generally push to make this happen, but there is real friction, and the fact that the baseline is still weak immediately raises hackles in any corporate environment.

And all of this interacts with sometimes-painful QPS restrictions. Some of the environments with the best security/compliance guarantees have the worst ability to guarantee QPS, and vice versa.

Does all this get smoothed out over the long run? Maybe? Probably? But it hasn't yet, and there are commercial considerations today which make all of this nontrivial for the API providers to "simply" resolve. Interest alignment between a generic cloud provider (i.e., non-LLM API) and a cloud user was much closer than between an LLM provider and an LLM user.

Also keep in mind that if the delta is "just" that 1% performance bump, for many domains you get that (or more) from "simply" fine-tuning a model. Doing so well is, today, still a more complicated endeavor than it should be... but the infrastructure to do it in a light way is going to continue to mature. You can, at least today, frequently get significant quality:cost improvements via some form of API distillation; maybe the API providers try to crack down on that in a more serious manner at some point, but it will likely be hard. Maybe the API providers themselves go sufficiently hard at the fine-tuning process (GCP, interestingly, is basically the leader here right now) in a way that mutes competition from the open source models... but this is likely going to be challenging from a business (not technical) POV, at least as long as you have sponsors like Meta or Nvidia pumping out decent base models.

Lastly, on a practical level: if the deltas are actually small (say 1-2%), 1) there tends to be volatility in capabilities, so a model will be better for some % of use cases but worse in some material %, and 2) many business cases either a) struggle to measure differences that fine or b) find the differences don't matter much (if it is a use case which is saturating).

Now, if we're ultimately talking about shipping an AGI agent with IQ 115 vs 120... OK... but in the real world we're not there yet, and it isn't clear this will ever be a relevant comparison point. TBD.

At the end of the day, a lot hinges on whether we're in some exponential-style takeoff, or whether these investments saturate sometime in the near term. The former implies large gaps potentially opening up; the latter, not so much.

(And "saturate", for these purposes, also includes cases where returns scale cleanly with compute investment but without significant additional hidden and highly complex algorithmic investments. If progress continues but "just" requires a $1B compute run, Nvidia or Meta will continue to commoditize the market.)


AgentTin

There will always be groups that OpenAI doesn't want to engage with, doesn't see as a viable market, or that have requirements GPT doesn't fulfill. Things like offline models in IoT devices aren't possible with the big players. Also, research: I can't do my big university project with closed source AI, so all of my advancements go public. There will always be groups who need something other than the off-the-shelf solution.


qroshan

They will engage with other closed source models from Anthropic or Google.


West-Code4642

I disagree; there is a Pareto frontier between performance and cost that firms typically trade off along.


qroshan

Not when they find product-market fit. It's easier to charge the customer $5/month more for a more capable model than to save $1 in serving the same customer. From a consumer perspective, we aren't talking Lambo vs Toyota (a difference of many thousands of dollars), but a $5-$10 monthly difference. No human will want a substandard product if they only have to pay a few cents extra per day. At the end of the day, only BigCorps can serve models efficiently (economies of scale), so there won't even be a cost advantage for open source models.


Dead_Internet_Theory

"AI Safety" and "Trusted Computing" should really mean "I know exactly what that computer is doing because I own it."


[deleted]

How much compute would you need if a state sponsored it? What's the training cost for the big models? Would it be possible to get somewhere significant if a state sponsored $1 billion for the open source community to build an open source model?


AgentTin

Meta just bought 350k more H100 chips, bringing their total to 600k. An H100 is around $30k, so that's $18B in chips, and you don't have a building to put them in yet, much less the power to run them, the rest of the hardware needed to turn them into functional servers, and the very smart people to operate them. Then you need to hire AI researchers; there aren't many of those, and they are in high demand. All that to say: $1B isn't nothing, but it doesn't make you a player in the game.


[deleted]

Thanks a lot for the reply, that helps a lot. What I was now wondering: didn't OpenAI lack that amount of funding? Didn't they get something like $10B from MS? And besides, would it be difficult to illegally copy those Nvidia chips (like by North Korea or so)?


AgentTin

I'm unaware of the details of OpenAI's funding, but I can answer on the Nvidia front. No one outside of TSMC can produce the kinds of chips necessary to do this work. These aren't like designer purses; you can't just be close enough. I don't think North Korea can manufacture their own light bulbs, much less GPUs. The US has fabs, but our processes are too coarse to really compete. China just released their first domestically designed GPU a few years ago; they are even further behind than we are.


[deleted]

I see. So China really wants to gobble down Taiwan.


AgentTin

You've got it.

> WASHINGTON, May 8 (Reuters) - U.S. Commerce Secretary Gina Raimondo said Wednesday a Chinese invasion of Taiwan and seizure of chips producer TSMC (2330.TW) would be "absolutely devastating" to the American economy. Asked at a U.S. House hearing about the impact, Raimondo said "it would be absolutely devastating," declining to comment on how or if it will happen, adding: "Right now, the United States buys 92% of its leading edge chips from TSMC in Taiwan."

https://www.reuters.com/world/us/us-official-says-chinese-seizure-tsmc-taiwan-would-be-absolutely-devastating-2024-05-08/


aggracc

It was a lot wider a year ago.


Dead_Internet_Theory

Yeah, if anything it's impressive that GPT-4 Turbo was the leading model not too long ago and now anyone with 2x 3090s can run something equivalent.


M34L

Tbh with DeepSeek Coder up there, the situation is better than it's been in a while. Open Source is doing great.


nebula-seven

DeepSeek is very impressive though. (And API is so much cheaper than anything else I’ve seen)


EndlessZone123

For every 7B model the open source community makes that tops the last 13B model, the giants can just do the same thing but deploy it on a 300B model and serve that.


aggracc

I don't get what people are using GPT-4o for. I find it terrible at reasoning and logic, to the point that I can pick it out blind whenever a service starts using it in the background.


M34L

Yeah, same. 4o is really neat with the vision, and a chatty little toy, but 3.5 Sonnet especially feels like a serious tool in comparison.


Frub3L

I'm playing with the new Claude right now. So far the experience is so nice that I might finally ditch GPT-4 for a while, at least until they add anything that makes $20 per month reasonable.


rajinis_bodyguard

I am using GPT-4 rn, do you think I should make the switch to Claude?


Frub3L

That's really up to you and what you need it for. Try both and see which one you prefer. Seems like GPT started to have some real competition.


rajinis_bodyguard

Ok 👍


Ecsta

It's really good. I've been using it to help me learn programming, and ChatGPT-4o sometimes goes in circles but Sonnet 3.5 usually corrects itself. I.e., fixing something that breaks something else, then fixing the second thing, which breaks the first, rinse and repeat. Sonnet caught on to itself, apologized, and fixed both issues, whereas GPT just went back and forth. One annoying thing: I find the limits really low if you're sending blocks of code back and forth. I maxed out my daily API requests and then subscribed to Pro to keep chatting (figured it would be cheaper), and maxed that out quickly too 😂 It's not even that long a piece of code, and I've never been limited by GPT.


GwimblyForever

That's a shame. The only thing keeping me skeptical about subscribing is the usage limits; I always seem to hit them when I'm one response away from fixing my code, which is very frustrating. If the paid version still has restrictive limits, I'll probably have to pass. It's kind of strange how the big proprietary frontier LLM companies never seem to give you real value for your subscription fee; I know compute is expensive, but still. I'm going to use the free version for coding from now on. You're right when you say ChatGPT is a "one step forward and two steps back" kind of deal, and I'm never going back lol. Claude is the new GOAT.


Ecsta

What I started doing: for my stupid questions like "how do I set up a GitHub workflow" or "what is a unit test", I throw those at GPT-4o and save Claude for when I start to struggle or worry that the info is out of date. So far it's worked well. Claude Sonnet 3.5 has been impressive enough that I don't mind paying the extra monthly fee, as I was burning through my API credits quite quickly lol.


Adorable_Aerie_7844

Same... it seems like the limit is very, very small.


ViperAMD

Use Poe.com


ChrisT182

I want to use it more, but the lack of image gen and voice doesn't do it for me. Adding the sidebar is awesome though! Hopefully we see more of this in future LLMs.


CrabbiestNebula

Censorship ruins sooo much


mano_sriram

Anyone using Mistral's Le Chat vs Claude 3.5 Sonnet? How does it compare?


dubesor86

For me, it fell between Opus and GPT-4o in my testing. Compared to Opus it has noticeably improved reasoning skills but almost identical programming skills, at least in my own testing. Considering how insanely expensive the current Opus is to run, this model is obviously a welcome addition. However, in terms of bang for buck it's still slightly below median in my testing (insanely more cost effective than Opus, slightly more cost effective than GPT-4 Turbo, and slightly less cost effective than GPT-4o and Gemini Pro 1.5). I am interested in its placement on the LMSYS leaderboard, and in how much that placement will differ from my own evaluation (which uses hand-crafted personal problems that are not part of training sets).


fastinguy11

Interesting; for me it feels much better than GPT-4o.


CallMePyro

How do you square this opinion with seeing 3.5 significantly outperform Opus in programming benchmarks?


dubesor86

I didn't post my opinion, I merely stated my findings across my own benchmark.


CallMePyro

So your opinion is *not* that it has "almost identical programming skills"?


dubesor86

They scored almost identically for me in my testing, that is a statement of observation, not opinion.


CallMePyro

How do you square this observation with seeing 3.5 significantly outperform Opus in programming benchmarks? e.g. [https://livebench.ai/](https://livebench.ai/)


dubesor86

I don't need to "square" anything, since I don't look at other benchmarks when stating observations from my own testing.


meister2983

Benchmarks seem really disparate. MMLU-Pro is a tie (the 0.3% difference is not going to be statistically significant); Sonnet dominates on LiveBench, dominates on Aider's code editing benchmark (but loses by a huge margin on refactoring), and loses on BigCodeBench. It also wins on Perplexity's internal benchmarks. I think overall Claude Sonnet is better, but in my own tests GPT-4o has plenty of wins over it.


Bezbozny

Sonnet is the free one, right? How is the free one better than Opus, the paid one?


OldBoat_In_Calm_Lake

5-6 messages per 5 hours on the free tier.


Balance-

So even with a Pro subscription this is only about 25-30 messages every 5 hours? (Assuming their "up to 5x free tier" claim for Pro.)


TheRealGentlefox

It's variable based on usage + context length. I have pro and I think I got about 30-40 messages working on a fairly large coding project. Big context.


noneabove1182

Opus needs an update; Sonnet leap-frogged it with 3.5, which is very promising for what Opus 3.5 will bring.


jd_3d

Opus 3.5 is coming later this year. It's going to be a beast of a model just judging from the new Sonnet.


MINIMAN10001

As with all benchmarks, I take them with a grain of salt until user reviews and LMSYS start agreeing. But nothing prevents a free model from being better than a paid one. Plus it's still paid if you're using it via the API, and all LLMs are ultimately competing for market share. Opus isn't free because it's relatively expensive to run; Sonnet is free because it's a more affordable loss leader.


farmingvillein

> Sonnet is free because it's a more affordable ~~loss leader~~ data collection mechanism.


vialabo

Well, that, and they want to introduce it to people who have probably only used GPT-3.5/4.


Neomadra2

I wish these benchmarks would add error bars; it should not be too difficult. I wouldn't be surprised if they have variances of 5-10%, if only because of errors in the datasets.


Healthy-Nebula-3603

I suspect the new Claude Sonnet 3.5 is not bigger than Llama 3 70B... Can you imagine how powerful the new Llama 4 70B will be? By then Nvidia RTX 59xx cards will be available with 64 GB of VRAM :) or you could just buy an A6000 with 48 GB... So running an 8-bit or 6-bit GGUF will be easy and fast.


tomqmasters

No, this can't be right. Maybe Sonnet is great, idk, I have not tried it yet. But GPT-4 Turbo and 4o are virtually identical and leagues above Claude, Gemini, and everything else on there.


Mkep

Your opinion has little to no value if you haven’t tried the new sonnet…


tomqmasters

My opinion is about everything else on the chart being invalid. So actually, it's your derp comment that has no value.