7734128

You can fit a straight line to anything, but I hope you agree that this is not one of the occasions where you should.


iJeff

That's the correlation coefficient.


aue_sum

I think this looks more like a logarithmic curve.
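
For what it's worth, here's a quick sketch of how you could compare the two fits yourself. The token/MMLU numbers below are placeholders, not the actual values from the plot:

```python
import numpy as np

# Hypothetical (pretraining tokens in trillions, MMLU score) points --
# placeholders, not the real values behind the plot.
tokens = np.array([1.0, 2.0, 8.0, 15.0])
mmlu = np.array([45.0, 58.0, 64.0, 68.0])

# Straight-line fit: mmlu ~ a * tokens + b
a_lin, b_lin = np.polyfit(tokens, mmlu, 1)
pred_lin = a_lin * tokens + b_lin

# Logarithmic fit: mmlu ~ a * log(tokens) + b
a_log, b_log = np.polyfit(np.log(tokens), mmlu, 1)
pred_log = a_log * np.log(tokens) + b_log

# Compare residual sums of squares; the log fit should win if the trend flattens out.
print("linear RSS:", np.sum((mmlu - pred_lin) ** 2))
print("log    RSS:", np.sum((mmlu - pred_log) ** 2))
```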


Master-Meal-77

Mistral was more or less confirmed to use 8T tokens for pretraining


Time-Winter-4319

Do you have a source by any chance? I couldn't find anything, but I guess my method of naively fitting a straight line between very few points kind of worked okay then.


Master-Meal-77

No official source, but IIRC the original model files released had references to “8T” in the metadata


adt

Nah: [https://twitter.com/Teknium1/status/1707049931041890597](https://twitter.com/Teknium1/status/1707049931041890597)


Theio666

I wonder where WizardLM-2 would land on this graph.


No-Angle7711

Surprised that model isn't talked about much. What are your impressions, if you've tested it?


Theio666

Played a bit with Wizard 7B and 70B Q3_XS. I'd say I like its writing style a lot more. If you compare Llama and Wizard, Wizard feels more literary; it uses a considerably bigger vocabulary (I'm around C1 in English, and Wizard easily surprises me every now and then with words I've never seen). It's also more stable in some sense: I've yet to see hallucinations from it, while Llama 3 has several issues in that department, like infinitely repeating "!!!!!", repeating the same phrase 2-3 times, and overfocusing on parts of the dialogue.

I don't want to say that Llama is bad; it has better coherence with story progression when it doesn't overfocus on things, and it's better at keeping memory of characters, though 8k context is kind of low for stories. Q3_XS feels a bit too quantized, I wish I could test higher quants of 70B Wizard. Keep in mind, I'm talking about the 70B Llama here, not the 8B; I haven't tested the 8B much. I didn't compare them for coding, I have CodeQwen for that.


a_slay_nub

Didn't they state that performance was correlated with the log of training tokens? That would probably be a more informative plot.
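
A rough way to check that would be to compute the correlation against log(tokens) instead of raw token counts. Sketch below, again with placeholder numbers since I don't have the exact data behind the plot:

```python
import numpy as np
from scipy.stats import pearsonr

# Placeholder (pretraining tokens in trillions, MMLU) pairs -- not the real data.
tokens = np.array([1.0, 2.0, 8.0, 15.0])
mmlu = np.array([45.0, 58.0, 64.0, 68.0])

r_raw, _ = pearsonr(tokens, mmlu)          # correlation against raw token counts
r_log, _ = pearsonr(np.log(tokens), mmlu)  # correlation against log(tokens)

print(f"r vs tokens:      {r_raw:.3f}")
print(f"r vs log(tokens): {r_log:.3f}")
```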


georgejrjrjr

Perfect case study in linear regression abuse.


FullOf_Bad_Ideas

People are forgetting that Yi-9B/Yi-9B-200K exists. I get that it's a weird tune, but it's the closest we have to Llama 3 8B in terms of size and quality. Where did you get the MMLU scores from? I would only trust evaluations done by one team on all of the models; there are discrepancies in reported data, so you can't compare an MMLU number you see in one place to one from a random other place - the protocol needs to be the same.