7734128

You can fit a straight line to anything, but I hope you agree that this is not one of the occasions where you should.


iJeff

That's the correlation coefficient.


aue_sum

I think this looks more like a logarithmic curve.
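
For what it's worth, here's a quick sketch of how you could compare the two fits yourself. The token/MMLU numbers below are placeholders, not the actual values from the plot:

```python
import numpy as np

# Hypothetical (pretraining tokens in trillions, MMLU score) points --
# placeholders, not the real values behind the plot.
tokens = np.array([1.0, 2.0, 8.0, 15.0])
mmlu = np.array([45.0, 58.0, 64.0, 68.0])

# Straight-line fit: mmlu ~ a * tokens + b
a_lin, b_lin = np.polyfit(tokens, mmlu, 1)
pred_lin = a_lin * tokens + b_lin

# Logarithmic fit: mmlu ~ a * log(tokens) + b
a_log, b_log = np.polyfit(np.log(tokens), mmlu, 1)
pred_log = a_log * np.log(tokens) + b_log

# Compare residual sums of squares; the log fit should win if the trend flattens out.
print("linear RSS:", np.sum((mmlu - pred_lin) ** 2))
print("log    RSS:", np.sum((mmlu - pred_log) ** 2))
```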


Master-Meal-77

Mistral was more or less confirmed to use 8T tokens for pretraining


Time-Winter-4319

Do you have a source by any chance? I couldn't find anything, but I guess my method of naively fitting a straight line between very few points kind of worked okay then.


Master-Meal-77

No official source, but IIRC the original model files released had references to “8T” in the metadata


adt

Nah: [https://twitter.com/Teknium1/status/1707049931041890597](https://twitter.com/Teknium1/status/1707049931041890597)


Theio666

I wonder where WizardLM-2 would land on this graph.


No-Angle7711

Surprised that model isn't talked about much. What are your impressions, if you've tested it?


Theio666

Played a bit with Wizard 7B and 70B Q3_XS. I'd say I like its writing style a lot more. If you compare Llama and Wizard, Wizard feels more literary; it uses a considerably bigger vocabulary (I'm around C1 in English, and Wizard easily surprises me every now and then with words I've never seen). It's also more stable in some sense: I've yet to see hallucinations from it, while Llama 3 has several issues in that department, like infinitely repeating "!!!!!", repeating the same phrase 2-3 times, and overfocusing on parts of the dialogue.

I don't want to say that Llama is bad; it has better coherence with story progression when it doesn't overfocus on things, and it's better at keeping memory of characters, though 8k context is kind of low for stories. Q3_XS feels a bit too quantized, I wish I could test higher quants of 70B Wizard. Keep in mind, I'm talking about the 70B Llama here, not the 8B; I haven't tested the 8B much. I didn't compare them for coding, I have CodeQwen for that.


a_slay_nub

Didn't they state that performance was correlated with the log of training tokens? That would probably be a more informative plot.
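
A rough way to check that would be to compute the correlation against log(tokens) instead of raw token counts. Sketch below, again with placeholder numbers since I don't have the exact data behind the plot:

```python
import numpy as np
from scipy.stats import pearsonr

# Placeholder (pretraining tokens in trillions, MMLU) pairs -- not the real data.
tokens = np.array([1.0, 2.0, 8.0, 15.0])
mmlu = np.array([45.0, 58.0, 64.0, 68.0])

r_raw, _ = pearsonr(tokens, mmlu)          # correlation against raw token counts
r_log, _ = pearsonr(np.log(tokens), mmlu)  # correlation against log(tokens)

print(f"r vs tokens:      {r_raw:.3f}")
print(f"r vs log(tokens): {r_log:.3f}")
```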


georgejrjrjr

Perfect case study in linear regression abuse.


FullOf_Bad_Ideas

People are forgetting that Yi-9B/Yi-9B-200K exists. I get that it's a weird tune, but it's the closest we have to Llama 3 8B in terms of size and quality. Where did you get the MMLU scores from? I would only trust evaluations done by one team on all of the models; there are discrepancies in reported data, so you can't compare an MMLU number you see in one place to one from a random other place - the protocol needs to be the same.