Details about the Yi Chat model? #1

Closed
findmyway opened this issue Jan 24, 2024 · 3 comments

Comments

@findmyway

Hello from the Yi team ;)

Based on the evaluation results, it looks like you're using the default Ollama 4-bit quantized version (https://ollama.ai/library/yi:34b-chat), am I right?

Have you ever tried the version before quantization?


It seems we share some common interests in both code LLMs and Julia. Maybe we can collaborate somehow in the future.

@svilupp (Owner) commented Jan 24, 2024

Oh, fun!

Re Yi-34B, you're spot on: it's only the q4_0 quant. I picked many Ollama defaults because I assumed that's what most beginner users pick. I'm not sure why they chose q4_0; in most tests, the gap to Q4_K_M is sizeable.

I can't run FP16 locally, so evaluating the bigger options is waiting on me to set up some cloud GPU pipeline.
The biggest quant I could run locally is probably Q5_K_M.

The three things I would call out, though, are:

  • we have so few samples of evals that being within a few points of the other models means very little
  • sub-50 scores usually mean that it made a lot of syntactic errors (parseable but flawed, e.g., method errors). You can trace most of these to Pythonic syntax (see the snippet below this list)
  • sometimes the errors are on the auto-eval side (e.g., the model emits an extra example which is flawed, so everything fails... I catch a lot of it, but there are still a few failure cases that are not "fair")
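To make the Pythonic-syntax point concrete, the typical failure looks something like this (a made-up illustrative snippet, not an actual eval trace):

```julia
# Illustrative only: Python habits that parse in Julia but fail at runtime.
xs = [1, 2, 3]

# What a model often writes:
# len(xs)            # UndefVarError: `len` not defined
# xs.append(4)       # ErrorException: type Array has no field append
# "value: " + "42"   # MethodError: no method matching +(::String, ::String)

# The Julia equivalents:
length(xs)           # 3
push!(xs, 4)         # [1, 2, 3, 4]
"value: " * "42"     # "value: 42"
```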

The biggest bottleneck I see for OSS models' Julia performance is getting the basics right. That should be easily solvable with a reflexion/fixer loop (sketched below). However, when I tried to build a naive agent loop, it didn't work well. I have a LATS-like agent drafted but haven't had the time to finish it.
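For illustration, the naive fixer loop I tried was conceptually something like this (a rough sketch with a stand-in `generate` function for the model call, not the actual benchmark code):

```julia
# Naive reflexion/fixer loop: generate code, try to run it, and if it throws,
# feed the error back to the model for another attempt.
# `generate(prompt)::String` is a stand-in for any chat-completion call.
function naive_fixer_loop(generate::Function, task::AbstractString; max_attempts::Int = 3)
    prompt = "Write Julia code for the following task:\n" * task
    code = generate(prompt)
    for _ in 1:max_attempts
        error_msg = try
            include_string(Main, code)   # run the candidate code
            return code                  # it parsed and executed -> accept it
        catch err
            sprint(showerror, err)       # capture the error message
        end
        # Reflexion step: show the model its own code and the error it caused.
        prompt = "This Julia code failed:\n\n" * code *
                 "\n\nError: " * error_msg *
                 "\n\nReturn a corrected version of the code."
        code = generate(prompt)
    end
    return code   # give up after max_attempts and return the last attempt
end
```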

That reminds me, I have 4 blog posts drafted that I never published because they are not very exciting/complete:

  • Learnings from building this benchmark (and what's actually hard)
  • What's hard about Julia for OSS models
  • Quantization effect on performance (you'll notice that I did run a few different quants for some models)
  • How I failed to build an agent code-fixer

If you'd be interested in one of them, I could prioritize it.


In terms of collaboration, there could be a few things:

  • if interesting, I can help you explore where Yi goes wrong (I keep the eval traces)
  • if you have the hardware, we could run all Yi quants with enough samples to provide some guidance?
  • what could we do to improve Julia's representation in LLM training data? Is it realistic for model builders to include it (given Julia isn't well represented in the benchmarks everyone cares about)? Should we create a fine-tune for Julia coding?
  • my personal interest is LLM agents (wide range of use cases, Julia DSL), but I haven't had time to build anything yet, because we were missing all the basics
    (personal take: if we don't manage to shift some of the open-source burden and grunt work to LLM agents, as I expect the Python ecosystem will, Julia will become even less relevant, as it will no longer be "productive" to use)

You can find me on Julia Slack :-)

@findmyway (Author)

Thanks for your detailed reply!

> I picked many Ollama defaults because I assumed that's what most beginner users pick.

That makes sense. We received some feedback from developers that the Yi-34B-Chat model is not very good at math and code. I guess things are even worse with the quantized version.

> The biggest bottleneck I see for OSS models' Julia performance is getting the basics right. That should be easily solvable with a reflexion/fixer loop.

Indeed. Based on my experience, it might be even easier to fix with an extra finetuning step.

> • Learnings from building this benchmark (and what's actually hard)
> • Quantization effect on performance (you'll notice that I did run a few different quants for some models)

Personally I'm pretty interested in these two topics. 😄

> • if interesting, I can help you explore where Yi goes wrong (I keep the eval traces)

That would be much appreciated!

> • if you have the hardware, we could run all Yi quants with enough samples to provide some guidance?

Yes, we can provide the computing resources to support such work. I'll discuss the details with you on Slack later.

> • is it realistic for model builders to include it (given Julia isn't well represented in the benchmarks everyone cares about)?

Actually, I'm working on porting the HumanEval dataset with the EvalPlus test cases to Julia (https://github.com/oolong-dev/OolongEval.jl/tree/add_human_eval, still in progress). Hopefully it will provide a different measure for researchers to evaluate LLMs' performance on under-represented programming languages.
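For context, a ported task could look roughly like this in Julia: the prompt is a docstringed signature and the EvalPlus-style checks become plain `@test`s (an illustrative sketch only; the final format in OolongEval.jl may differ):

```julia
using Test

# Roughly mirrors HumanEval/0 ("has_close_elements"); the body is the canonical
# solution, while a model under evaluation would only see the signature and docstring.
"""
    has_close_elements(numbers, threshold) -> Bool

Check whether any two numbers in `numbers` are closer to each other than `threshold`.
"""
function has_close_elements(numbers::Vector{<:Real}, threshold::Real)
    for i in eachindex(numbers), j in eachindex(numbers)
        i != j && abs(numbers[i] - numbers[j]) < threshold && return true
    end
    return false
end

# EvalPlus-style checks: the base examples plus a few extra edge cases.
@testset "has_close_elements" begin
    @test has_close_elements([1.0, 2.0, 3.0], 0.5) == false
    @test has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) == true
    @test has_close_elements(Float64[], 1.0) == false   # empty input
    @test has_close_elements([1.0, 1.0], 0.5) == true   # exact duplicates
end
```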

> • my personal interest is LLM agents (wide range of use cases, Julia DSL), but I haven't had time to build anything yet, because we were missing all the basics

Same here 👋

@svilupp (Owner) commented Jan 24, 2024

Exciting! When it's ready, let's make sure we put it on the awesome list: https://github.com/svilupp/awesome-generative-ai-meets-julia-language
