Details about the Yi Chat model? #1

Closed
findmyway opened this issue Jan 24, 2024 · 3 comments

Comments

@findmyway

Hello from the Yi team ;)

Based on the evaluation results, it looks like you're using the default Ollama 4-bit quantized version (https://ollama.ai/library/yi:34b-chat), am I right?

Have you ever tried the version before quantization?


It seems we share some common interests in both code LLMs and Julia. Maybe we can collaborate somehow in the future.

@svilupp (Owner) commented Jan 24, 2024

Oh, fun!

Re Yi-34B, you're spot on: it's only the q4_0 quant. I picked many Ollama defaults because I assumed that's what most beginner users pick. I'm not sure why they chose q4_0; in most tests, the gap to Q4_K_M is sizeable.

I can't run FP16 locally, so evaluating the bigger options is waiting on me to set up some cloud GPU pipeline.
The biggest quant I could run locally is probably Q5_K_M.

The three things I would call out, though, are:

  • we have so few samples of evals that being within a few points of the other models means very little
  • sub-50 scores usually mean that it made a lot of syntactic errors (parseable but flawed, e.g., method errors). You can trace most of these to Pythonic syntax (see the snippet below this list)
  • sometimes the errors are on the auto-eval side (e.g., the model emits an extra example which is flawed, so everything fails... I catch a lot of it, but there are still a few failure cases that are not "fair")
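To make the Pythonic-syntax point concrete, the typical failure looks something like this (a made-up illustrative snippet, not an actual eval trace):

```julia
# Illustrative only: Python habits that parse in Julia but fail at runtime.
xs = [1, 2, 3]

# What a model often writes:
# len(xs)            # UndefVarError: `len` not defined
# xs.append(4)       # ErrorException: type Array has no field append
# "value: " + "42"   # MethodError: no method matching +(::String, ::String)

# The Julia equivalents:
length(xs)           # 3
push!(xs, 4)         # [1, 2, 3, 4]
"value: " * "42"     # "value: 42"
```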

The biggest bottleneck I see for OSS models' Julia performance is getting the basics right. That should be easily solvable with a reflexion/fixer loop (sketched below). However, when I tried to build a naive agent loop, it didn't work well. I have a LATS-like agent drafted but haven't had the time to finish it.
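For illustration, the naive fixer loop I tried was conceptually something like this (a rough sketch with a stand-in `generate` function for the model call, not the actual benchmark code):

```julia
# Naive reflexion/fixer loop: generate code, try to run it, and if it throws,
# feed the error back to the model for another attempt.
# `generate(prompt)::String` is a stand-in for any chat-completion call.
function naive_fixer_loop(generate::Function, task::AbstractString; max_attempts::Int = 3)
    prompt = "Write Julia code for the following task:\n" * task
    code = generate(prompt)
    for _ in 1:max_attempts
        error_msg = try
            include_string(Main, code)   # run the candidate code
            return code                  # it parsed and executed -> accept it
        catch err
            sprint(showerror, err)       # capture the error message
        end
        # Reflexion step: show the model its own code and the error it caused.
        prompt = "This Julia code failed:\n\n" * code *
                 "\n\nError: " * error_msg *
                 "\n\nReturn a corrected version of the code."
        code = generate(prompt)
    end
    return code   # give up after max_attempts and return the last attempt
end
```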

That reminds me, I have 4 blog posts drafted that I never published because they are not very exciting/complete:

  • Learnings from building this benchmark (and what's actually hard)
  • What's hard about Julia for OSS models
  • Quantization effect on performance (you'll notice that I did run a few different quants for some models)
  • How I failed to build an agent code-fixer

If you'd be interested in one of them, I could prioritize it.


In terms of collaboration, there could be a few things:

  • if interesting, I can help you explore where Yi goes wrong (I keep the eval traces)
  • if you have the hardware, we could run all Yi quants with enough samples to provide some guidance?
  • what could we do to improve Julia's representation in LLM training data? Is it realistic for model builders to include it (given Julia isn't well represented in the benchmarks everyone cares about)? Should we create a fine-tune for Julia coding?
  • my personal interest is LLM agents (wide range of use cases, Julia DSL), but I haven't had time to build anything yet, because we were missing all the basics
    (personal take: if we don't manage to shift some of the open-source burden and grunt work to LLM agents, as I expect the Python ecosystem will, Julia will become even less relevant, as it will no longer be "productive" to use)

You can find me on Julia Slack :-)

@findmyway (Author)

Thanks for your detailed reply!

> I picked many Ollama defaults because I assumed that's what most beginner users pick.

That makes sense. We received some feedback from developers that the Yi-34B-Chat model is not very good at math and code. I guess things are even worse with the quantized version.

> The biggest bottleneck I see for OSS models' Julia performance is getting the basics right. That should be easily solvable with a reflexion/fixer loop.

Indeed. Based on my experience, it might be even easier to fix with an extra finetuning step.

> • Learnings from building this benchmark (and what's actually hard)
> • Quantization effect on performance (you'll notice that I did run a few different quants for some models)

Personally I'm pretty interested in these two topics. 😄

> • if interesting, I can help you explore where Yi goes wrong (I keep the eval traces)

That would be much appreciated!

> • if you have the hardware, we could run all Yi quants with enough samples to provide some guidance?

Yes, we can provide the computing resources to support such work. I'll discuss the details with you on Slack later.

> • is it realistic for model builders to include it (given Julia isn't well represented in the benchmarks everyone cares about)?

Actually, I'm working on porting the HumanEval dataset with the EvalPlus test cases to Julia (https://github.com/oolong-dev/OolongEval.jl/tree/add_human_eval, still in progress). Hopefully it will provide a different measure for researchers to evaluate LLMs' performance on under-represented programming languages.
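For context, a ported task could look roughly like this in Julia: the prompt is a docstringed signature and the EvalPlus-style checks become plain `@test`s (an illustrative sketch only; the final format in OolongEval.jl may differ):

```julia
using Test

# Roughly mirrors HumanEval/0 ("has_close_elements"); the body is the canonical
# solution, while a model under evaluation would only see the signature and docstring.
"""
    has_close_elements(numbers, threshold) -> Bool

Check whether any two numbers in `numbers` are closer to each other than `threshold`.
"""
function has_close_elements(numbers::Vector{<:Real}, threshold::Real)
    for i in eachindex(numbers), j in eachindex(numbers)
        i != j && abs(numbers[i] - numbers[j]) < threshold && return true
    end
    return false
end

# EvalPlus-style checks: the base examples plus a few extra edge cases.
@testset "has_close_elements" begin
    @test has_close_elements([1.0, 2.0, 3.0], 0.5) == false
    @test has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) == true
    @test has_close_elements(Float64[], 1.0) == false   # empty input
    @test has_close_elements([1.0, 1.0], 0.5) == true   # exact duplicates
end
```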

> • my personal interest is LLM agents (wide range of use cases, Julia DSL), but I haven't had time to build anything yet, because we were missing all the basics

Same here 👋

@svilupp (Owner) commented Jan 24, 2024

Exciting! When it's ready, let's make sure we put it on the awesome list: https://github.com/svilupp/awesome-generative-ai-meets-julia-language
