some notes about how ggml works using the GPT-2 example #716
chunhualiao started this conversation in Show and tell
It might be useful for this library to have some documentation about how it works internally.
I may have missed such documentation somewhere, so I went ahead and debugged the inference example using the GPT-2 117M model.
I put some initial notes here: https://gist.github.com/chunhualiao/8610c8a3afa3ef76c0174c57ff6e5339
They may be useful for beginners, though they may of course contain errors.
A snapshot of my notes is copied below:
I have taken quite a few machine learning courses and have already done a few projects. I think I know the math formulas involved in transformers and GPT models. However, I have always wondered how they work in practice. For me, the best way to find out is to read and understand the source code that implements these models.
I am mostly a C/C++ programmer and am more comfortable reading C/C++ programs. So I recently started to read, run, and debug ggml's GPT-2 inference example, since ggml is written entirely in C and can run many transformer models on a laptop: https://github.com/ggerganov/ggml/tree/master/examples/gpt-2 . The famous llama.cpp project is closely connected to this library. My experiment environment is a MacBook Pro + Visual Studio Code + CMake + CodeLLDB (gdb does not work with my M2 chip), with the GPT-2 117M model.
Here is what I have learned so far:
The high-level main function (https://github.com/ggerganov/ggml/blob/master/examples/gpt-2/main-backend.cpp) has the structure sketched below.
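Paraphrasing from my debugging session, the flow looks roughly like this. The helper types and functions (gpt_params, gpt_vocab, gpt2_model, gpt2_model_load, gpt_tokenize, gpt2_eval, gpt_sample_top_k_top_p) come from the example itself, but I have simplified their signatures, so treat this as an outline rather than the real code:

```cpp
#include <cstdio>
#include <random>
#include <string>
#include <vector>

// Outline of the gpt-2 example's main(); signatures simplified by me.
int main(int argc, char ** argv) {
    gpt_params params;   // prompt, n_predict, top_k, top_p, temp, n_threads, ...
    gpt_vocab  vocab;
    gpt2_model model;
    std::mt19937 rng(params.seed);

    // 1. Read hyperparameters, vocabulary, and weight tensors from the
    //    ggml model file into the model's ggml context/buffers.
    gpt2_model_load(params.model, model, vocab);

    // 2. Tokenize the prompt into vocabulary ids.
    std::vector<gpt_vocab::id> embd = gpt_tokenize(vocab, params.prompt);

    // 3. Autoregressive loop: build and run the compute graph on the current
    //    tokens, then sample the next token from the returned logits.
    std::vector<float> logits;
    int n_past = 0;
    for (int i = 0; i < params.n_predict; ++i) {
        gpt2_eval(model, params.n_threads, n_past, embd, logits);
        n_past += embd.size();

        gpt_vocab::id id = gpt_sample_top_k_top_p(
            vocab, logits.data(), params.top_k, params.top_p, params.temp, rng);

        printf("%s", vocab.id_to_token[id].c_str());
        embd = { id };  // only feed the new token; the KV cache keeps the past
    }
    return 0;
}
```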
The core computation is done using a compute graph: the code first declares tensors and operations, which only records graph nodes, and a later compute call actually evaluates them (see the sketch below).
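To make this concrete, here is a minimal, self-contained sketch of the pattern, based on my reading of ggml.h (API details may differ between ggml versions):

```cpp
#include "ggml.h"
#include <cstdio>

int main() {
    // One memory arena holds both tensor metadata and tensor data.
    struct ggml_init_params ip = {
        /*.mem_size   =*/ 16*1024*1024,  // 16 MiB
        /*.mem_buffer =*/ NULL,          // let ggml allocate the arena
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(ip);

    // Declaring tensors/ops records graph nodes; nothing is computed yet.
    struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 2, 2);
    struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 2, 2);
    struct ggml_tensor * c = ggml_mul_mat(ctx, a, b);  // note: ggml treats a as transposed

    ggml_set_f32(a, 1.0f);  // fill the inputs with constants
    ggml_set_f32(b, 2.0f);

    // Build the graph backward from the result, then evaluate it.
    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/4);

    printf("c[0][0] = %f\n", ggml_get_f32_1d(c, 0));  // 1*2 + 1*2 = 4

    ggml_free(ctx);
    return 0;
}
```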
ggml provides quite a few tools to dump or visualize the compute graph, which helps with debugging the inference process; see the calls below. https://netron.app/ can also visualize common model files hosted on Hugging Face. I uploaded the Hugging Face GPT-2 model to netron; it is fascinating to view the compute graph of a transformer model.
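For example, continuing from the snippet above (both functions are declared in ggml.h; the .dot file can then be rendered with Graphviz):

```cpp
ggml_graph_print(gf);                        // text summary: one line per node (op, shape, ...)
ggml_graph_dump_dot(gf, NULL, "gpt-2.dot");  // then: dot -Tpng gpt-2.dot -o gpt-2.png
```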
ggml has many other advanced features, including running the computation on GPUs, multi-threaded execution, and so on; a rough sketch of the backend API is below.
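main-backend.cpp uses the ggml-backend API for this. The fragment below (continuing with the gf graph from the earlier snippet) shows the pattern as I understand it from ggml-backend.h; the exact function set varies between ggml versions, so treat the names as assumptions to verify against your checkout:

```cpp
#include "ggml-backend.h"
#ifdef GGML_USE_METAL
#include "ggml-metal.h"    // Apple GPU backend, relevant on my M2 MacBook
#endif

ggml_backend_t backend = NULL;
#ifdef GGML_USE_METAL
backend = ggml_backend_metal_init();            // run the graph on the GPU
#endif
if (!backend) {
    backend = ggml_backend_cpu_init();          // portable fallback
    ggml_backend_cpu_set_n_threads(backend, 8); // multi-threaded CPU compute
}

// (With no_alloc=true at ggml_init time, tensor data would instead live in a
//  backend buffer, e.g. via ggml_backend_alloc_ctx_tensors(ctx, backend).)

ggml_backend_graph_compute(backend, gf);        // same graph, different device
ggml_backend_free(backend);
```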
Even for a small model like GPT-2 117M, the compute graph is quite large (188 leaf nodes + 487 non-leaf nodes). I will need more time to go through the graph to gain a deeper understanding of how all the math formulas of transformers are implemented in a programming language.
I have tremendous respect for the author of ggml/llama.cpp, Georgi Gerganov. What a genius, to pull off projects like this!
-
I am also trying to understand the end-to-end ggml flow. I am stuck at the memory allocation part. It seems like there are a lot of different buffers that need to be allocated apart from the tensor data buffer, and programmers need to take care of the overheads manually in ggml.
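For context, this matches the pattern in the examples: the caller sizes the context arena by hand. A minimal sketch of that accounting (the tensor count below is made up for illustration; ggml_tensor_overhead() and ggml_graph_overhead() are real helpers declared in ggml.h):

```cpp
#include "ggml.h"

struct ggml_context * make_model_ctx() {
    const size_t n_tensors = 150;  // hypothetical: count the model's weight tensors

    const size_t ctx_size =
        n_tensors*ggml_tensor_overhead()  // per-tensor metadata (struct ggml_tensor)
        + ggml_graph_overhead()           // bookkeeping for one compute graph
        + 16*1024*1024;                   // headroom for the tensor data itself

    struct ggml_init_params ip = {
        /*.mem_size   =*/ ctx_size,
        /*.mem_buffer =*/ NULL,   // NULL: ggml allocates the arena internally
        /*.no_alloc   =*/ false,  // true: metadata only; the data then goes into
                                  // a separate buffer (e.g. a backend buffer)
    };
    return ggml_init(ip);
}
```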