Time consumption? #2

Open
ZifengDing opened this issue Sep 12, 2023 · 2 comments

@ZifengDing

Thanks for the impressive work.
I was also trying to fine-tune LLaMA-7B for KG link prediction on my own dataset.
I was using the Hugging Face trainer and the fine-tuning took a huge amount of time.
Since the training time is not reported in your paper, may I ask how long it took to train an entity prediction model?
This matters because traditional KG completion models can achieve good performance at a much lower training cost.
If LLMs cannot outperform them even with a much larger time budget, they would not be very practical.

Please correct me if I have misunderstood anything.
Cheers

@ZifengDing
Author

Another question:
I noticed that you only report Hits@1. In your evaluation you check whether the ground-truth missing entity appears in the generated text. I am not sure whether this is over-optimistic, since LLMs sometimes hallucinate, and the model may not actually be treating the missing ground truth as its predicted answer even though the output text contains that entity (e.g., the LLM outputs "I am not sure which one is the answer, but I think the ground truth should be either {ground truth} or {another entity}").
Another problem is that it seems impossible to compute Hits@3, Hits@10, and mean reciprocal rank (MRR) in your framework. It is not a big problem, but do you have any idea how we could incorporate these metrics into your work?
For example, this work (https://arxiv.org/pdf/2305.10613.pdf) simply treats entities and relations as plain numbers and ranks those numbers. I also doubt how that works, because no code was released. Since models like LLaMA use a BPE tokenizer that splits each digit into a separate token, it is not straightforward to rank the scores of multi-digit numbers, e.g., 15 or 234.
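One workaround for the digit-splitting issue might be to score each candidate ID by the summed log-probability of all its tokens under the model, rather than comparing single-token logits. Below is a minimal sketch of that idea; it is only my own assumption of how such ranking could be done, not code from either paper, and the model name, prompt format, and candidate list are illustrative placeholders.

```python
# Sketch: rank candidate entity IDs by summed token log-probability,
# which works even when BPE splits a number like "234" into several tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "huggyllama/llama-7b"  # hypothetical choice for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def candidate_logprob(prompt: str, candidate: str) -> float:
    """Sum of log P(candidate tokens | prompt, previous candidate tokens)."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    cand_ids = tokenizer(candidate, add_special_tokens=False,
                         return_tensors="pt").input_ids.to(model.device)
    input_ids = torch.cat([prompt_ids, cand_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Positions prompt_len-1 .. seq_len-2 predict the candidate tokens.
    log_probs = torch.log_softmax(logits[0, prompt_ids.shape[1] - 1 : -1], dim=-1)
    return log_probs.gather(1, cand_ids[0].unsqueeze(1)).sum().item()

# Made-up query and candidate IDs, purely for illustration.
prompt = "Query: (entity 15, hypernym, ?). Answer with the entity id: "
candidates = ["15", "234", "7"]
ranked = sorted(candidates, key=lambda c: candidate_logprob(prompt, c), reverse=True)
print(ranked)
```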

@yao8839836
Owner

@ZifengDing

Hi, thank you for your insightful comments on our work.

It took 3h15min for LLaMA-7B to train on WN18RR and 38h29min on YAGO3-10 with an A100 GPU.
It took 4h7min for LLaMA-13B to train on WN18RR and 49h53min on YAGO3-10 with an A100 GPU.

In our experiments, we found that the original LLaMA and ChatGLM, as well as ChatGPT
and GPT-4, answer like your examples, while the fine-tuned LLaMA and ChatGLM give concise and exact answers that can be easily evaluated, as shown in Table 6. In addition, we manually label the answers as correct or wrong for FB13-100 and YAGO3-10-100.

For Hits@3, Hits@10, and MRR, maybe we can design an effective prompt to obtain 3 or 10 ranked answers, e.g., "Please give three/ten possible entities, with the more reliable answers listed first", and we can add a few few-shot examples to strengthen the prompt.
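Once such a prompt yields a ranked list of entities per query, the metrics themselves are straightforward. A minimal sketch, assuming the LLM output has already been parsed into a ranked list elsewhere (the parsing step and the toy data below are my own placeholders, not part of this repo):

```python
# Compute Hits@k and MRR from (ranked predictions, ground truth) pairs.
def hits_and_mrr(predictions, ks=(1, 3, 10)):
    """predictions: list of (ranked_entity_list, ground_truth_entity) pairs."""
    hits = {k: 0 for k in ks}
    rr_sum = 0.0
    for ranked, gold in predictions:
        if gold in ranked:
            rank = ranked.index(gold) + 1  # 1-based rank of the gold entity
            rr_sum += 1.0 / rank
            for k in ks:
                if rank <= k:
                    hits[k] += 1
    n = len(predictions)
    results = {f"Hits@{k}": hits[k] / n for k in ks}
    results["MRR"] = rr_sum / n
    return results

# Toy usage with made-up outputs:
preds = [(["cat", "dog", "mouse"], "dog"), (["Paris", "London"], "Berlin")]
print(hits_and_mrr(preds))
```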
