release truth vectors? #3

Open
dribnet opened this issue Jan 20, 2025 · 2 comments

Comments

dribnet commented Jan 20, 2025

Thanks for providing code to replicate the experiments!

Could you also provide the (optimal) truth vectors for the supported models?

dribnet commented Jan 24, 2025

Here are the t_g and t_p truth vectors for meta-llama/Meta-Llama-3-8B-Instruct (aka llama-3-8b-chat) I extracted by running the truth_directions code locally; these are meant to be used on layer 12 of the residual stream.

truth_vectors.npz.zip
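For anyone who wants to try the attachment, here is a minimal sketch of loading the vectors with NumPy. It assumes the archive has already been unzipped to a plain `.npz` file and that the arrays are stored under the keys `t_g` and `t_p`; both the key names and the helper `load_truth_vectors` are hypothetical, so check `np.load(path).files` for the real names.

```python
import numpy as np


def load_truth_vectors(npz_path, keys=("t_g", "t_p")):
    """Load named vectors from an .npz archive and L2-normalise each one.

    The default key names are an assumption about how the file was saved.
    """
    vectors = {}
    with np.load(npz_path) as archive:
        for key in keys:
            if key not in archive.files:
                raise KeyError(f"{key!r} not in archive; found {archive.files}")
            vec = np.asarray(archive[key], dtype=np.float64).ravel()
            norm = np.linalg.norm(vec)
            if norm == 0:
                raise ValueError(f"{key!r} is a zero vector")
            # Unit length makes later dot products directly comparable.
            vectors[key] = vec / norm
    return vectors
```

Normalising up front is a convenience choice, not something the original code is known to do; drop it if the raw magnitudes matter for your use.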

Image

I've done some initial tests on these and seem to be getting sensible results. For example, when I look at t_g cosine similarity over a dataset of captions, the lowest scores go to "low probability" (or perhaps just highly incongruous) descriptions such as "monks playing rock music", "mona lisa smoking a cigar", and "an astronaut dog on martian terrain".
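The caption test above can be sketched roughly as follows. This assumes you already have one layer-12 residual-stream vector per caption stacked into an array (how those activations are extracted and pooled is not shown, and any centering done by the original code is not reproduced). The function names are hypothetical.

```python
import numpy as np


def cosine_to_direction(activations, direction):
    """Cosine similarity between each row of `activations` and `direction`."""
    acts = np.asarray(activations, dtype=np.float64)
    d = np.asarray(direction, dtype=np.float64)
    d = d / np.linalg.norm(d)
    row_norms = np.linalg.norm(acts, axis=1)
    # Clip to avoid dividing by zero for an all-zero activation row.
    return (acts @ d) / np.clip(row_norms, 1e-12, None)


def lowest_scoring(captions, scores, k=3):
    """Return the k captions with the smallest scores, lowest first."""
    order = np.argsort(scores)[:k]
    return [(captions[i], float(scores[i])) for i in order]
```

Calling `lowest_scoring(captions, cosine_to_direction(acts, t_g))` would then list the captions least aligned with t_g.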

I can post other results on this thread including truth vectors for other models and/or can consolidate these into a pull request if there is interest.

@dribnet
Copy link
Author

dribnet commented Jan 25, 2025

Out of curiosity, I generated truth vectors for the new DeepSeek-R1-Distill-Llama-8B model (again for layer 12).

truth_vectors.npz.zip

Image

Most of the stats seemed reasonable. The only surprise was that the separation scores were lower, but the shape was right and they still peaked around layer 12.

Image

My plan is to examine data that scores high (or low) in cosine similarity to t_g in this model but not in the llama-3-8b-chat model to get a sense of what the 'diff' between these truth vectors might be.
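One possible way to do that comparison, as a sketch only: score the same items with each model's t_g, standardise the scores per model, and rank items by the gap. The z-scoring step is my own assumption, added because the raw score scales of the two models may differ; `largest_gaps` is a hypothetical helper.

```python
import numpy as np


def zscore(values):
    """Standardise scores to zero mean and unit variance."""
    x = np.asarray(values, dtype=np.float64)
    std = x.std()
    return (x - x.mean()) / std if std > 0 else np.zeros_like(x)


def largest_gaps(items, scores_a, scores_b, k=5):
    """Return the k items whose standardised scores differ most (a minus b)."""
    gap = zscore(scores_a) - zscore(scores_b)
    # Sort by absolute gap, largest first; the sign says which model scored higher.
    order = np.argsort(-np.abs(gap), kind="stable")[:k]
    return [(items[i], float(gap[i])) for i in order]
```

A positive gap means the item sits relatively higher along t_g in model a than in model b, and a negative gap means the reverse.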
