Here are the t_g and t_p truth vectors for meta-llama/Meta-Llama-3-8B-Instruct (aka llama-3-8b-chat) that I extracted by running the truth_directions code locally; they are meant to be applied at layer 12 of the residual stream.
I've done some initial tests on these and seem to be getting sensible results. For example, looking at t_g cosine similarity on a dataset of captions, I get the lowest scores for "low probability" (or perhaps just highly incongruous) descriptions like "monks playing rock music", "mona lisa smoking a cigar", and "an astronaut dog on martian terrain".
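For reference, here's a minimal sketch of the kind of cosine-similarity check I ran. The vectors and activations below are random placeholders; in practice `t_g` would come from the extracted file and each row of `H` would be the layer-12 residual-stream activation for one caption:

```python
import numpy as np

def cosine_similarity(h, t):
    """Cosine similarity between a hidden-state vector h and a truth direction t."""
    return float(np.dot(h, t) / (np.linalg.norm(h) * np.linalg.norm(t)))

# Hypothetical stand-ins: real activations come from the model's layer-12
# residual stream, and t_g from the truth_directions extraction.
rng = np.random.default_rng(0)
t_g = rng.normal(size=4096)        # Llama-3-8B hidden size is 4096
H = rng.normal(size=(3, 4096))     # one activation vector per caption
captions = ["a dog in a park",
            "monks playing rock music",
            "mona lisa smoking a cigar"]

scores = [cosine_similarity(h, t_g) for h in H]
ranked = sorted(zip(captions, scores), key=lambda cs: cs[1])  # lowest first
```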
I can post other results on this thread including truth vectors for other models and/or can consolidate these into a pull request if there is interest.
Most of the stats seemed reasonable - the only surprise was that the separation scores were lower than expected, but the shape was right and they still peaked around layer 12.
My plan is to examine data that scores high (or low) in cosine similarity to t_g in this model but not in the llama-3-8b-chat model to get a sense of what the 'diff' between these truth vectors might be.
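In case it's useful, a rough sketch of the kind of diff I have in mind, assuming per-model caption activations and truth vectors (all the arrays below are random placeholders, and the helper name is mine):

```python
import numpy as np

def similarity_ranks(H, t):
    """Rank position of each caption when sorted by cosine similarity with t
    (0 = least similar)."""
    t = t / np.linalg.norm(t)
    sims = (H / np.linalg.norm(H, axis=1, keepdims=True)) @ t
    return np.argsort(np.argsort(sims))  # argsort of argsort gives ranks

# Placeholder data: in practice H_base / H_chat would be layer-12 activations
# from each model, and t_g_base / t_g_chat the corresponding truth vectors.
rng = np.random.default_rng(1)
H_base, H_chat = rng.normal(size=(50, 4096)), rng.normal(size=(50, 4096))
t_g_base, t_g_chat = rng.normal(size=4096), rng.normal(size=4096)

# Captions whose rank shifts the most between models are candidates for
# inspecting what the 'diff' between the truth vectors picks up on.
shift = np.abs(similarity_ranks(H_base, t_g_base)
               - similarity_ranks(H_chat, t_g_chat))
interesting = np.argsort(shift)[::-1][:10]  # indices of the 10 biggest movers
```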
Thanks for providing code to replicate the experiments!
Could you also provide the (optimal) truth vectors for the supported models?