diff --git a/app/articles/[slug]/page.tsx b/app/articles/[slug]/page.tsx
index 52f50ee..a3b75fb 100644
--- a/app/articles/[slug]/page.tsx
+++ b/app/articles/[slug]/page.tsx
@@ -196,14 +196,32 @@ export async function generateMetadata({
openGraph: {
title: frontMatter.title,
description: frontMatter.description,
+ url: `https://opensocial.world/articles/${params.slug}`,
type: "article",
publishedTime: frontMatter.publishedDate,
authors: frontMatter.authors?.map((author) => author.name),
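+ // og:image should be an absolute URL: use the article's cover image when set, otherwise fall back to the shared teaser image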
+ images: [
+ {
+ url: frontMatter.image?.url
+ ? `https://opensocial.world${frontMatter.image.url}`
+ : "https://opensocial.world/images/psn/teaser.jpeg",
+ width: 1200,
+ height: 630,
+ alt: frontMatter.image?.alt || frontMatter.title,
+ },
+ ],
},
twitter: {
card: "summary_large_image",
title: frontMatter.title,
description: frontMatter.description,
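+ // Twitter cards likewise need an absolute image URL; reuse the same cover-or-teaser fallback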
+ images: [
+ frontMatter.image?.url
+ ? `https://opensocial.world${frontMatter.image.url}`
+ : "https://opensocial.world/images/psn/teaser.jpeg"
+ ],
},
// Add any article-specific schema.org metadata
alternates: {
diff --git a/content/articles/egonormia.mdx b/content/articles/egonormia.mdx
index 083df0d..78c2059 100644
--- a/content/articles/egonormia.mdx
+++ b/content/articles/egonormia.mdx
@@ -77,7 +77,7 @@ data_url: https://huggingface.co/datasets/open-social-world/EgoNormia
## Introduction
-In the video example, a hiking partner is stuck in the mud; a safety-first norm (keeping one’s distance) conflicts with the cooperative norm to help out. For humans, the right decision seems intuitive. But can Vision-Language Models (VLMs) navigate such dilemmas? Can they understand norms grounded in the physical world and make normative decisions similar to those of humans?
+In the video example, a hiking partner is stuck in the mud; a safety-first norm (keeping one's distance) conflicts with the cooperative norm to help out. For humans, the right decision seems intuitive. But can Vision-Language Models (VLMs) navigate such dilemmas? Can they understand norms grounded in the physical world and make normative decisions similar to those of humans?
@@ -90,7 +90,7 @@ To comprehensively measure VLM normative reasoning ability, we introduce
Unlike similarly visually-grounded spatiotemporal, predictive, or causal reasoning benchmarks ,
-EgoNormia evaluates models’ ability to reason about what should be done under social norms. EgoNormia highlights cases where these norm-related objectives conflict—the richest arena for evaluating normative decision-making.
+EgoNormia evaluates models' ability to reason about what should be done under social norms. EgoNormia highlights cases where these norm-related objectives conflict—the richest arena for evaluating normative decision-making.
@@ -155,9 +155,9 @@ We use a format of Multiple-Choice Questions (MCQs) for our task, including thre
* **Phase II: Answer Generation.** For each video sample, we generate four pairs of actions and justifications—one ground truth pair and three distractor pairs. To create challenging distractors, we systematically perturb the original context by altering key details that influence the interpretation of the action.
-* **Phase II: Filtering.** We perform normativity filtering by using chained LLMs to filter for answer feasibility and sensibility, then run blind filtering (i.e. no vision input) to remove questions answerable without context or through superficial reasoning, leaving only challenging,context-dependent questions.
+* **Phase III: Filtering.** We perform normativity filtering with chained LLMs that check answer feasibility and sensibility, then run blind filtering (i.e., no vision input) to remove questions answerable without context or through superficial reasoning, leaving only challenging, context-dependent questions.
-* **Phase II: Human Validation.** Finally, two human validators are employed to verify the correct behavior and justification, and to select the list of actions that are considered sensible. Two validators are used to ensure every datapoint receives independent agreement from two humans, ensuring that human agreement on EgoNormia is replicable. The authors manually process datapoints where validators disagree on answers, ensuring that the benchmark remains challenging and achieves high human agreement.
+* **Phase IV: Human Validation.** Finally, two human validators verify the correct behavior and justification and select the set of actions considered sensible. Requiring two validators means every datapoint receives independent agreement from two humans, making human agreement on EgoNormia replicable. The authors manually resolve datapoints where the validators disagree, keeping the benchmark challenging while maintaining high human agreement.
Through automatic clustering with GPT-4o, we categorize the final videos into 5 high-level and 23 low-level categories, highlighting the rich diversity of our dataset.
@@ -171,7 +171,7 @@ In evaluation on EgoNormia, m
-To investigate **causes for the limited normative reasoning ability of VLMs (RQ2)**, We further categorize errors in normative reasoning by annotating the models’ full CoT responses on 100 representative tasks of EgoNormia. Four
+To investigate **the causes of the limited normative reasoning ability of VLMs (RQ2)**, we further categorize errors in normative reasoning by annotating the models' full CoT responses on 100 representative tasks of EgoNormia. Four
failure modes were identified: (1) Norm sensibility errors, (2) Norm prioritization errors, (3) Perception errors, and (4) Answer refusal. For models, the majority of failures were due to sensibility errors instead of perception, suggesting that foundation models are competent in processing the visual context of the video inputs but fail in performing sound normative reasoning on the parsed context. Furthermore, the ratio of norm prioritization errors grows as the overall performance increases (GPT-4o < Gemini 1.5 Pro < Human), suggesting more capable models struggle more with determining which norm should take precedence in ambiguous situations.
@@ -199,4 +199,4 @@ Check out the videos, questions, and VLM predictions here.
This research was supported in part by Other Transaction award HR00112490375 from the U.S. Defense Advanced Research Projects Agency (DARPA) Friction for Accountability in Conversational Transactions (FACT) program. We thank Google Cloud Platform and Modal Platform for their credits. We thank feedback from Yonatan Bisk and members of the SALT lab at Stanford University. The authors thank Leena Mathur and Su Li for their help in collecting out-of-domain robotics videos.
-
+
\ No newline at end of file