Quantifying the Salience of Geo-Cultural Values for Pluralistic Safety Alignment

Overview: Abstract

Safe global deployment of AI models requires alignment with human values that vary across cultures. Yet rater pools in safety evaluation datasets remain largely geographically homogeneous, failing to capture geo-cultural differences. Further, it remains unclear whether such differences persist after controlling for demographics such as age, gender, and ethnicity. Through a meta-analysis of safety datasets, we find that most do not report geo-cultural information, and those that do lack a unified methodology to jointly analyze geo-cultural and demographic correlates. Using the Inglehart-Welzel dimensions of cross-cultural variation, we demonstrate via multilevel modeling that cultural zone membership explains variance in safety ratings beyond standard demographics (p<0.05 across 6 datasets). Moreover, our analysis indicates that roughly 10% of items in the datasets we examined are culturally sensitive: likely to be misclassified as safe without adequate cultural representation. We evaluate LLMs as both rater surrogates and triage tools, finding that current LLMs do not reliably stand in for raters, though they can help prioritize culturally sensitive items for human annotation. Our findings motivate more culturally pluralistic safety evaluation and offer practical takeaways to support it.

Safety datasets with both demographic and geo-cultural coverage (out of 1000+ search results)

Datasets show that geo-cultural values predict safety ratings better than demographics alone

~10%

Items in safety datasets are culturally sensitive: misclassified as safe if geo-cultural diversity is ignored

72 F1

Fine-tuned classifier performance on detecting culturally sensitive items.

Figure 1: Geo-Cultural Diversity of Safety Dataset Raters

Each dot represents one country, sized by total number of raters, colored by cultural quadrant on the Inglehart–Welzel map. Cultures with high Traditional + Self-expression (Q IV) and Secular + Survival (Q II) values are most underrepresented.


Inglehart–Welzel (2006) cultural map. Point size = total raters from that country. Colors = cultural zone.
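The quadrant labels used on this page can be derived from a position on the two Inglehart–Welzel axes. The following is a minimal sketch, assuming each rater's country has one score on the traditional vs. secular-rational axis and one on the survival vs. self-expression axis, and assuming both axes are split at zero. The cutoff, the function names, and the example scores are illustrative assumptions, not taken from the paper.

```python
from collections import Counter

QUADRANTS = ("Q I", "Q II", "Q III", "Q IV")


def iw_quadrant(secular_score: float, self_expression_score: float) -> str:
    """Map Inglehart-Welzel axis scores to a quadrant label.

    Assumptions (not from the source): positive secular_score means
    secular-rational, negative means traditional; positive
    self_expression_score means self-expression, negative means survival;
    both axes are split at 0.
    """
    secular = secular_score >= 0
    self_expr = self_expression_score >= 0
    if secular and self_expr:
        return "Q I"    # Secular + Self-expression
    if secular:
        return "Q II"   # Secular + Survival
    if not self_expr:
        return "Q III"  # Traditional + Survival
    return "Q IV"       # Traditional + Self-expression


def raters_per_quadrant(rater_scores):
    """Count raters per quadrant; quadrants with no raters are kept at 0."""
    counts = Counter({q: 0 for q in QUADRANTS})
    for secular, self_expr in rater_scores:
        counts[iw_quadrant(secular, self_expr)] += 1
    return dict(counts)


# Hypothetical scores for four raters' countries (not real survey values).
example_pool = [(1.2, 0.8), (0.5, -0.3), (-1.0, -1.1), (-0.7, -0.2)]
coverage = raters_per_quadrant(example_pool)
```

A count of zero for a quadrant, as for Q IV in this made-up pool, is the kind of coverage gap the figure is meant to surface.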

Section 4: Multilevel Modeling Results

Multilevel models test whether geo-cultural background predicts safety annotations beyond rater/item variation and beyond demographics. Negative ΔAIC = better model fit. * = significant after Benjamini–Hochberg correction.

ΔAIC: negative = improved fit. %Δσ²_rater: reduction in unexplained rater variance. * significant after Benjamini–Hochberg correction.
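The two reported quantities can be computed from fitted model summaries. This is a minimal sketch, assuming ΔAIC is the richer model's AIC minus the simpler model's AIC, and that %Δσ²_rater is the relative change in the rater random-effect variance. These sign conventions are inferred from the legend text ("negative = improved fit", "reduction in unexplained rater variance"), and the numbers below are hypothetical.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def delta_aic(ll_full: float, k_full: int, ll_reduced: float, k_reduced: int) -> float:
    """AIC(full) - AIC(reduced); negative means the fuller model fits better."""
    return aic(ll_full, k_full) - aic(ll_reduced, k_reduced)


def pct_delta_rater_variance(var_full: float, var_reduced: float) -> float:
    """Percent change in rater variance when moving from reduced to full model.

    Negative values mean the added predictors absorbed rater variance.
    """
    return 100.0 * (var_full - var_reduced) / var_reduced


# Hypothetical fit summaries: a base model vs. a model with added predictors.
d_aic = delta_aic(ll_full=-990.0, k_full=6, ll_reduced=-1000.0, k_reduced=3)
d_var = pct_delta_rater_variance(var_full=0.75, var_reduced=1.0)
```

In this made-up case the fuller model lowers AIC by 14 and shrinks the rater variance by 25%, which would count as an improvement under both criteria.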

D+CZ vs. D: tests if adding cultural zones improves fit over demographics alone. D×CZ vs. D+CZ: tests for a moderation effect (does culture change the effect of demographics).
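Each comparison yields one p-value per dataset, and significance is reported after Benjamini–Hochberg correction. A minimal sketch of that correction in plain Python follows. The p-values are hypothetical, and the 0.05 level is assumed from the abstract's reported threshold.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return (adjusted_p, is_significant) per input p-value, in input order.

    Adjusted p-values follow the standard step-up rule:
    q_(i) = min over j >= i of p_(j) * m / j, with ranks taken in ascending order.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the running minimum enforces monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return [(q, q <= alpha) for q in adjusted]


# Hypothetical raw p-values for one comparison across four datasets.
results = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

Here only the first comparison stays below 0.05 after correction, even though three raw p-values were below it.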

Legend: significant (*) vs. not significant. More negative ΔAIC = better model fit.

Demographics vs. Base Model

Does adding annotator demographics improve fit beyond random effects alone?

Cultural Zones vs. Base Model

Does adding annotator cultural zone improve fit beyond random effects alone?

Demographics + Cultural Zones vs. Demographics

Does culture explain variance beyond demographics alone?

Demographics × Cultural Zones vs. Demographics + Cultural Zones

Is there a moderation effect — does the influence of demographics on safety ratings differ across cultural zones?

Legend: significant (*) vs. not significant. More negative %Δσ²_rater = greater reduction in unexplained rater variance.

Demographics vs. Base Model

How much rater variance is captured by the raters' demographics, beyond individual and item random effects?

Cultural Zones vs. Base Model

How much rater variance is captured by the raters' cultural zones, beyond individual and item random effects?

Demographics + Cultural Zones vs. Demographics

How much rater variance does culture explain on top of demographics?

Demographics × Cultural Zones vs. Demographics + Cultural Zones

Does allowing demographic effects to vary by cultural zone reduce further unexplained rater variance?

Section 5: Culturally Sensitive Items

A culturally sensitive item (CSI) is one that exactly one cultural quadrant deems unsafe — a false negative if that quadrant's perspective is excluded from annotation. Each card below is placed in the quadrant that most strongly flags it as unsafe, with inline bars showing all quadrants' unsafe probability.
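The definition above can be expressed as a simple rule over per-quadrant unsafe probabilities. This sketch assumes a quadrant "deems" an item unsafe when its unsafe probability exceeds 0.5. That threshold, the function names, and the example values are assumptions for illustration, since the page does not state the cutoff.

```python
def flagging_quadrants(unsafe_prob, threshold=0.5):
    """Quadrants whose unsafe probability exceeds the threshold.

    unsafe_prob maps quadrant label -> probability, or None when the
    quadrant has no raters for this item.
    """
    return [q for q, p in unsafe_prob.items() if p is not None and p > threshold]


def is_culturally_sensitive(unsafe_prob, threshold=0.5):
    """True when exactly one quadrant deems the item unsafe."""
    return len(flagging_quadrants(unsafe_prob, threshold)) == 1


def card_quadrant(unsafe_prob):
    """Quadrant that most strongly flags the item (highest unsafe probability)."""
    covered = {q: p for q, p in unsafe_prob.items() if p is not None}
    return max(covered, key=covered.get)


# Hypothetical item: only Q III raters mostly rate it unsafe; Q II has no raters.
item = {"Q I": 0.2, "Q II": None, "Q III": 0.8, "Q IV": 0.3}
sensitive = is_culturally_sensitive(item)
placement = card_quadrant(item)
```

Dropping Q III raters from this hypothetical pool would leave every remaining quadrant below the threshold, so the item would be labeled safe.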

Table 5: Culturally Sensitive Items by Dataset & Quadrant

Dataset	Q I	Q II	Q III	Q IV	Total	Rate
DIVE	27	9	85	2	123	13.9%
DICES-990	23	–	107	–	130	13.1%
D3	126	12	289	58	485	10.9%
CREHate	33	92	–	49	174	11.1%
NLPos	0	0	1	0	1	11.1%†
Severity	0	–	2	0	2	3.0%

† NLPos estimate based on only 9 items with multi-quadrant coverage.

Examples of Culturally Sensitive Items

Cards are placed in the quadrant that most strongly flags the item as unsafe. Bars show unsafe probability per quadrant. P(CSI) is the empirical probability that the item is culturally sensitive, that is, rated unsafe by only one quadrant. All estimates are empirical and rest on the imperfect data currently available, with small sample sizes per item per quadrant.
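The page does not say how P(CSI) is estimated. One plausible way to obtain such an empirical probability with few raters per quadrant is a bootstrap over raters, sketched below. The resampling scheme, the 0.5 majority threshold, and all names are assumptions rather than the authors' procedure.

```python
import random


def bootstrap_p_csi(ratings_by_quadrant, n_boot=1000, threshold=0.5, seed=0):
    """Estimate the probability that exactly one quadrant flags an item.

    ratings_by_quadrant maps quadrant label -> list of 0/1 ratings
    (1 = unsafe). Raters are resampled with replacement within each
    quadrant; a quadrant flags the item when its resampled unsafe rate
    exceeds the threshold.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_boot):
        n_flagging = 0
        for ratings in ratings_by_quadrant.values():
            if not ratings:
                continue  # quadrant without coverage for this item
            sample = rng.choices(ratings, k=len(ratings))
            if sum(sample) / len(sample) > threshold:
                n_flagging += 1
        if n_flagging == 1:
            hits += 1
    return hits / n_boot


# Hypothetical ratings: Q III raters are split, the other quadrants say safe.
p_csi = bootstrap_p_csi({"Q I": [0, 0, 0], "Q III": [1, 1, 0], "Q IV": [0, 0]})
```

With only two or three raters per quadrant, as in this made-up item, the estimate swings widely with a single rating, which is the small-sample caveat noted above.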


Q II · Secular + Survival (Secular-rational / Survival)

Q I · Secular + Self-expression (Secular-rational / Self-expression)

Q III · Traditional + Survival (Traditional / Survival)

Q IV · Traditional + Self-expression (Traditional / Self-expression)

Section 6: LLM Experiments: Opportunities & Limits

Can LLMs replace geo-culturally diverse human raters? We test: (1) predicting cultural quadrant safety ratings directly, and (2) identifying culturally sensitive items for prioritized annotation.

Task 1: Predicting Cultural Quadrant Safety Judgments

Four models (DeBERTa-Large, Gemma-3-4B, Gemini-3 Flash, GPT-5 Nano) were tasked with predicting each cultural quadrant's safety rating. Key result: Models fail to outperform the "Always Unsafe" baseline on D3 (all 4 quadrants), especially for Q II and Q IV. Scaling model size yielded no conclusive improvement.
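To see why a constant "Always Unsafe" predictor can be a demanding baseline, the sketch below computes its F1 score on the unsafe class. It assumes F1 is the comparison metric for Task 1, which this page reports only for Task 2, and the labels are hypothetical.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def always_unsafe_f1(y_true):
    """F1 of a predictor that labels every item unsafe (1)."""
    return f1_score(y_true, [1] * len(y_true))


# Hypothetical quadrant labels: 3 of 4 items rated unsafe (1 = unsafe).
baseline = always_unsafe_f1([1, 1, 1, 0])
```

For a constant unsafe predictor, recall is 1 and precision equals the unsafe prevalence p, so F1 = 2p / (1 + p). The baseline therefore strengthens as the share of unsafe labels grows. The actual prevalence in D3 is not given here.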

⚠️ LLMs cannot reliably substitute for diverse human raters

Fine-tuned LMs and LLM-as-a-Judge setups underperform in emulating judgments from diverse cultural backgrounds, motivating continued pluralistic human annotation.

Task 2: Identifying Culturally Sensitive Items (Figure 3)

Gemma significantly outperforms the majority-class baseline (0.72 F1, p = 0.044); DeBERTa shows a similar trend (0.71 F1, p = 0.071). Both models degrade ~14% on the sensitive item identification task vs. standard safe/unsafe classification.

Section 7: Practical Takeaways

Recommendations for future safety data collection to ensure geo-cultural coverage.

📐

Stratify by Cultural Quadrant

Stratify raters not only on age, gender, and ethnicity, but also on cultural quadrant—for example, using the Inglehart–Welzel map.

🗺️

Use Adequate Geographic Proxies for Culture

In the absence of cultural self-identification or value survey data, use country of longest residence or country of birth (or finer regional attributes) as a proxy for cultural background.

📊

Use Multilevel Models

Avoid brittle methods for analyzing rater disagreements; use multilevel models to properly control for variation in raters and items.

🤖

Use LLMs for Triage, Not Replacement

Use a fine-tuned LLM classifier to prioritize culturally sensitive items for human annotation when budget constraints require it.
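The triage recommendation can be sketched as a ranking step: score each item with a classifier, then send the highest-scoring items to human raters until the annotation budget runs out. The scores below are hypothetical stand-ins for a fine-tuned classifier's predicted probability of cultural sensitivity, and the function name is illustrative.

```python
def triage(items, scores, budget):
    """Return up to `budget` items, ordered by descending sensitivity score.

    items and scores are parallel sequences; scores are assumed to come
    from a classifier that estimates how likely each item is to be
    culturally sensitive.
    """
    if len(items) != len(scores):
        raise ValueError("items and scores must have the same length")
    if budget < 0:
        raise ValueError("budget must be non-negative")
    ranked = sorted(zip(items, scores), key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in ranked[:budget]]


# Hypothetical item IDs and classifier scores, with budget for two annotations.
to_annotate = triage(["a", "b", "c", "d"], [0.1, 0.9, 0.4, 0.7], budget=2)
```

Items left out of the selection keep their existing labels, so the classifier only decides where scarce annotation effort goes and does not replace the raters.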