Safe global deployment of AI models requires alignment with human values that vary across cultures. Yet rater pools in safety evaluation datasets remain largely geographically homogeneous, failing to capture geo-cultural differences. Further, it remains unclear whether such differences persist after controlling for demographics such as age, gender, and ethnicity. Through a meta-analysis of safety datasets, we find that most do not report geo-cultural information, and those that do lack a unified methodology to jointly analyze geo-cultural and demographic correlates. Using the Inglehart-Welzel dimensions of cross-cultural variation, we demonstrate via multilevel modeling that cultural zone membership explains variance in safety ratings beyond standard demographics (p<0.05 across 6 datasets). Moreover, our analysis indicates that roughly 10% of items in the datasets we examined are culturally sensitive: likely to be misclassified as safe without adequate cultural representation. We evaluate LLMs as both rater surrogates and triage tools, finding that current LLMs do not reliably stand in for raters, though they can help prioritize culturally sensitive items for human annotation. Our findings motivate more culturally pluralistic safety evaluation and offer practical takeaways to support it.
8
Safety datasets with both demographic and geo-cultural coverage (out of 1000+ search results)
6
Datasets show that geo-cultural values predict safety ratings better than demographics alone
~10%
Items in safety datasets are culturally sensitive: misclassified as safe if geo-cultural diversity is ignored
0.72 F1
Fine-tuned classifier performance on detecting culturally sensitive items.
Figure 1: Geo-Cultural Diversity of Safety Dataset Raters
Each dot represents one country, sized by total number of raters, colored by cultural quadrant on the
Inglehart–Welzel map. Filter by dataset to see its geographic coverage. Cultures with high Traditional +
Self-expression (Q IV) and Secular + Survival (Q II) values are most underrepresented.
Inglehart–Welzel (2006) cultural map. Point size = total raters from that country.
Hover for per-dataset breakdown. Colors = cultural zone.
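The quadrant labels used in the figure can be sketched as a small lookup. This is a minimal illustration, not the authors' code: it assumes both Inglehart–Welzel axes are centred at 0, with positive values meaning secular-rational on the first axis and self-expression on the second. The function names and the zero cut-off are hypothetical.

```python
from collections import Counter


def iw_quadrant(secular_score: float, self_expression_score: float) -> str:
    """Map a pair of Inglehart-Welzel scores to a quadrant label.

    Assumption: scores are centred at 0; >= 0 means secular-rational
    (first axis) or self-expression (second axis).
    """
    secular = secular_score >= 0
    self_expression = self_expression_score >= 0
    if secular and self_expression:
        return "Q I"    # Secular + Self-expression
    if secular:
        return "Q II"   # Secular + Survival
    if not self_expression:
        return "Q III"  # Traditional + Survival
    return "Q IV"       # Traditional + Self-expression


def raters_per_quadrant(rater_scores):
    """Count raters per quadrant from (secular, self_expression) score pairs."""
    return Counter(iw_quadrant(s, e) for s, e in rater_scores)
```

Counting raters per quadrant in this way makes gaps in coverage, such as the underrepresented Q II and Q IV, visible before annotation starts.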
Section 4: Multilevel Modeling Results
Multilevel models test whether geo-cultural background predicts safety annotations beyond rater and item variation, and beyond demographics. Each comparison adds predictors to a simpler model: the base model contains only rater and item random effects, D adds demographics, and CZ adds cultural zones.
ΔAIC: negative = improved model fit. %Δσ²rater: change in unexplained rater variance (more negative = greater reduction).
* = significant after Benjamini–Hochberg correction.
D+CZ vs. D: tests whether adding cultural zones improves fit over demographics alone.
D×CZ vs. D+CZ: tests for a moderation effect (whether culture changes the effect of demographics).
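One way to write the nested models and the two reported quantities is sketched below. The notation is an assumption made for illustration: the text does not state the link function g, so it is left generic. Here y_ij is the rating of rater i on item j, d_i and z_i are the rater's demographic and cultural-zone indicators, and u_i and v_j are the rater and item random effects.

```latex
\begin{align*}
\eta_{ij} &= g\big(\mathbb{E}[y_{ij} \mid u_i, v_j]\big), \qquad
  u_i \sim \mathcal{N}(0, \sigma^2_{\text{rater}}), \quad
  v_j \sim \mathcal{N}(0, \sigma^2_{\text{item}}) \\[4pt]
\text{Base:} \quad \eta_{ij} &= \beta_0 + u_i + v_j \\
\text{D:} \quad \eta_{ij} &= \beta_0 + \mathbf{d}_i^{\top}\boldsymbol{\beta}_{D} + u_i + v_j \\
\text{CZ:} \quad \eta_{ij} &= \beta_0 + \mathbf{z}_i^{\top}\boldsymbol{\beta}_{CZ} + u_i + v_j \\
\text{D+CZ:} \quad \eta_{ij} &= \beta_0 + \mathbf{d}_i^{\top}\boldsymbol{\beta}_{D}
  + \mathbf{z}_i^{\top}\boldsymbol{\beta}_{CZ} + u_i + v_j \\
\text{D}\times\text{CZ:} \quad \eta_{ij} &= \beta_0 + \mathbf{d}_i^{\top}\boldsymbol{\beta}_{D}
  + \mathbf{z}_i^{\top}\boldsymbol{\beta}_{CZ}
  + (\mathbf{d}_i \otimes \mathbf{z}_i)^{\top}\boldsymbol{\beta}_{D \times CZ} + u_i + v_j \\[4pt]
\mathrm{AIC} &= 2k - 2\ln\hat{L}, \qquad
\Delta\mathrm{AIC} = \mathrm{AIC}_{\text{extended}} - \mathrm{AIC}_{\text{reduced}} \\
\%\Delta\sigma^2_{\text{rater}} &= 100 \times
  \frac{\sigma^2_{\text{rater, extended}} - \sigma^2_{\text{rater, reduced}}}
       {\sigma^2_{\text{rater, reduced}}}
\end{align*}
```

With these sign conventions, a negative ΔAIC means the extended model fits better after penalizing its k parameters, and a negative %Δσ²rater means the added predictors absorb variance previously attributed to individual raters.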
ΔAIC by model comparison (more negative ΔAIC = better model fit; each comparison is marked as significant* or not significant):
Demographics vs. Base Model
Does adding annotator demographics improve fit beyond random effects alone?
Cultural Zones vs. Base Model
Does adding annotator cultural zone improve fit beyond random effects alone?
Demographics + Cultural Zones vs. Demographics
Does culture explain variance beyond demographics alone?
Demographics × Cultural Zones vs. Demographics + Cultural Zones
Is there a moderation effect — does the influence of demographics on safety ratings differ across cultural zones?
%Δσ²rater by model comparison (more negative = greater reduction in unexplained rater variance; each comparison is marked as significant* or not significant):
Demographics vs. Base Model
How much rater variance is captured by the raters' demographics, beyond individual and item random effects?
Cultural Zones vs. Base Model
How much rater variance is captured by the raters' cultural zones, beyond individual and item random effects?
Demographics + Cultural Zones vs. Demographics
How much rater variance does culture explain on top of demographics?
Demographics × Cultural Zones vs. Demographics + Cultural Zones
Does allowing demographic effects to vary by cultural zone further reduce unexplained rater variance?
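The significance markers in this section rely on Benjamini–Hochberg correction across comparisons. A minimal sketch of the standard step-up procedure follows; the function name and the default level of 0.05 are illustrative choices, and any p-values passed in are the caller's own, not results from the paper.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True if rejected at FDR level alpha.

    Step-up rule: find the largest rank k (1-based, ascending p-values)
    with p_(k) <= (k / m) * alpha, then reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank

    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected
```

Because the rule is step-up, a p-value that misses its own threshold can still be rejected when a larger p-value meets a later threshold.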
Section 5: Culturally Sensitive Items
A culturally sensitive item (CSI) is one that exactly one cultural quadrant deems unsafe — a false negative
if that quadrant's perspective is excluded from annotation. Each card below is placed in the quadrant
that most strongly flags it as unsafe, with inline bars showing all quadrants' unsafe probability.
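The definition above can be sketched as a simple check over per-quadrant unsafe probabilities. This is an illustration under stated assumptions: a quadrant is taken to "deem" an item unsafe when its unsafe probability exceeds a threshold (0.5 here), and an item needs ratings from at least two quadrants to be assessed. Both parameters and the function name are hypothetical.

```python
def culturally_sensitive_quadrant(unsafe_prob_by_quadrant, threshold=0.5, min_quadrants=2):
    """Return the single quadrant that flags the item as unsafe, else None.

    unsafe_prob_by_quadrant: dict such as {"Q I": 0.2, "Q III": 0.7},
    holding only the quadrants that actually rated the item.
    Assumption: "deems unsafe" means unsafe probability > threshold.
    """
    if len(unsafe_prob_by_quadrant) < min_quadrants:
        return None  # too little coverage to compare quadrants
    flagged = [q for q, p in unsafe_prob_by_quadrant.items() if p > threshold]
    return flagged[0] if len(flagged) == 1 else None
```

An item flagged by two or more quadrants is not culturally sensitive in this sense, since it would still be caught if any one of those quadrants were missing.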
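Table 5: Culturally Sensitive Items by Dataset & Quadrant
| Dataset | Q I | Q II | Q III | Q IV | Total | Rate |
|---|---|---|---|---|---|---|
| DIVE | 27 | 9 | 85 | 2 | 123 | 13.9% |
| DICES-990 | 23 | – | 107 | – | 130 | 13.1% |
| D3 | 126 | 12 | 289 | 58 | 485 | 10.9% |
| CREHate | 33 | 92 | – | 49 | 174 | 11.1% |
| NLPos | 0 | 0 | 1 | 0 | 1 | 11.1%† |
| Severity | 0 | – | 2 | 0 | 2 | 3.0% |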
† NLPos estimate based on only 9 items with multi-quadrant coverage.
Examples of Culturally Sensitive Items
Cards are placed in the quadrant that most strongly flags the item as unsafe.
Bars show unsafe probability per quadrant.
P(CSI) is the empirical probability that the item is culturally sensitive, that is, rated as unsafe by only one quadrant.
All estimates are empirical and based on imperfect data currently available, with low sample sizes per item per quadrant.
Quadrant key:
Q I · Secular + Self-expression (Secular-rational / Self-expression)
Q II · Secular + Survival (Secular-rational / Survival)
Q III · Traditional + Survival (Traditional / Survival)
Q IV · Traditional + Self-expression (Traditional / Self-expression)
Section 6: LLM Experiments: Opportunities & Limits
Can LLMs replace geo-culturally diverse human raters? We test: (1) predicting cultural quadrant safety ratings directly, and (2) identifying culturally sensitive items for prioritized annotation.
Task 1: Predicting Cultural Quadrant Safety Judgments
Four models (DeBERTa-Large, Gemma-3-4B, Gemini-3 Flash, GPT-5 Nano) were tasked with predicting each cultural quadrant's safety rating.
Key result: Models fail to outperform the "Always Unsafe" baseline on D3 (all 4 quadrants),
especially for Q II and Q IV. Scaling model size yielded no conclusive improvement.
⚠️ LLMs cannot reliably substitute for diverse human raters
Fine-tuned LMs and LLM-as-a-Judge setups underperform in emulating judgments from diverse cultural backgrounds, motivating continued pluralistic human annotation.
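To see why an "Always Unsafe" baseline can be hard to beat, the sketch below computes its F1 on the unsafe class. It assumes F1 on the unsafe class is the metric of interest, which the text does not state for this task, and the function names are hypothetical. When a fraction p of labels is unsafe, the baseline has precision p and recall 1, so its F1 is 2p / (1 + p).

```python
def f1_score(y_true, y_pred, positive="unsafe"):
    """F1 for the positive class, computed from true/false positives and negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def always_unsafe_f1(y_true, positive="unsafe"):
    """F1 of a baseline that labels every item as unsafe."""
    return f1_score(y_true, [positive] * len(y_true), positive=positive)
```

The more prevalent the unsafe label is within a quadrant, the closer this baseline gets to 1, which raises the bar a model has to clear.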
Task 2: Identifying Culturally Sensitive Items
Gemma significantly outperforms the majority-class baseline (0.72 F1, p = 0.044); DeBERTa shows a similar trend that does not reach significance (0.71 F1, p = 0.071). Both models perform ~14% worse on culturally sensitive item identification than on standard safe/unsafe classification.
Section 7: Practical Takeaways
Recommendations for future safety data collection to ensure geo-cultural coverage.
📐
Stratify by Cultural Quadrant
Stratify raters not only on age, gender, and ethnicity, but also on cultural quadrant—for example, using the Inglehart–Welzel map.
🗺️
Use Adequate Geographic Proxies for Culture
In the absence of cultural self-identification or value survey data, use country of longest residence or country of birth (or finer regional attributes) as a proxy for cultural background.
📊
Use Multilevel Models
Avoid brittle methods for analyzing rater disagreements; use multilevel models to properly control for variation in raters and items.
🤖
Use LLMs for Triage, Not Replacement
Use a fine-tuned LLM classifier to prioritize culturally sensitive items for human annotation when budget constraints require it.
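The triage recommendation can be sketched as ranking items by a classifier's score and sending the top of the list to human annotators. This is a minimal illustration: the scores are assumed to come from some fine-tuned classifier that estimates how likely an item is to be culturally sensitive, and the function name and budget parameter are hypothetical.

```python
def select_for_annotation(sensitivity_scores, budget):
    """Pick the items most likely to be culturally sensitive.

    sensitivity_scores: dict mapping item id -> classifier score
        (higher = more likely culturally sensitive; assumed input).
    budget: maximum number of items that can be sent to human raters.
    """
    ranked = sorted(sensitivity_scores, key=lambda item: sensitivity_scores[item], reverse=True)
    return ranked[:max(budget, 0)]
```

The classifier only orders the queue here. The safety judgments themselves still come from geo-culturally diverse human raters.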