EXPERIMENTAL_DESIGN.md

Behavioral Decision-Making Profile Tool

Experimental Methodology & Scoring Reference

Version: 1.0
Status: Foundation Draft
Last Updated: February 2026

Overview

This document specifies the complete experimental methodology for the Behavioral Decision-Making Profile Tool. Every task is grounded in published behavioral economics research. Scoring algorithms, benchmark distributions, and archetype definitions are derived from or validated against peer-reviewed datasets.

Four Dimensions Measured:

Risk Preferences — How you trade off certainty against expected value
Time Preferences — How you value present vs. future payoffs
Social Preferences — How you weight others' outcomes relative to your own
Cognitive Biases — How systematic heuristics affect your judgments

Total questions: 30 (presented as 30 distinct screens)
Estimated completion time: 15–20 minutes
Benchmark primary source: Falk et al. (2018), Global Preference Survey, n=80,000+

Section 1: Risk Preferences

Theoretical Background

Risk preferences are typically modeled using Expected Utility Theory with a Constant Relative Risk Aversion (CRRA) utility function: U(x) = x^(1-r) / (1-r), where r is the risk aversion coefficient. Higher r → more risk averse.

The GPS (Global Preference Survey) uses a combination of hypothetical lottery questions and self-reported willingness-to-take-risks items, validated against incentivized experimental measures (Falk et al., 2018).

Key finding from benchmark data: The GPS global mean willingness-to-take-risks is approximately 4.4 on a 0–10 scale (SD ≈ 2.5). Risk aversion is the modal preference globally.

Task R1: Holt-Laury Lottery Choices (5 questions)

Design: Derived from Holt & Laury (2002). Participants choose between a "safe" and "risky" option across 5 decision rows. The crossover point (where someone switches from safe to risky) identifies their risk aversion coefficient.

Why this works: As you move down the rows, the expected value of the risky option increases relative to the safe option. A risk-neutral person switches at row 4. Switching earlier → risk-seeking; switching later → risk-averse.

Adaptation note: Original Holt-Laury uses real monetary stakes (typically $0.10–$3.85 range). Web version uses hypothetical choices with realistic magnitudes. Research shows hypothetical vs. incentivized choices produce similar rank-orderings at individual level (see Holt & Laury 2002, Table 3; Dohmen et al., 2011 for validation of qualitative measures).

Question R1.1

Screen title: "Which would you prefer?"

Prompt: In the following choice, which option would you choose?

Option A (Safer)	Option B (Riskier)
90% chance of $20 / 10% chance of $16	90% chance of $38 / 10% chance of $1

Expected value: Option A = $19.60 / Option B = $34.30

Response format: Binary choice button (Option A / Option B)

Scoring: Choosing A at this row is strongly risk-averse (even risk-seekers usually pick A). Recorded as hl_choice_1 = 0 (A) or 1 (B).

Tooltip text (expandable): "This is called a lottery choice task. We're measuring how you trade off certainty against higher expected payoffs — a core dimension of decision-making studied since the 1940s."

Question R1.2

Prompt: Which would you prefer?

Option A (Safer)	Option B (Riskier)
70% chance of $20 / 30% chance of $16	70% chance of $38 / 30% chance of $1

Expected value: Option A = $18.80 / Option B = $26.90

Response format: Binary choice button
Scoring: hl_choice_2 = 0 (A) or 1 (B)

Question R1.3

Prompt: Which would you prefer?

Option A (Safer)	Option B (Riskier)
50% chance of $20 / 50% chance of $16	50% chance of $38 / 50% chance of $1

Expected value: Option A = $18.00 / Option B = $19.50

Response format: Binary choice button
Scoring: hl_choice_3 = 0 (A) or 1 (B)

Note: This is the crossover point for risk-neutral agents. Expected values are now nearly equal.

Question R1.4

Prompt: Which would you prefer?

Option A (Safer)	Option B (Riskier)
30% chance of $20 / 70% chance of $16	30% chance of $38 / 70% chance of $1

Expected value: Option A = $17.20 / Option B = $12.10

Response format: Binary choice button
Scoring: hl_choice_4 = 0 (A) or 1 (B)

Note: Option B now has lower expected value. Only strongly risk-seeking individuals choose B here.

Question R1.5

Prompt: Which would you prefer?

Option A (Safer)	Option B (Riskier)
10% chance of $20 / 90% chance of $16	10% chance of $38 / 90% chance of $1

Expected value: Option A = $16.40 / Option B = $4.70

Response format: Binary choice button
Scoring: hl_choice_5 = 0 (A) or 1 (B)

Note: Choosing B here reflects extreme risk-seeking or random responding. Flag for quality check if hl_choice_5 = 1 and hl_choice_1 = 0.

Holt-Laury Scoring Algorithm:

Benchmark (Holt & Laury 2002, n=212, students):

~3% chose risky from row 1 (extreme risk-seeking)
~65% switched at rows 4-5 (risk-averse to highly risk-averse)
Median crossover: row 4–5 (slightly risk-averse)

Task R2: Domain-Specific Risk (3 questions)

Design: Adapted from Dohmen et al. (2011) and the GPS qualitative module. Single-item willingness-to-take-risks measures in specific domains. Despite their simplicity, single-item scales predict real-world risk-taking behavior across domains (Dohmen et al., 2011, find r=0.47 correlation with incentivized lottery task).

Response format: Horizontal slider 0–10, labeled "Not at all willing" to "Completely willing"

Question R2.1 — Financial Risk

Prompt: "How willing are you to take risks in financial matters, such as investing?"

financial_risk_score = slider value (0–10)

GPS benchmark (Falk et al. 2018, n=80,000):

Global mean: 3.8 (SD: 2.7)
25th percentile: 2 / Median: 4 / 75th percentile: 6

Question R2.2 — Career Risk

Prompt: "How willing are you to take risks when it comes to career choices, such as starting your own business or changing fields?"

career_risk_score = slider value (0–10)

GPS benchmark:

Global mean: 4.1 (SD: 2.6)
25th percentile: 2 / Median: 4 / 75th percentile: 6

Question R2.3 — General Risk (GPS Primary Item)

Prompt: "In general, how willing are you to take risks?"

general_risk_score = slider value (0–10)

GPS benchmark:

Global mean: 4.4 (SD: 2.5) — this is the primary GPS risk item
25th percentile: 2 / Median: 4 / 75th percentile: 6
US mean: 5.1 (slightly higher than global)

Risk Preferences — Composite Score:

Section 2: Time Preferences

Theoretical Background

Time preferences are modeled using the quasi-hyperbolic (β-δ) discounting model (Laibson, 1997; O'Donoghue & Rabin, 1999):

V(t) = u(c_0) + β Σ_{t=1}^{T} δ^t u(c_t)

δ (delta): Long-run patience. δ close to 1 = very patient.
β (beta): Present bias. β < 1 means you systematically over-weight the present.

A purely exponential discounter has β=1. Present-biased individuals have β<1 — they prefer "now" over "soon" more than they prefer "soon" over "later."

GPS time preference benchmark: Patient people make up roughly 55% of the global sample; present-biased behavior is common but heterogeneous.

Task T1: Intertemporal Choice — Patience (5 questions)

Design: Staircase method adapted from Falk et al. (2018). Each question asks: receive a smaller amount today, or wait 12 months for a larger amount?

Why staircase? Each answer refines the estimate: choosing "wait" triggers a lower immediate amount next; choosing "now" triggers a higher immediate amount next. This efficiently narrows down the indifference point.

Framing: All amounts stated in USD. No compounding — we're measuring pure time preference, not financial sophistication.

Question T1.1 (Starting point)

Prompt: "Imagine you won a prize. You can receive it in one of two ways:"

Option A (Sooner)	Option B (Later)
$100 today	$150 in 12 months

Response format: Binary choice
Scoring: patience_1 = 0 (today) or 1 (later)

If patience_1 = 1 (chose later): Next question offers lower "today" amount → T1.2a
If patience_1 = 0 (chose today): Next question offers higher "today" amount → T1.2b

Question T1.2a (Chose "later" in T1.1 — testing lower bound)

Prompt: Same framing.

Option A (Sooner)	Option B (Later)
$60 today	$150 in 12 months

Scoring: patience_2 = 0 (today) or 1 (later)

Question T1.2b (Chose "today" in T1.1 — testing higher bound)

Prompt: Same framing.

Option A (Sooner)	Option B (Later)
$130 today	$150 in 12 months

Scoring: patience_2 = 0 (today) or 1 (later)

Questions T1.3–T1.5

Continue staircase branching. Final tree (simplified):

              [100 vs 150]
             /             \
        [60 vs 150]      [130 vs 150]
        /        \           /      \
  [30 vs 150] [80 vs 150] [115 vs 150] [140 vs 150]

The indifference point (implicit annual discount rate) is estimated from the final switch.

Discount rate estimation:

GPS benchmark:

Median estimated discount rate: ~15–25% per year (varies by country)
US sample: ~20% median annual discount rate
Top quartile (very patient): <10% annual discount rate

Task T2: Present Bias Detection (2 questions)

Design: Thaler (1981) and Loewenstein & Prelec (1992) classic present-bias test. The same delay (1 month) feels larger when it's "now vs. 1 month" than "11 months vs. 12 months."

Key insight: A time-consistent (exponential) discounter should have the same preference in both frames. A present-biased person prefers immediate gratification more strongly when "now" is on the table.

Question T2.1 — Near-future frame

Prompt: "Which would you prefer?"

Option A: Receive $120 in 1 month
Option B: Receive $150 in 2 months

present_bias_near = 0 (A: sooner) or 1 (B: later)

Question T2.2 — Far-future frame

Prompt: "Which would you prefer?"

Option A: Receive $120 in 7 months
Option B: Receive $150 in 8 months

present_bias_far = 0 (A: sooner) or 1 (B: later)

Present Bias Scoring:

Prevalence benchmark: Approximately 40–50% of individuals show present-biased patterns in laboratory settings (Frederick, Loewenstein & O'Donoghue, 2002 meta-analysis).

Task T3: Patience Self-Report (1 question)

Prompt: "How patient are you in general? For example, when it comes to waiting for things you want?"

patience_self_report = slider 0–10 ("Very impatient" to "Very patient")

GPS benchmark:

Global mean: 4.9 (SD: 2.6)
US mean: 5.6

Time Preferences — Composite Score:

Section 3: Social Preferences

Theoretical Background

Social preferences describe how people weight others' payoffs in their utility function. Following Fehr & Schmidt (1999) and the GPS framework, we measure four distinct components:

Altruism: Unconditional positive concern for others' welfare
Positive reciprocity: Willingness to return kindness (above what self-interest dictates)
Negative reciprocity: Willingness to punish unfairness (even at personal cost)
Trust: Prior belief that others have good intentions

These are distinct constructs — someone can be highly altruistic but low on trust (they give unconditionally but don't expect reciprocation).

Key finding from GPS: Social preferences show substantial cross-national variation. Trust and positive reciprocity are strongly correlated; negative reciprocity is largely independent.

Task S1: Dictator Game — Altruism (1 question)

Design: Standard dictator game (Kahneman, Knetsch & Thaler, 1986; Forsythe et al., 1994). One player (the "dictator") receives a fixed endowment and decides how much, if any, to share with a passive recipient.

Adaptation: Hypothetical framing. Research shows hypothetical dictator game responses correlate r ≈ 0.40 with incentivized choices (Franzen & Pointner, 2013).

Question S1.1

Prompt: "Imagine you have been given $100 unexpectedly. You can keep it all, or give some to a stranger — someone you've never met and will never meet again. How much would you give?"

Response format: Slider $0–$100 (in $5 increments)

dictator_give = dollar amount (0–100)

Altruism score: dictator_give / 100 * 100 (i.e., proportion given, scaled to 0–100)

Benchmark (meta-analysis by Engel, 2011, n=20,813 from 129 studies):

Mean giving: 28.4% (~$28 on $100 endowment)
Modal giving: $0 (≈36% of subjects give nothing)
25th percentile: 10% / Median: 25% / 75th percentile: 50%

Task S2: Reciprocity — Positive (1 question)

Design: GPS qualitative item (Falk et al., 2018). Measures willingness to go beyond self-interest to help someone who helped you.

Question S2.1

Prompt: "How strongly do you agree with the following statement: When someone does me a favor, I am willing to go out of my way to return the favor — even when it's inconvenient for me."

Response format: Slider 0–10 ("Not at all" to "Very strongly")

pos_reciprocity_score = slider value (0–10)

GPS benchmark:

Global mean: 6.8 (SD: 2.3) — positive reciprocity is high globally
25th percentile: 5 / Median: 7 / 75th percentile: 9

Task S3: Reciprocity — Negative (1 question)

Design: GPS qualitative item. Distinct from altruism: measures willingness to punish unfair behavior at personal cost (the "costly punishment" phenomenon from Fehr & Gächter, 2002).

Question S3.1

Prompt: "If someone treats me very unjustly, I will take action to retaliate — even if it comes at a cost to me."

Response format: Slider 0–10 ("Not at all" to "Very strongly")

neg_reciprocity_score = slider value (0–10)

GPS benchmark:

Global mean: 5.0 (SD: 2.6) — more variable than positive reciprocity
25th percentile: 3 / Median: 5 / 75th percentile: 7
US mean: 5.4

Task S4: Trust (2 questions)

Design: Adapted from the World Values Survey trust item (Inglehart et al., 2014) and GPS. Trust predicts economic outcomes at both individual and country level (Knack & Keefer, 1997; Algan & Cahuc, 2010).

Question S4.1 — General Trust

Prompt: "In general, do you assume that most people have good intentions?"

Response format: Slider 0–10 ("No, people are mainly self-interested" to "Yes, most people are trustworthy")

trust_general = slider value (0–10)

Question S4.2 — Trust in Strangers

Prompt: "Imagine you lost your wallet containing $200. How likely is it that a stranger would return it to you intact?"

Response format: Slider 0–10 ("Very unlikely" to "Very likely")

trust_stranger = slider value (0–10)

Note on the wallet question: This is a well-validated behavioral trust elicitation — Cohn et al. (2019) ran a field experiment in 40 countries actually dropping wallets, finding trust self-reports predict return rates at country level (r = 0.65).

GPS Trust benchmark:

Global mean: 4.9 (SD: 2.5)
US mean: 5.8
High-trust countries (Scandinavia): mean ≈ 7.0

Social Preferences — Composite Scoring:

Section 4: Cognitive Biases

Theoretical Background

Unlike the preference dimensions above (which describe stable individual traits), cognitive biases are systematic deviations from rational processing. We measure four experimentally robust biases:

Anchoring — Insufficient adjustment from a salient reference number (Tversky & Kahneman, 1974)
Framing — Preference reversals based on equivalent but differently-framed options (Kahneman & Tversky, 1984)
Overconfidence — Systematic overestimation of one's own knowledge (Lichtenstein & Fischhoff, 1977)
Status Quo Bias — Preference for the default/current option (Samuelson & Zeckhauser, 1988)

Critical design note: Biases require within-person or between-subjects experimental designs. For a web tool, we use between-subjects randomization (users are randomly assigned to conditions) and compare group averages. We cannot tell an individual user "you are anchored" — we can only describe group-level tendencies. This is honestly communicated in results.

Task B1: Anchoring (2 questions)

Design: Classic Tversky & Kahneman (1974) anchoring paradigm. Participants are randomly assigned to HIGH or LOW anchor condition (randomized at session start). The anchor is embedded in the question framing.

Implementation: At session initialization, randomly assign anchor_condition = "HIGH" or "LOW" and store in session. This determines which version of questions B1.1–B1.2 the user sees.

Question B1.1 — UN/Africa Question (Classic Anchoring)

HIGH anchor version:

"The United Nations has 193 member countries. Of all the countries in Africa, what percentage are members of the United Nations?"

LOW anchor version:

"Some people estimate that only a few African countries are UN members. Of all the countries in Africa, what percentage are members of the United Nations?"

Response format: Numeric input, 0–100 (%)

True answer: ~98% (54 of 55 African countries are UN members)

africa_un_estimate = numeric response

Scoring: Distance from true answer reveals anchoring magnitude. Compare HIGH vs. LOW condition means in aggregate data.

Expected group-level effect: HIGH anchor users should estimate higher than LOW anchor users (classic Tversky & Kahneman finding). Effect size d ≈ 0.5–0.8 in most replications.

Question B1.2 — Anchoring in Everyday Judgment

HIGH anchor version:

"The tallest building in New York City (the One World Trade Center) is 1,776 feet tall. How tall do you think the Chrysler Building is, in feet?"

LOW anchor version:

"A typical New York brownstone building is about 35 feet tall. How tall do you think the Chrysler Building is, in feet?"

True answer: 1,046 feet

chrysler_estimate = numeric response

Note on individual-level anchoring scores: We compute a group-level anchoring susceptibility metric from aggregate data. For individual results, we report the direction of their estimate relative to the true answer, framed descriptively.

Task B2: Framing Effect (2 questions)

Design: Asian Disease Problem, adapted from Kahneman & Tversky (1984). The most replicated framing effect in behavioral economics. Participants are randomly assigned to GAIN or LOSS frame (both frames describe identical outcomes).

Between-subjects randomization: frame_condition = "GAIN" or "LOSS" assigned at session start.

Question B2.1 — Disease Outbreak Framing

GAIN frame version:

"Imagine a rare disease is expected to affect 600 people. Two response programs are being considered:

Program A: 200 people will be saved

Program B: 1/3 probability that 600 people will be saved, 2/3 probability that no one is saved"

Which program do you support?

LOSS frame version:

"Imagine a rare disease is expected to affect 600 people. Two response programs are being considered:

Program A: 400 people will die

Program B: 1/3 probability that nobody dies, 2/3 probability that 600 people die"

Which program do you support?

Note: Both frames are objectively identical. Program A = certain outcome of saving 200/losing 400. Program B = same gamble either way.

disease_choice = 0 (Program A / certain) or 1 (Program B / gamble)

Expected framing effect (Kahneman & Tversky 1984 original):

GAIN frame: 72% choose Program A (risk-averse under gains)
LOSS frame: 22% choose Program A (risk-seeking under losses)
Difference = ~50 percentage points

Benchmark (meta-analysis, Kühberger 1998, n=8,000+):

Effect size d ≈ 0.6–0.8 across replications
Effect is robust but smaller in some populations

Question B2.2 — Personal Decision Framing

GAIN frame version:

"You're considering a new investment strategy. Your advisor says: 'This portfolio has a 40% chance of gaining $5,000.' Do you take the investment?"

LOSS frame version:

"You're considering a new investment strategy. Your advisor says: 'This portfolio has a 60% chance of losing nothing, 40% chance of gaining $5,000 net.' Do you take the investment?"

(Mathematically equivalent)

investment_choice = 0 (decline) or 1 (accept)

Framing Score (Group-Level Only):

Task B3: Overconfidence (3 questions)

Design: Calibration task from Lichtenstein & Fischhoff (1977). For each question, users provide both an answer AND a confidence level. Overconfidence = stated confidence systematically exceeds accuracy rate.

Why this is measurable at individual level: Unlike framing (requires between-subjects), overconfidence is measured within-person by comparing stated confidence to actual accuracy across multiple items.

Question B3.1 — Knowledge Calibration: History

Prompt: "In what year was the Eiffel Tower completed?"

Response: Numeric input for year

Then: "How confident are you that your answer is within 5 years of the correct answer?"

eiffel_estimate = year
eiffel_confidence = slider 0–100%
True answer: 1889

Question B3.2 — Knowledge Calibration: Science

Prompt: "How far is the Earth from the Sun, in miles? (Approximate average distance)"

Response: Numeric input

Then: "How confident are you that your answer is within 20% of the correct value?"

sun_distance_estimate = number
sun_confidence = slider 0–100%
True answer: ~93,000,000 miles (accept 74M–112M as "within 20%")

Question B3.3 — Subjective Confidence Interval

Prompt: "Without looking it up: How many bones are in the adult human body? Please give a range you are 90% confident contains the true answer. (There is a 90% chance the true answer falls within your range.)"

Response: Low estimate + High estimate (two inputs)

bones_low = lower bound
bones_high = upper bound
True answer: 206

Calibration scoring for B3.3:

Overconfidence Benchmark:

Lichtenstein & Fischhoff (1977): For general knowledge questions, stated 90% CIs contain true value only ~50% of the time
Moore & Healy (2008) distinguish three types: overestimation, overplacement, overprecision — this task measures overprecision
Expected capture rate for "90% CIs": 40–60% in most populations

Task B4: Status Quo Bias (1 question)

Design: Samuelson & Zeckhauser (1988) status quo bias paradigm. Adapted to a pension/investment framing. A randomly assigned "default" option captures the effect — people systematically over-select the default regardless of its quality.

Between-subjects randomization: default_condition = "A" or "B" assigned at session start.

Question B4.1

DEFAULT A version:

"Your employer is automatically enrolling you in Retirement Plan A. You can change to Plan B if you prefer.

Plan A (your current plan): Moderate growth, moderate risk — estimated annual return 5%, medium volatility

Plan B: Higher growth, higher risk — estimated annual return 8%, higher volatility

Do you keep Plan A or switch to Plan B?"

DEFAULT B version:

"Your employer is automatically enrolling you in Retirement Plan B. You can change to Plan A if you prefer.

Plan A: Moderate growth, moderate risk — estimated annual return 5%, medium volatility

Plan B (your current plan): Higher growth, higher risk — estimated annual return 8%, higher volatility

Do you keep Plan B or switch to Plan A?"

retirement_choice = "A" or "B" + is_default = True/False (derived from condition)

Status Quo Bias Scoring (Group Level):

Section 5: Archetype System

Design Philosophy

Archetypes are user-facing labels that synthesize across dimensions. They must be:

Non-reductive: Each archetype highlights strengths, not deficits
Research-grounded: Derived from real profile clusters, not invented
Memorable: A user should recall their archetype a week later

Methodological basis: Falk et al. (2018) conduct cluster analysis on GPS data and identify distinct preference profiles. Our archetypes adapt their findings to include the cognitive bias dimension.

Archetype Definitions (8 types)

Each user is assigned one archetype based on their risk, patience, and social preference scores. Cognitive biases are reported as overlays, not defining features.

Archetype Descriptions:

Archetype	Risk	Patience	Social	Core Trait
The Prudent Steward	Low	High	High	Disciplined, trustworthy, long-horizon thinking
The Bold Opportunist	High	Low	Low	Decisive, action-oriented, self-reliant
The Generous Guardian	Low	Low	High	Caring, protective, present-focused giving
The Visionary Builder	High	High	High	Strategic, ambitious, community-minded
The Strategic Architect	High	High	Low	Independent, long-game thinker
The Careful Analyst	Low	High	Low	Methodical, reserved, precision-focused
The Collaborative Pragmatist	Medium	Medium	High	Balanced, team-oriented
The Adaptive Realist	Mixed	Mixed	Mixed	Contextual, flexible, situationally driven

Section 6: Question Order & Randomization Protocol

To minimize order effects and carryover biases:

Section 7: Data Quality & Exclusion Criteria

Section 8: Benchmark Sources & Citations

Primary Benchmark

Falk, A., Becker, A., Dohmen, T., Enke, B., Huffman, D., & Sunde, U. (2018). "Global Evidence on Economic Preferences." Quarterly Journal of Economics, 133(4), 1645–1692.

n = 80,000+ respondents across 76 countries
Publicly available through the briq Institute: https://www.briq-institute.org/global-preferences/home
Data access: GPS dataset available for research use upon registration
Variables used: risk, patience, positive reciprocity, negative reciprocity, altruism, trust
Weighting: national representative samples, GPS composite weights

Secondary Benchmarks

Holt, C.A., & Laury, S.K. (2002). "Risk Aversion and Incentive Effects." American Economic Review, 92(5), 1644–1655.

Original lottery choice task methodology
n = 212 student subjects (incentivized)
Used for: Holt-Laury crossover → CRRA mapping

Kahneman, D., & Tversky, A. (1979). "Prospect Theory: An Analysis of Decision under Risk." Econometrica, 47(2), 263–292.

Foundational loss aversion and probability weighting
Used for: theoretical framing of framing effects and risk preferences

Kahneman, D., & Tversky, A. (1984). "Choices, Values, and Frames." American Psychologist, 39(4), 341–350.

Asian Disease Problem original
Used for: framing effect design (B2.1)

Engel, C. (2011). "Dictator games: A meta study." Experimental Economics, 14(4), 583–610.

Meta-analysis of 129 dictator game studies, n=20,813
Used for: altruism benchmarking

Dohmen, T., Falk, A., Huffman, D., Sunde, U., Schupp, J., & Wagner, G.G. (2011). "Individual Risk Attitudes: Measurement, Determinants, and Behavioral Consequences." Journal of the European Economic Association, 9(3), 522–550.

Validation of qualitative risk measures
Used for: GPS risk item validation

Lichtenstein, S., & Fischhoff, B. (1977). "Do those who know more also know more about how much they know?" Organizational Behavior and Human Performance, 20(2), 159–183.

Original calibration/overconfidence paradigm
Used for: B3 overconfidence tasks

Samuelson, W., & Zeckhauser, R. (1988). "Status quo bias in decision making." Journal of Risk and Uncertainty, 1(1), 7–59.

Original status quo bias experiments
Used for: B4 default effect design

Madrian, B.C., & Shea, D.F. (2001). "The Power of Suggestion: Inertia in 401(k) Participation and Savings Behavior." Quarterly Journal of Economics, 116(4), 1149–1187.

Real-world status quo/default effect magnitude
Used for: B4 benchmark effect size

Section 9: Honest Communication Guidelines

These guidelines govern how results are displayed to users. Accuracy about what we can and cannot measure matters both ethically and for scientific credibility.

What we CAN say to individual users:

"Your estimated discount rate is ~22% per year, placing you in the Xth percentile of patience"
"You gave $X in the hypothetical dictator game; the average is $28"
"Your confidence interval captured the true answer / did not capture the true answer"

What we CANNOT say to individual users:

"You are susceptible to anchoring" (requires between-subjects; we only have one data point per user)
"You are susceptible to framing" (same reason)
"You have status quo bias" (same reason)

How to communicate bias results instead:

"You were shown the [HIGH/LOW] anchor version of this question. On average across all users, people shown the high anchor estimate 35% higher than those shown the low anchor — a classic anchoring effect. Your estimate was [X]."

This educates users about the bias without making individual-level claims we can't support.

End of EXPERIMENTAL_DESIGN.md v1.0

Next steps: Workstream C (Database Schema) → Workstream B (UI wireframes) → Full-stack implementation