[TMS+26] — Act or Clarify? Uncertainty and cost in communication #
When should an agent ask a clarification question (CQ) rather than act under uncertainty? [TMS+26] predict and confirm that CQ rates depend on contextual uncertainty (ε) about the interlocutor's goal and the cost (δ) of safe actions, interacting: uncertainty matters most when acting incorrectly is costly.
The paper's computational model is layered decision theory, not RSA: a
decision problem ⟨G, P, R, U⟩ with P(g₂) = ε and exhaustive-answer
utility 1 − δ; a behavioral policy π = SoftMax(α · EU) over the
goal-marginal expected utility EU(r) = Σ_g P(g)·U(g,r); and a CQ gate
P(r_cq) = Logistic(τ · (ExpRegret(r*) − c)), where ExpRegret(r*) is the
expected regret of the best action — the expected value of perfect
information ([RS61]). This file formalises all three
layers, building on Core.DecisionTheory.DecisionProblem; evpi below
identifies the paper's ExpRegret with EVPI.
Main statements #
evpi_eq_min— for the paper's decision problem the expected regret of the best action has the closed formmin ε δ: the uncertainty/cost interaction in one equation.policy_prefers_exh_of_uncertain/policy_prefers_ms1_of_confident— the behavioral policy flips from the matching mention-some to the safe exhaustive answer as uncertainty crosses cost (δ < ε).tl_justAsk,noNeedToAsk,worthAsking,uncertainty_matters_most_when_costly— the gate-level predictions, for everyτ > 0and thresholdc(noNeedToAskis an exact equality, the paper's null prediction).reaction— the layered mixture: clarify with the gate probability, otherwise act bypolicy.L1_exh_transmits_prioretc. — an RSA reinterpretation (this file's extension, not the paper's model): the goal-conditioned speaker — the ε → 0 / post-clarification limit ofπ, [HTB+25]'s action-utility respondent R₁ at β = 1, w_c = 0 — inverted by a Bayesian listener. The paper's model contains no listener.
Implementation notes #
Parameter provenance (Bayesian posterior means fitted to Exp 1): δ_L = 0.32,
δ_S = 0.11 (large/small option space), ε_H = 0.49, ε_L = 0.17 (high/low
uncertainty), τ = 3.60, c = 0.18. The gate-level prediction theorems hold
for all τ > 0 and c, so the fitted values matter only through the
orderings δ_S < ε_L < δ_L < ε_H ≤ 1/2. In the RSA section exhVal = 1 − δ
and the priors 83 : 17 and 51 : 49 are (1 − ε) : ε; α = 1 there is this
file's choice (the paper's α has prior N(5, 1)) — the strict-preference
theorems are stated for any α > 0.
Expected value of perfect information #
[RS61]'s EVPI over a Core.DecisionTheory.DecisionProblem:
the gap between the oracle value (expected utility under perfect
information) and the value of acting now. EVPI is the maximum possible
questionUtility ([vR03a]) — it equals questionUtility on the
identity partition, so any specific clarification question yields at most
EVPI; it is the paper's ExpRegret(r*) and the upper bound on
[DHH+26]'s VoI.
Maximum utility achievable at world w across actions.
With Finset actions, this is sup' over utilities at world w.
Equations
- TsvilodubEtAl2026.bestUtilityAt dp actions w = if h : actions.Nonempty then actions.sup' h (dp.utility w) else 0
Instances For
Oracle value: expected utility under perfect information.
Σ_w P(w) · max_a U(w, a)
Equations
- TsvilodubEtAl2026.oracleValue dp actions = ∑ w : W, dp.prior w * TsvilodubEtAl2026.bestUtilityAt dp actions w
Instances For
Expected value of perfect information (EVPI).
EVPI = oracleValue − dpValue = Σ_w P(w) · max_a U(w,a) − max_a Σ_w P(w) · U(w,a)
Equivalently, the expected regret of the current best action.
Equations
- TsvilodubEtAl2026.evpi dp actions = TsvilodubEtAl2026.oracleValue dp actions - Core.DecisionTheory.dpValue dp actions
Instances For
EVPI is non-negative: acting with perfect information is at least as good as acting without.
Proof sketch: For each action a, its expected utility EU(a) equals
Σ_w P(w) · U(w,a). The oracle value Σ_w P(w) · max_a' U(w,a')
is pointwise ≥ Σ_w P(w) · U(w,a) since max_a' U(w,a') ≥ U(w,a).
Therefore oracleValue ≥ EU(a) for every a, hence
oracleValue ≥ max_a EU(a) = dpValue.
The decision problem ⟨G, P, R, U⟩ #
The questioner's latent goal.
Instances For
Equations
- TsvilodubEtAl2026.instDecidableEqGoal x✝ y✝ = if h : x✝.ctorIdx = y✝.ctorIdx then isTrue ⋯ else isFalse ⋯
Equations
- TsvilodubEtAl2026.instReprGoal = { reprPrec := TsvilodubEtAl2026.instReprGoal.repr }
Equations
- One or more equations did not get rendered due to their size.
Instances For
Equations
- One or more equations did not get rendered due to their size.
Equations
- TsvilodubEtAl2026.instDecidableEqResponse x✝ y✝ = if h : x✝.ctorIdx = y✝.ctorIdx then isTrue ⋯ else isFalse ⋯
Equations
- TsvilodubEtAl2026.instReprResponse = { reprPrec := TsvilodubEtAl2026.instReprResponse.repr }
Equations
- One or more equations did not get rendered due to their size.
Instances For
Equations
- One or more equations did not get rendered due to their size.
The paper's parameterized decision problem: P(g₂) = ε (uncertainty),
matching mention-some worth 1, mismatching 0, exhaustive 1 − δ (cost).
Equations
- One or more equations did not get rendered due to their size.
Instances For
The behavioral policy π = SoftMax(α · EU) #
The paper's π softmaxes the goal-marginal expected utility — the agent
acts under its own uncertainty. Instantiated as the canonical softmax
speaker over the (ε, δ)-indexed state space.
Score for the canonical speaker: α · EU(r) at condition (ε, δ).
Equations
- TsvilodubEtAl2026.policyScore α p r = ↑(α * ↑(Core.DecisionTheory.expectedUtility (TsvilodubEtAl2026.dp p.1 p.2) r))
Instances For
The paper's behavioral policy π(r) = SoftMax(α · EU(r)).
Equations
- TsvilodubEtAl2026.policy α ε δ = RSA.Canonical.S1 (TsvilodubEtAl2026.policyScore α) (ε, δ)
Instances For
The uncertainty flip: once uncertainty exceeds the exhaustive cost
(δ < ε ≤ 1/2), the safe exhaustive answer beats both targeted answers —
the paper's "act safe under uncertainty" effect (fitted:
δ_L = 0.32 < ε_H = 0.49).
Under low uncertainty (ε < δ, ε < 1/2) the matching mention-some
wins (fitted: ε_L = 0.17 < δ_L = 0.32).
Expected regret of the best action: EVPI = min ε δ #
The interaction in closed form: for the paper's decision problem,
the expected regret of the best action — ExpRegret(r*), i.e. the expected
value of perfect information ([RS61], evpi) — is
min ε δ. Regret is bounded by the
uncertainty and by the cost of the safe action, so uncertainty raises
regret only while it stays below the cost: the paper's headline claim that
uncertainty matters most when acting incorrectly is costly.
The CQ gate P(r_cq) = Logistic(τ · (ExpRegret(r*) − c)) #
The logistic CQ gate over the regret signal x.
Equations
- TsvilodubEtAl2026.cqGate τ c x = (1 + Real.exp (-(τ * (x - c))))⁻¹
Instances For
The gate is strictly increasing in expected regret (for τ > 0):
more to lose from acting → more clarification.
The paper's CQ probability at condition (ε, δ): the logistic gate
applied to the expected regret of the best action.
Equations
- TsvilodubEtAl2026.cqProb τ c ε δ = TsvilodubEtAl2026.cqGate τ c ↑(TsvilodubEtAl2026.evpi (TsvilodubEtAl2026.dp ε δ) Finset.univ)
Instances For
Gate-level Experiment 1 predictions #
Fitted condition parameters (Bayesian posterior means): ε_L = 0.17,
ε_H = 0.49 (low/high uncertainty), δ_S = 0.11, δ_L = 0.32 (small/large
option space). The predictions hold for every τ > 0 and threshold c;
only the orderings δ_S < ε_L < δ_L and ε_L < ε_H ≤ 1/2 matter.
Equations
- TsvilodubEtAl2026.deltaSmall = 11 / 100
Instances For
Equations
- TsvilodubEtAl2026.deltaLarge = 32 / 100
Instances For
TL;JustAsk (prediction 1): with a large option space, more clarification under high uncertainty.
NoNeedToAsk (prediction 2): with a small option space, no
difference in clarification between high and low uncertainty — an exact
equality: the regret signal is capped at δ_S in both conditions.
WorthAsking (the cost effect; Exp 2's prediction 7 at the model level): clarification is more likely when the safe action is costlier, at either uncertainty level. The model is fitted to Exp 1, where δ is its only cost knob.
The interaction (the paper's headline): uncertainty raises clarification when the safe action is costly (large option space) and has no effect when it is cheap (small option space).
The shared decision-rule instance: a soft gate #
The logistic gate instantiates DongEtAl2026.ClarifyRule as a soft
threshold, against [DHH+26]'s sharp DongEtAl2026.sharpRule: it
is never binary — at zero net value it clarifies with probability 1/2
(softGateRule_apply_zero, vs DongEtAl2026.sharpRule_binary). Cf. the
paper's Exp 2 contrast between binarized CQ rates and gradient action
rates.
The logistic gate as a ClarifyRule over the net regret signal.
Equations
- TsvilodubEtAl2026.softGateRule hτ = { propensity := TsvilodubEtAl2026.cqGate τ 0, mono := ⋯, nonneg := ⋯, le_one := ⋯ }
Instances For
cqProb is the soft rule applied to the net signal ExpRegret − c.
Unlike the sharp rule, the soft gate is never binary: at zero net value it clarifies with probability 1/2.
The layered reaction policy #
Equations
- TsvilodubEtAl2026.instDecidableEqReaction.decEq TsvilodubEtAl2026.Reaction.cq TsvilodubEtAl2026.Reaction.cq = isTrue ⋯
- TsvilodubEtAl2026.instDecidableEqReaction.decEq TsvilodubEtAl2026.Reaction.cq (TsvilodubEtAl2026.Reaction.act r) = isFalse ⋯
- TsvilodubEtAl2026.instDecidableEqReaction.decEq (TsvilodubEtAl2026.Reaction.act r) TsvilodubEtAl2026.Reaction.cq = isFalse ⋯
- TsvilodubEtAl2026.instDecidableEqReaction.decEq (TsvilodubEtAl2026.Reaction.act a) (TsvilodubEtAl2026.Reaction.act b) = if h : a = b then h ▸ isTrue ⋯ else isFalse ⋯
Instances For
Equations
- One or more equations did not get rendered due to their size.
Instances For
Equations
- TsvilodubEtAl2026.instReprReaction = { reprPrec := TsvilodubEtAl2026.instReprReaction.repr }
The paper's layered mixture: clarify with probability q, otherwise
act according to the behavioral policy.
Equations
- TsvilodubEtAl2026.layered q hq pol = (PMF.bernoulli q hq).bind fun (ask : Bool) => if ask = true then PMF.pure TsvilodubEtAl2026.Reaction.cq else PMF.map TsvilodubEtAl2026.Reaction.act pol
Instances For
The full reaction policy at condition (ε, δ): gate by the logistic of
the expected regret, then act by the softmax policy.
Equations
- TsvilodubEtAl2026.reaction τ c α ε δ = TsvilodubEtAl2026.layered (TsvilodubEtAl2026.cqProb τ c ε δ).toNNReal ⋯ (TsvilodubEtAl2026.policy α ε δ)
Instances For
RSA reinterpretation: the post-clarification speaker and a listener #
Everything below is this file's extension, not the paper's model. The
goal-conditioned speaker — softmax of U(g, ·) with inapplicable
responses gated to ⊥ — is the ε → 0 / post-clarification limit of π,
and coincides with [HTB+25]'s action-utility respondent R₁ at
β = 1, w_c = 0 (their responseTruth gate). Inverting it with a Bayesian
listener connects the model to the canonical RSA pipeline; the paper itself
has no listener.
Does response r address goal g? ([HTB+25]'s responseTruth.)
Equations
- TsvilodubEtAl2026.respApplies TsvilodubEtAl2026.Response.ms1 TsvilodubEtAl2026.Goal.g₁ = true
- TsvilodubEtAl2026.respApplies TsvilodubEtAl2026.Response.ms2 TsvilodubEtAl2026.Goal.g₂ = true
- TsvilodubEtAl2026.respApplies TsvilodubEtAl2026.Response.exh x✝ = true
- TsvilodubEtAl2026.respApplies x✝¹ x✝ = false
Instances For
Action value U(g, r) as a real: 1 for matching mention-some,
exhVal = 1 − δ for exhaustive, 0 for a mismatch.
Equations
- TsvilodubEtAl2026.actVal exhVal TsvilodubEtAl2026.Response.ms1 TsvilodubEtAl2026.Goal.g₁ = 1
- TsvilodubEtAl2026.actVal exhVal TsvilodubEtAl2026.Response.ms2 TsvilodubEtAl2026.Goal.g₂ = 1
- TsvilodubEtAl2026.actVal exhVal TsvilodubEtAl2026.Response.exh x✝ = exhVal
- TsvilodubEtAl2026.actVal exhVal x✝¹ x✝ = 0
Instances For
Goal-conditioned speaker utility: α · U(g, r) on applicable
responses, ⊥ (softmax weight 0) otherwise.
Equations
- TsvilodubEtAl2026.util exhVal α g r = if TsvilodubEtAl2026.respApplies r g = true then ↑(α * TsvilodubEtAl2026.actVal exhVal r g) else ⊥
Instances For
The goal-conditioned (post-clarification) speaker.
Equations
- TsvilodubEtAl2026.speaker exhVal α = RSA.Canonical.S1 (TsvilodubEtAl2026.util exhVal α)
Instances For
Knowing the goal, the speaker prefers the targeted answer over the
exhaustive one whenever the exhaustive answer carries any cost
(exhVal < 1), for any α > 0.
The speaker never produces a mismatching answer: its utility is ⊥.
The cross-condition δ-sensitivity of the exhaustive answer
(S1(exh | g₁, δ_S) > S1(exh | g₁, δ_L): exhaustive answers are more
viable when the option space is small) is Exp 1's option-space cost effect
(predictions 3–4; the credible negative effect of space size on exhaustive
rates). It is a cross-model comparison outside the single-distribution
S1_prefers_iff vocabulary; at the policy level the same effect is eu_exh
monotone in δ.
The questioner's prior over goals, (1 − ε) : ε, normalised.
Equations
- TsvilodubEtAl2026.priorWeight pg1 pg2 TsvilodubEtAl2026.Goal.g₁ = ENNReal.ofReal pg1
- TsvilodubEtAl2026.priorWeight pg1 pg2 TsvilodubEtAl2026.Goal.g₂ = ENNReal.ofReal pg2
Instances For
The world prior, normalised to a PMF.
Equations
- TsvilodubEtAl2026.worldPrior pg1 pg2 h1 _h2 = PMF.normalize (TsvilodubEtAl2026.priorWeight pg1 pg2) ⋯ ⋯
Instances For
The questioner as Bayesian listener (this file's extension): the posterior over goals given the response.
Equations
- TsvilodubEtAl2026.listener exhVal pg1 pg2 α h1 h2 r h = PMF.posterior (TsvilodubEtAl2026.speaker exhVal α) (TsvilodubEtAl2026.worldPrior pg1 pg2 h1 h2) r h
Instances For
The listener hearing ms1 infers g₁ with certainty (ms1 is never produced for g₂). Prior 83 : 17 = (1 − ε_L) : ε_L.
Targeted responses stay fully informative even at high uncertainty (prior 51 : 49 = (1 − ε_H) : ε_H).
The exhaustive answer transmits the prior: S1(exh | ·) is
goal-symmetric, so the listener's posterior given exh is the prior,
83 : 17.