Documentation

Linglib.Studies.TsvilodubEtAl2026

[TMS+26] — Act or Clarify? Uncertainty and cost in communication #

When should an agent ask a clarification question (CQ) rather than act under uncertainty? [TMS+26] predict and confirm that CQ rates depend on contextual uncertainty (ε) about the interlocutor's goal and the cost (δ) of safe actions, interacting: uncertainty matters most when acting incorrectly is costly.

The paper's computational model is layered decision theory, not RSA: a decision problem ⟨G, P, R, U⟩ with P(g₂) = ε and exhaustive-answer utility 1 − δ; a behavioral policy π = SoftMax(α · EU) over the goal-marginal expected utility EU(r) = Σ_g P(g)·U(g,r); and a CQ gate P(r_cq) = Logistic(τ · (ExpRegret(r*) − c)), where ExpRegret(r*) is the expected regret of the best action — the expected value of perfect information ([RS61]). This file formalises all three layers, building on Core.DecisionTheory.DecisionProblem; evpi below identifies the paper's ExpRegret with EVPI.

Main statements #

Implementation notes #

Parameter provenance (Bayesian posterior means fitted to Exp 1): δ_L = 0.32, δ_S = 0.11 (large/small option space), ε_H = 0.49, ε_L = 0.17 (high/low uncertainty), τ = 3.60, c = 0.18. The gate-level prediction theorems hold for all τ > 0 and c, so the fitted values matter only through the orderings δ_S < ε_L < δ_L < ε_H ≤ 1/2. In the RSA section exhVal = 1 − δ and the priors 83 : 17 and 51 : 49 are (1 − ε) : ε; α = 1 there is this file's choice (the paper's α has prior N(5, 1)) — the strict-preference theorems are stated for any α > 0.

Expected value of perfect information #

[RS61]'s EVPI over a Core.DecisionTheory.DecisionProblem: the gap between the oracle value (expected utility under perfect information) and the value of acting now. EVPI is the maximum possible questionUtility ([vR03a]) — it equals questionUtility on the identity partition, so any specific clarification question yields at most EVPI; it is the paper's ExpRegret(r*) and the upper bound on [DHH+26]'s VoI.

def TsvilodubEtAl2026.bestUtilityAt {W : Type u_1} {A : Type u_2} (dp : Core.DecisionTheory.DecisionProblem W A) (actions : Finset A) (w : W) :

Maximum utility achievable at world w across actions.

With Finset actions, this is sup' over utilities at world w.

Equations
Instances For
    def TsvilodubEtAl2026.oracleValue {W : Type u_1} {A : Type u_2} [Fintype W] (dp : Core.DecisionTheory.DecisionProblem W A) (actions : Finset A) :

    Oracle value: expected utility under perfect information. Σ_w P(w) · max_a U(w, a)

    Equations
    Instances For
      def TsvilodubEtAl2026.evpi {W : Type u_1} {A : Type u_2} [Fintype W] [DecidableEq W] (dp : Core.DecisionTheory.DecisionProblem W A) (actions : Finset A) :

      Expected value of perfect information (EVPI).

      EVPI = oracleValue − dpValue = Σ_w P(w) · max_a U(w,a) − max_a Σ_w P(w) · U(w,a)

      Equivalently, the expected regret of the current best action.

      [RS61]

      Equations
      Instances For
        theorem TsvilodubEtAl2026.evpi_nonneg {W : Type u_1} {A : Type u_2} [Fintype W] [DecidableEq W] (dp : Core.DecisionTheory.DecisionProblem W A) (actions : Finset A) (hprior : ∀ (w : W), 0 dp.prior w) (hne : actions.Nonempty) :
        0 evpi dp actions

        EVPI is non-negative: acting with perfect information is at least as good as acting without.

        Proof sketch: For each action a, its expected utility EU(a) equals Σ_w P(w) · U(w,a). The oracle value Σ_w P(w) · max_a' U(w,a') is pointwise ≥ Σ_w P(w) · U(w,a) since max_a' U(w,a') ≥ U(w,a). Therefore oracleValue ≥ EU(a) for every a, hence oracleValue ≥ max_a EU(a) = dpValue.

        The decision problem ⟨G, P, R, U⟩ #

        The questioner's latent goal.

        Instances For
          @[implicit_reducible]
          Equations
          def TsvilodubEtAl2026.instReprGoal.repr :
          GoalStd.Format
          Equations
          • One or more equations did not get rendered due to their size.
          Instances For
            @[implicit_reducible]
            Equations
            • One or more equations did not get rendered due to their size.

            Available non-CQ responses: targeted mention-some answers and the safe exhaustive answer.

            Instances For
              @[implicit_reducible]
              Equations
              Equations
              • One or more equations did not get rendered due to their size.
              Instances For
                @[implicit_reducible]
                Equations
                • One or more equations did not get rendered due to their size.

                The paper's parameterized decision problem: P(g₂) = ε (uncertainty), matching mention-some worth 1, mismatching 0, exhaustive 1 − δ (cost).

                Equations
                • One or more equations did not get rendered due to their size.
                Instances For

                  The behavioral policy π = SoftMax(α · EU) #

                  The paper's π softmaxes the goal-marginal expected utility — the agent acts under its own uncertainty. Instantiated as the canonical softmax speaker over the (ε, δ)-indexed state space.

                  noncomputable def TsvilodubEtAl2026.policyScore (α : ) (p : × ) (r : Response) :
                  EReal

                  Score for the canonical speaker: α · EU(r) at condition (ε, δ).

                  Equations
                  Instances For
                    noncomputable def TsvilodubEtAl2026.policy (α : ) (ε δ : ) :

                    The paper's behavioral policy π(r) = SoftMax(α · EU(r)).

                    Equations
                    Instances For
                      theorem TsvilodubEtAl2026.policy_prefers_exh_of_uncertain {α : } ( : 0 < α) {ε δ : } (h₁ : δ < ε) (h₂ : ε 1 / 2) :
                      (policy α ε δ) Response.ms1 < (policy α ε δ) Response.exh (policy α ε δ) Response.ms2 < (policy α ε δ) Response.exh

                      The uncertainty flip: once uncertainty exceeds the exhaustive cost (δ < ε ≤ 1/2), the safe exhaustive answer beats both targeted answers — the paper's "act safe under uncertainty" effect (fitted: δ_L = 0.32 < ε_H = 0.49).

                      theorem TsvilodubEtAl2026.policy_prefers_ms1_of_confident {α : } ( : 0 < α) {ε δ : } (h₁ : ε < δ) (h₂ : ε < 1 / 2) :
                      (policy α ε δ) Response.exh < (policy α ε δ) Response.ms1 (policy α ε δ) Response.ms2 < (policy α ε δ) Response.ms1

                      Under low uncertainty (ε < δ, ε < 1/2) the matching mention-some wins (fitted: ε_L = 0.17 < δ_L = 0.32).

                      Expected regret of the best action: EVPI = min ε δ #

                      theorem TsvilodubEtAl2026.evpi_eq_min {ε δ : } ( : ε 1 / 2) (hδ0 : 0 δ) :
                      evpi (dp ε δ) Finset.univ = min ε δ

                      The interaction in closed form: for the paper's decision problem, the expected regret of the best action — ExpRegret(r*), i.e. the expected value of perfect information ([RS61], evpi) — is min ε δ. Regret is bounded by the uncertainty and by the cost of the safe action, so uncertainty raises regret only while it stays below the cost: the paper's headline claim that uncertainty matters most when acting incorrectly is costly.

                      The CQ gate P(r_cq) = Logistic(τ · (ExpRegret(r*) − c)) #

                      noncomputable def TsvilodubEtAl2026.cqGate (τ c x : ) :

                      The logistic CQ gate over the regret signal x.

                      Equations
                      Instances For
                        theorem TsvilodubEtAl2026.cqGate_pos (τ c x : ) :
                        0 < cqGate τ c x
                        theorem TsvilodubEtAl2026.cqGate_le_one (τ c x : ) :
                        cqGate τ c x 1
                        theorem TsvilodubEtAl2026.cqGate_strictMono {τ : } ( : 0 < τ) (c : ) :
                        StrictMono (cqGate τ c)

                        The gate is strictly increasing in expected regret (for τ > 0): more to lose from acting → more clarification.

                        noncomputable def TsvilodubEtAl2026.cqProb (τ c : ) (ε δ : ) :

                        The paper's CQ probability at condition (ε, δ): the logistic gate applied to the expected regret of the best action.

                        Equations
                        Instances For

                          Gate-level Experiment 1 predictions #

                          Fitted condition parameters (Bayesian posterior means): ε_L = 0.17, ε_H = 0.49 (low/high uncertainty), δ_S = 0.11, δ_L = 0.32 (small/large option space). The predictions hold for every τ > 0 and threshold c; only the orderings δ_S < ε_L < δ_L and ε_L < ε_H ≤ 1/2 matter.

                          Equations
                          Instances For
                            Equations
                            Instances For
                              Equations
                              Instances For
                                Equations
                                Instances For
                                  theorem TsvilodubEtAl2026.tl_justAsk {τ : } ( : 0 < τ) (c : ) :

                                  TL;JustAsk (prediction 1): with a large option space, more clarification under high uncertainty.

                                  NoNeedToAsk (prediction 2): with a small option space, no difference in clarification between high and low uncertainty — an exact equality: the regret signal is capped at δ_S in both conditions.

                                  theorem TsvilodubEtAl2026.worthAsking {τ : } ( : 0 < τ) (c : ) {ε : } ( : ε 1 / 2) (h : deltaSmall < ε) :

                                  WorthAsking (the cost effect; Exp 2's prediction 7 at the model level): clarification is more likely when the safe action is costlier, at either uncertainty level. The model is fitted to Exp 1, where δ is its only cost knob.

                                  The interaction (the paper's headline): uncertainty raises clarification when the safe action is costly (large option space) and has no effect when it is cheap (small option space).

                                  The shared decision-rule instance: a soft gate #

                                  The logistic gate instantiates DongEtAl2026.ClarifyRule as a soft threshold, against [DHH+26]'s sharp DongEtAl2026.sharpRule: it is never binary — at zero net value it clarifies with probability 1/2 (softGateRule_apply_zero, vs DongEtAl2026.sharpRule_binary). Cf. the paper's Exp 2 contrast between binarized CQ rates and gradient action rates.

                                  noncomputable def TsvilodubEtAl2026.softGateRule {τ : } ( : 0 < τ) :

                                  The logistic gate as a ClarifyRule over the net regret signal.

                                  Equations
                                  Instances For
                                    theorem TsvilodubEtAl2026.cqProb_eq_softGateRule {τ : } ( : 0 < τ) (c : ) (ε δ : ) :
                                    cqProb τ c ε δ = (softGateRule ).propensity ((evpi (dp ε δ) Finset.univ) - c)

                                    cqProb is the soft rule applied to the net signal ExpRegret − c.

                                    theorem TsvilodubEtAl2026.softGateRule_apply_zero {τ : } ( : 0 < τ) :
                                    (softGateRule ).propensity 0 = 1 / 2

                                    Unlike the sharp rule, the soft gate is never binary: at zero net value it clarifies with probability 1/2.

                                    The layered reaction policy #

                                    A reaction: clarify, or commit to a direct response.

                                    Instances For
                                      Equations
                                      • One or more equations did not get rendered due to their size.
                                      Instances For
                                        noncomputable def TsvilodubEtAl2026.layered (q : NNReal) (hq : q 1) (pol : PMF Response) :

                                        The paper's layered mixture: clarify with probability q, otherwise act according to the behavioral policy.

                                        Equations
                                        Instances For
                                          theorem TsvilodubEtAl2026.layered_apply_cq (q : NNReal) (hq : q 1) (pol : PMF Response) :
                                          (layered q hq pol) Reaction.cq = q
                                          theorem TsvilodubEtAl2026.layered_apply_act (q : NNReal) (hq : q 1) (pol : PMF Response) (r : Response) :
                                          (layered q hq pol) (Reaction.act r) = (1 - q) * pol r
                                          noncomputable def TsvilodubEtAl2026.reaction (τ c α : ) (ε δ : ) :

                                          The full reaction policy at condition (ε, δ): gate by the logistic of the expected regret, then act by the softmax policy.

                                          Equations
                                          Instances For

                                            RSA reinterpretation: the post-clarification speaker and a listener #

                                            Everything below is this file's extension, not the paper's model. The goal-conditioned speaker — softmax of U(g, ·) with inapplicable responses gated to — is the ε → 0 / post-clarification limit of π, and coincides with [HTB+25]'s action-utility respondent R₁ at β = 1, w_c = 0 (their responseTruth gate). Inverting it with a Bayesian listener connects the model to the canonical RSA pipeline; the paper itself has no listener.

                                            def TsvilodubEtAl2026.actVal (exhVal : ) :
                                            ResponseGoal

                                            Action value U(g, r) as a real: 1 for matching mention-some, exhVal = 1 − δ for exhaustive, 0 for a mismatch.

                                            Equations
                                            Instances For
                                              noncomputable def TsvilodubEtAl2026.util (exhVal α : ) :
                                              GoalResponseEReal

                                              Goal-conditioned speaker utility: α · U(g, r) on applicable responses, (softmax weight 0) otherwise.

                                              Equations
                                              Instances For
                                                noncomputable def TsvilodubEtAl2026.speaker (exhVal α : ) :
                                                GoalPMF Response

                                                The goal-conditioned (post-clarification) speaker.

                                                Equations
                                                Instances For
                                                  theorem TsvilodubEtAl2026.S1_g1_prefers_ms1 {exhVal α : } ( : 0 < α) (hv : exhVal < 1) :

                                                  Knowing the goal, the speaker prefers the targeted answer over the exhaustive one whenever the exhaustive answer carries any cost (exhVal < 1), for any α > 0.

                                                  The speaker never produces a mismatching answer: its utility is .

                                                  The cross-condition δ-sensitivity of the exhaustive answer (S1(exh | g₁, δ_S) > S1(exh | g₁, δ_L): exhaustive answers are more viable when the option space is small) is Exp 1's option-space cost effect (predictions 3–4; the credible negative effect of space size on exhaustive rates). It is a cross-model comparison outside the single-distribution S1_prefers_iff vocabulary; at the policy level the same effect is eu_exh monotone in δ.

                                                  noncomputable def TsvilodubEtAl2026.priorWeight (pg1 pg2 : ) :
                                                  GoalENNReal

                                                  The questioner's prior over goals, (1 − ε) : ε, normalised.

                                                  Equations
                                                  Instances For
                                                    noncomputable def TsvilodubEtAl2026.worldPrior (pg1 pg2 : ) (h1 : 0 < pg1) (_h2 : 0 < pg2) :
                                                    PMF Goal

                                                    The world prior, normalised to a PMF.

                                                    Equations
                                                    Instances For
                                                      noncomputable def TsvilodubEtAl2026.listener (exhVal pg1 pg2 α : ) (h1 : 0 < pg1) (h2 : 0 < pg2) (r : Response) (h : PMF.marginal (speaker exhVal α) (worldPrior pg1 pg2 h1 h2) r 0) :
                                                      PMF Goal

                                                      The questioner as Bayesian listener (this file's extension): the posterior over goals given the response.

                                                      Equations
                                                      Instances For
                                                        theorem TsvilodubEtAl2026.L1_ms1_infers_g1 :
                                                        (listener (68 / 100) 83 17 1 Response.ms1 ) Goal.g₂ < (listener (68 / 100) 83 17 1 Response.ms1 ) Goal.g₁

                                                        The listener hearing ms1 infers g₁ with certainty (ms1 is never produced for g₂). Prior 83 : 17 = (1 − ε_L) : ε_L.

                                                        theorem TsvilodubEtAl2026.L1_high_ms1_still_certain :
                                                        (listener (68 / 100) 51 49 1 Response.ms1 ) Goal.g₂ < (listener (68 / 100) 51 49 1 Response.ms1 ) Goal.g₁

                                                        Targeted responses stay fully informative even at high uncertainty (prior 51 : 49 = (1 − ε_H) : ε_H).

                                                        theorem TsvilodubEtAl2026.L1_exh_transmits_prior :
                                                        (listener (68 / 100) 83 17 1 Response.exh ) Goal.g₂ < (listener (68 / 100) 83 17 1 Response.exh ) Goal.g₁

                                                        The exhaustive answer transmits the prior: S1(exh | ·) is goal-symmetric, so the listener's posterior given exh is the prior, 83 : 17.