Large Language Models for Sequential Decision-Making: Improving In-Context Learning via Supervised Fine-Tuning

Minmin Zhang, Sina Aghaei, Soroush Saghafian

Rank: #1554 of 4847 · cs.LG
Tournament score: 1447±33
Win rate: 70% (23 wins, 10 losses, 33 matches)
Rating: 4.5/10

Abstract

Large language models (LLMs) have shown remarkable in-context learning (ICL) capabilities, yet their potential for sequential decision-making remains underexplored. In this paper, we study the ICL capabilities of LLMs in sequential decision-making settings, including Markov Decision Processes (MDPs), Partially Observable MDPs (POMDPs), and Ambiguous POMDPs (APOMDPs). We fine-tune pretrained LLMs to perform few-shot decision-making directly from offline, oracle-labeled trajectories. Our framework enables flexible imitation of policies through supervised fine-tuning (SFT). Theoretically, we focus on linear MDPs and interpret a fine-tuned attention layer as implicitly estimating optimal Q-functions from in-context data. Building on this interpretation, we derive an end-to-end suboptimality bound for the induced policy that separates the in-context estimation error from the training-length bias. Empirically, across synthetic MDP, POMDP, and APOMDP settings, we find that fine-tuned LLMs achieve substantially smaller optimality gaps than in-context-only and random baselines, with especially large gains in longer-horizon, partially observed, and model-ambiguous environments. Together, these results show that supervised fine-tuning provides an effective route to endowing pretrained LLMs with sequential decision-making capabilities from offline data, which is an important advantage in domains such as healthcare where offline data are abundant.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper proposes a framework for adapting pretrained LLMs to sequential decision-making tasks (MDPs, POMDPs, and APOMDPs) via supervised fine-tuning with QLoRA on oracle-labeled offline trajectories. The key idea is that rather than training transformers from scratch (as in Decision Transformer or DPT), leveraging pretrained LLM representations and fine-tuning them on serialized trajectories enables better few-shot in-context decision-making at test time. The paper offers both theoretical analysis (for linear MDPs with a single linear self-attention layer) and empirical evaluation across three increasingly complex settings.
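
To make "serialized trajectories" concrete, the following is a minimal sketch of how an oracle-labeled example might be turned into a prompt/target pair for SFT. The text format, the function names `serialize_step` and `build_sft_example`, and the integer state/action encoding are all hypothetical; the paper's actual template is not reproduced in this assessment.

```python
from typing import List, Tuple

# (state, action, reward) triples; the integer encoding is an assumption.
Step = Tuple[int, int, float]

def serialize_step(t: int, step: Step) -> str:
    """Render one transition as a line of text (hypothetical format)."""
    s, a, r = step
    return f"t={t} state={s} action={a} reward={r:.2f}"

def build_sft_example(support: List[List[Step]], query_state: int,
                      oracle_action: int) -> Tuple[str, str]:
    """Return (prompt, target): support trajectories form the context,
    and the oracle's action at the query state is the supervised label."""
    lines = []
    for i, traj in enumerate(support):
        lines.append(f"# trajectory {i}")
        lines.extend(serialize_step(t, step) for t, step in enumerate(traj))
    lines.append(f"query state={query_state} action=")
    return "\n".join(lines), str(oracle_action)

# Made-up data, for illustration only.
support = [[(3, 1, 0.5), (4, 0, 1.0)], [(2, 2, 0.0)]]
prompt, target = build_sft_example(support, query_state=4, oracle_action=0)
```

At test time the same prompt layout would be used without the target, so the fine-tuned model completes the action token from the in-context trajectories.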

The novelty lies in the combination of three elements: (1) fine-tuning pretrained LLMs rather than training from scratch, (2) extending evaluation to POMDPs and APOMDPs (which incorporate partial observability and model ambiguity), and (3) providing a theoretical suboptimality bound that decomposes into in-context estimation error and training-length bias.

Methodological Rigor

Theoretical analysis: The theoretical contribution builds on the linear regression in-context learning framework of Zhang et al. (2024). The authors interpret a trained linear self-attention layer as implicitly computing a shrinkage-corrected estimator of optimal Q-function parameters, then lift this to an end-to-end policy suboptimality bound (Proposition 1). The analysis is technically sound but relies on several strong assumptions: linear MDPs, a single linear self-attention layer, fixed feature covariance across tasks, noiseless labels, and a Gaussian feature model. The "on-policy query calibration" assumption (Eq. 7) is particularly notable—it essentially assumes the prediction error transfers from the training distribution to on-policy queries, which is the very covariate shift problem the authors acknowledge but sidestep. The gap between the theoretical model (single LSA layer on linear MDPs) and the empirical model (full Llama-2-7B on tabular MDPs/POMDPs/APOMDPs) is substantial, limiting the theory's explanatory power for the observed empirical results.
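
The "shrinkage-corrected estimator" reading can be illustrated numerically. The sketch below assumes the closed form associated with trained linear self-attention in the in-context linear regression setting, namely a prediction of the form x_q^T Γ^{-1} (1/M) Σ y_i x_i with Γ = (1 + 1/N) Λ + (tr(Λ)/N) I, where N is the training prompt length and M the test-time context size. That form, the function name `lsa_predict`, and all numbers are assumptions for illustration; this is not the paper's Q-function construction.

```python
import numpy as np

def lsa_predict(X, y, x_query, Lam, n_train):
    """Prediction under the assumed closed form for a trained linear
    self-attention layer: x_q^T Gamma^{-1} (1/M) sum_i y_i x_i."""
    d = Lam.shape[0]
    gamma = (1.0 + 1.0 / n_train) * Lam + (np.trace(Lam) / n_train) * np.eye(d)
    moment = X.T @ y / X.shape[0]           # (1/M) sum_i y_i x_i
    w_hat = np.linalg.solve(gamma, moment)  # implicit parameter estimate
    return x_query @ w_hat, w_hat

rng = np.random.default_rng(0)
d, M = 3, 20000
Lam = np.eye(d)                          # fixed feature covariance (assumption)
w_true = np.array([1.0, -2.0, 0.5])      # stands in for Q-function parameters
X = rng.multivariate_normal(np.zeros(d), Lam, size=M)
y = X @ w_true                           # noiseless labels
x_q = np.array([1.0, 1.0, 1.0])

# Same large context, two different training prompt lengths N.
_, w_short = lsa_predict(X, y, x_q, Lam, n_train=10)
_, w_long = lsa_predict(X, y, x_q, Lam, n_train=10**6)
err_short = np.linalg.norm(w_short - w_true)  # dominated by training-length bias
err_long = np.linalg.norm(w_long - w_true)    # dominated by in-context sampling error
```

With a large context, the remaining error for small N comes from the shrinkage built into Γ, which is one way to picture the separation between in-context estimation error and training-length bias.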

Empirical evaluation: The experiments use a synthetic "energy-management" task with discrete states and actions. While the MDP/POMDP/APOMDP hierarchy provides a nice progression, the task is relatively simple (E+1=10 states, 3 actions). The evaluation metric (optimality gap) is sensible. However, there are several concerns:

  • The baselines are weak: only a random policy, ICL-only (no fine-tuning), and DPT are included, with no comparison against standard offline RL or meta-RL methods.
  • The ICL-only baseline uses the same pretrained LLM without fine-tuning, but since the base Llama-2 was never trained on RL trajectory data, it's unsurprising it performs poorly.
  • Confidence intervals are sometimes wide, particularly for APOMDP results.
  • The Darkroom experiment is a welcome addition but limited (only 20 test goals).
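
The optimality-gap metric and the concern about wide intervals can be sketched as follows. This assumes the gap is the oracle return minus the policy return on each test task and uses a normal-approximation 95% interval; the paper's exact definition and interval construction are not given in this assessment, and the returns below are made up.

```python
from math import sqrt
from statistics import mean, stdev

def optimality_gap_ci(oracle_returns, policy_returns, z=1.96):
    """Mean per-task gap (oracle minus policy) with a normal-approximation CI."""
    gaps = [o - p for o, p in zip(oracle_returns, policy_returns)]
    m = mean(gaps)
    half_width = z * stdev(gaps) / sqrt(len(gaps))
    return m, (m - half_width, m + half_width)

# Hypothetical returns on four test tasks.
oracle = [10.0, 12.0, 9.0, 11.0]
policy = [9.0, 10.0, 9.0, 8.0]
gap, (lo, hi) = optimality_gap_ci(oracle, policy)
```

Because the half-width scales as one over the square root of the number of test tasks, a small test set (such as 20 goals) yields wide intervals unless per-task variance is low.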

Potential Impact

    The paper addresses an important direction—bridging LLMs and sequential decision-making—which is timely given the rapid expansion of LLM capabilities. The motivation for healthcare applications (offline data, model ambiguity from unobserved confounders) is compelling in principle. However, several factors limit near-term impact:

    1. Oracle dependency: The framework requires access to optimal or near-optimal policies for generating training data, which is a strong requirement and somewhat circular—if you can compute optimal policies, the decision-making problem is largely solved.

    2. Synthetic-only evaluation: Without real-world validation, the practical utility remains speculative. The gap between 10-state synthetic environments and clinical decision-making is enormous.

    3. Scalability concerns: The approach requires fine-tuning a 7B parameter model on task-specific trajectory data. The computational cost (up to 3-4 days per run on A100) may limit practical adoption when simpler methods could work.

    4. Limited advantage over simpler approaches: For the tabular problems studied, classical RL methods (backward induction, belief-state methods) already solve these problems optimally. The value proposition of using an LLM here is unclear.
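
To illustrate point 4, here is a minimal sketch of finite-horizon backward induction on a tabular MDP of the size discussed (10 states, 3 actions). The random transition and reward tables are stand-ins, not the paper's energy-management task, and the function names are hypothetical.

```python
import numpy as np

def backward_induction(P, R, horizon):
    """P[a, s, s'] are transition probabilities, R[s, a] are rewards.
    Returns the optimal value at t=0 and a time-dependent greedy policy."""
    n_states = R.shape[0]
    V = np.zeros(n_states)
    policy = np.zeros((horizon, n_states), dtype=int)
    for t in reversed(range(horizon)):
        # Q[s, a] = R[s, a] + sum_{s'} P[a, s, s'] * V[s']
        Q = R + np.einsum("ast,t->sa", P, V)
        policy[t] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return V, policy

def evaluate(P, R, policy, horizon):
    """Value at t=0 of a deterministic, time-dependent policy."""
    n_states = R.shape[0]
    idx = np.arange(n_states)
    V = np.zeros(n_states)
    for t in reversed(range(horizon)):
        a = policy[t]
        V = R[idx, a] + P[a, idx] @ V
    return V

rng = np.random.default_rng(0)
n_states, n_actions, H = 10, 3, 5
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)        # normalize rows to probabilities
R = rng.random((n_states, n_actions))

V_opt, pi_opt = backward_induction(P, R, H)
always_zero = np.zeros((H, n_states), dtype=int)   # naive baseline policy
gap = V_opt - evaluate(P, R, always_zero, H)       # per-state optimality gap
```

At this scale the exact solution takes a few lines and negligible compute, which is the context for the question of what a fine-tuned 7B model adds on such problems.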

    Timeliness & Relevance

    The paper is timely in connecting LLMs to RL/decision-making, which is an active research area. The extension to POMDPs and APOMDPs is relevant for real-world applications where full observability is unrealistic. The APOMDP framework connecting to healthcare DTRs addresses a genuine need. However, the concurrent work on foundation models for decision-making (e.g., from DeepMind, OpenAI) may quickly supersede the specific approach proposed here.

    Strengths

    1. Systematic progression from MDPs to POMDPs to APOMDPs provides clear evaluation of increasing complexity.

    2. Theoretical decomposition into in-context estimation error and training-length bias, while limited in scope, provides useful intuition about how support and training data contribute differently.

    3. Practical framework: QLoRA-based fine-tuning is parameter-efficient and reproducible.

    4. Robustness analysis: OOD generalization tests and ablation on support trajectory quality are informative additions.

    5. Favorable comparison against DPT across all settings.

    Limitations

    1. Theory-practice gap: The theory covers linear MDPs with a single attention layer; experiments use full Llama-2 on tabular problems. The theory explains little about why the full model works.

    2. Weak baselines: No comparison with offline RL methods, meta-RL, or even simple imitation learning approaches.

    3. Oracle assumption: Requiring optimal policies for training data generation severely limits applicability.

    4. Scale of experiments: Small state/action spaces (10 states, 3 actions) on synthetic tasks.

    5. Unclear LLM advantage: Why use a 7B parameter model for a 10-state MDP? The paper doesn't convincingly argue that pretrained language representations provide meaningful inductive bias for these numerical sequential decision problems.

    6. Missing important ablations: No comparison of LLM sizes, no analysis of what the pretrained representations contribute beyond a randomly initialized transformer of similar size.

    Overall Assessment

    This paper makes a reasonable contribution to an important research direction but falls short of providing compelling evidence that pretrained LLMs offer distinctive advantages for sequential decision-making. The theoretical analysis, while technically correct, applies to a much simpler setting than the experiments. The empirical evaluation, while systematic, uses overly simple tasks with weak baselines. The healthcare motivation is not validated with any real or realistic data. The work is best viewed as a preliminary exploration that identifies an interesting research direction but requires substantially more evidence to demonstrate meaningful scientific impact.

    Rating: 4.5/10
    Significance 4.5 · Rigor 4.5 · Novelty 4.5 · Clarity 6.5

    Generated May 12, 2026

    Comparison History (33)

    vs. Function-Space ADMM for Decentralized Federated Learning: A Control Theoretic Perspective
    gemini-3.1 · 5/12/2026

    Paper 1 bridges two highly active and impactful fields: Large Language Models and offline Reinforcement Learning. By providing both a novel theoretical framework (interpreting attention layers as Q-function estimators) and empirical validation for sequential decision-making in complex environments (POMDPs), it opens a significant new pathway for LLM agent research. Paper 2 presents a strong, rigorous approach to decentralized Federated Learning, but Paper 1's alignment with the explosive interest in LLM reasoning and real-world decision-making (e.g., healthcare) gives it broader cross-disciplinary appeal and higher potential for disruptive scientific impact.

    vs. FedCIGAR: A Personalized Reconstruction Approach for Federated Graph-level Anomaly Detection
    gpt-5.2 · 5/12/2026

    Paper 2 likely has higher impact due to broader cross-field relevance (LLMs + RL/decision-making), strong timeliness, and a combination of theoretical guarantees and empirical results. Enabling sequential decision-making via supervised fine-tuning on offline trajectories has clear real-world applications (e.g., healthcare) and could influence both LLM alignment and offline RL. Paper 1 is novel within federated graph anomaly detection and practically relevant for privacy, but its scope is narrower and impact is more domain-specific.

    vs. Predicting Large Model Test Losses with a Noisy Quadratic System
    gemini-3.1 · 5/12/2026

    Paper 1 addresses a critical bottleneck in foundation model training: predicting scaling behavior and optimal resource allocation. By outperforming Chinchilla's model and handling changing batch sizes, it offers massive compute and cost savings for large-scale AI development. Its foundational implications for training future models give it a higher potential impact than Paper 2, which presents valuable but narrower theoretical work on sequential decision-making currently limited to synthetic environments.

    vs. Causal Parametric Drift Simulation: A Digital Twin Framework for Classifier Robustness Evaluation
    claude-opus-4.6 · 5/12/2026

    Paper 1 addresses a high-impact intersection of LLMs and sequential decision-making with both theoretical contributions (suboptimality bounds for fine-tuned attention layers in linear MDPs) and empirical validation across multiple settings (MDPs, POMDPs, APOMDPs). It tackles a timely and broadly relevant problem with rigorous methodology and clear practical applications (e.g., healthcare). Paper 2 introduces an interesting causal framework for drift evaluation but is narrower in scope, demonstrated on a single dataset, and addresses a more niche problem. Paper 1's breadth, theoretical depth, and timeliness give it higher impact potential.

    vs. RubricRefine: Improving Tool-Use Agent Reliability with Training-Free Pre-Execution Refinement
    gpt-5.2 · 5/12/2026

    Paper 2 likely has higher impact: it proposes a clearly novel, training-free reliability layer targeting a concrete, widespread failure mode in tool-using agents (inter-tool contract violations) and demonstrates large, robust gains across seven models on a standard benchmark with improved latency. This is timely and immediately deployable for real-world agent systems. Paper 1 advances LLMs for sequential decision-making with theory and synthetic experiments, but relies on supervised fine-tuning with oracle-labeled trajectories and has less direct evidence of broad, near-term applicability beyond controlled settings.

    vs. Attention Drift: What Autoregressive Speculative Decoding Models Learn
    claude-opus-4.6 · 5/12/2026

    Paper 1 identifies a novel, previously unreported phenomenon (attention drift) in speculative decoding, provides mechanistic understanding tracing it to unnormalized residual paths, and proposes simple architectural fixes yielding significant practical improvements. This addresses a core efficiency bottleneck in LLM inference—a topic of immense current relevance with broad impact. Paper 2 contributes solid but more incremental work applying SFT to improve ICL for sequential decision-making, combining known techniques (fine-tuning, in-context learning) in a relatively expected way. Paper 1's mechanistic insight and practical gains in widely-deployed inference systems give it higher impact potential.

    vs. Near-Future Policy Optimization
    gpt-5.2 · 5/12/2026

    Paper 2 is likely to have higher impact due to timeliness and direct applicability to mainstream post-training of foundation models (RLVR/GRPO). NPO/AutoNPO is a broadly usable, implementation-simple recipe that can be plugged into many RL fine-tuning pipelines and model families, potentially affecting widespread practice. Paper 1 is novel in framing LLMs for sequential decision-making with theory on linear MDPs, but its empirical scope appears more synthetic and the application path (offline imitation for decision-making) is narrower and less immediately adoptable at scale than a general-purpose RL post-training optimization method.

    vs. Chain of Uncertain Rewards with Large Language Models for Reinforcement Learning
    gemini-3.1 · 5/12/2026

    Paper 2 offers fundamental theoretical contributions by deriving suboptimality bounds for LLM attention layers as Q-function estimators, alongside empirical evidence across complex decision-making environments. This combination of theoretical grounding and broad applicability to offline domains like healthcare gives it a wider potential scientific impact compared to Paper 1, which focuses primarily on the practical engineering bottleneck of RL reward design.

    vs. Convergence Analysis of Newton's Method for Neural Networks in the Overparameterized Limit
    gemini-3.1 · 5/12/2026

    Paper 2 addresses the highly active and impactful area of extending Large Language Models to sequential decision-making. By combining theoretical suboptimality bounds with empirical validation across various MDP settings, it bridges offline reinforcement learning and LLM fine-tuning. This has immediate and broad real-world applicability in domains like healthcare and robotics. While Paper 1 provides a rigorous theoretical foundation for Newton's method in overparameterized networks, second-order methods remain computationally prohibitive for modern deep learning, limiting its practical impact compared to the timely advancements in LLM capabilities presented in Paper 2.

    vs. LLM Advertisement based on Neuron Auctions
    claude-opus-4.6 · 5/12/2026

    Paper 1 addresses the fundamental and broadly impactful problem of using LLMs for sequential decision-making, providing both theoretical foundations (suboptimality bounds for linear MDPs) and empirical validation across multiple settings. Its contributions span RL, decision-making, and healthcare applications. Paper 2, while creative in proposing neuron-level auctions for LLM advertising, addresses a narrower commercial problem with less scientific breadth. Paper 1's theoretical rigor, wider applicability across fields, and relevance to offline RL give it stronger potential for lasting scientific impact.

    vs. Transformers with Selective Access to Early Representations
    claude-opus-4.6 · 5/12/2026

    Paper 1 addresses the high-impact intersection of LLMs and sequential decision-making with both theoretical foundations (suboptimality bounds for linear MDPs) and empirical validation across MDPs, POMDPs, and APOMDPs. Its applicability to domains like healthcare where offline data is abundant gives it significant real-world relevance. Paper 2 proposes an incremental architectural improvement (selective gating of early representations) with modest empirical gains (~1.5 points on retrieval benchmarks). While technically sound, it represents a narrower contribution to Transformer architecture design with less transformative potential.

    vs. Learning Unified Representations of Normalcy for Time Series Anomaly Detection
    claude-opus-4.6 · 5/12/2026

    Paper 2 addresses the broadly impactful problem of unsupervised time series anomaly detection with a novel framework (U²AD) combining score-based generative modeling with unified training objectives. It demonstrates state-of-the-art results and early anomaly detection, which has wide real-world applicability (industrial monitoring, cybersecurity, healthcare). Paper 1, while theoretically interesting in connecting LLMs to sequential decision-making with suboptimality bounds, operates in a more niche intersection of LLM fine-tuning and RL, with primarily synthetic experiments. Paper 2's methodological novelty and broader practical applicability give it higher potential impact.

    vs. Model Capacity Determines Grokking through Competing Memorisation and Generalisation Speeds
    gemini-3.1 · 5/12/2026

    Paper 1 bridges two highly active fields (LLMs and sequential decision-making/offline RL) with both theoretical guarantees and empirical results. Its approach to using offline data for fine-tuning has immediate, broad real-world applications in domains like healthcare and robotics. While Paper 2 offers valuable theoretical insights into grokking, its focus is constrained to algorithmic tasks (modular arithmetic), giving Paper 1 a broader and more immediate potential impact across multiple disciplines.

    vs. Anchor-guided Hypergraph Condensation with Dual-level Discrimination
    gpt-5.2 · 5/12/2026

    Paper 2 has higher potential impact due to stronger timeliness and broader cross-field relevance: it connects LLMs, offline RL/imitation learning, and POMDP decision-making, with both theoretical guarantees (suboptimality bound in linear MDPs) and clear real-world applicability (offline trajectories in domains like healthcare). The idea of improving sequential decision-making via supervised fine-tuning of LLMs is widely reusable across tasks and likely to influence multiple communities. Paper 1 is valuable but more specialized to hypergraph condensation and limited in breadth of downstream adoption.

    vs. PDE-regularized Dynamics-informed Diffusion with Uncertainty-aware Filtering for Long-Horizon Dynamics
    gemini-3.1 · 5/12/2026

    Paper 2 addresses the integration of Large Language Models with sequential decision-making, an extremely timely and broad area of AI research. By providing both theoretical bounds and empirical success across complex environments (POMDPs, APOMDPs), it has massive potential for cross-disciplinary impact in robotics, healthcare, and autonomous agents. While Paper 1 presents a rigorous physics-informed approach, the widespread applicability and current explosive interest in LLM-based agents give Paper 2 a higher potential for broad scientific impact.

    vs. LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models
    claude-opus-4.6 · 5/12/2026

    LEAD addresses a highly timely and practically important problem—reducing verbose reasoning in large reasoning models like DeepSeek-R1 and OpenAI o1—which affects virtually all LLM deployments. Its adaptive, dynamic approach to balancing correctness and efficiency during RL training is novel and methodologically rigorous, with strong empirical results across five benchmarks. Paper 2 makes a solid theoretical and empirical contribution on LLM-based sequential decision-making via SFT, but addresses a more niche intersection with limited empirical scope (synthetic settings). LEAD's broader applicability to the rapidly growing reasoning model ecosystem gives it higher potential impact.

    vs. Beyond Distribution Estimation: Simplex Anchored Structural Inference Towards Universal Semi-Supervised Learning
    gemini-3.1 · 5/12/2026

    Paper 2 bridges large language models and sequential decision-making, a highly active and impactful research area. By providing both theoretical suboptimality bounds and empirical evidence for improving in-context learning via fine-tuning, it addresses fundamental capabilities of LLM-based agents. Its potential applications in critical domains like healthcare further elevate its real-world relevance. While Paper 1 offers strong empirical gains in a specific subfield of semi-supervised learning, Paper 2's intersection of LLMs, theoretical rigor, and reinforcement learning likely commands broader cross-disciplinary impact.

    vs. HS-FNO: History-Space Fourier Neural Operator for Non-Markovian Partial Differential Equations
    gemini-3.1 · 5/12/2026

    Paper 1 bridges two highly active and influential fields: Large Language Models and sequential decision-making (RL). By providing both theoretical bounds and empirical evidence for improving in-context learning via supervised fine-tuning in MDPs and POMDPs, it paves the way for advanced AI agents in complex real-world domains like healthcare. While Paper 2 offers a strong methodological advance for scientific machine learning and PDE surrogates, Paper 1 has broader applicability, higher timeliness, and appeals to a vastly larger research community, suggesting a significantly higher potential scientific impact.

    vs. Dystruct: Dynamically Structured Diffusion Language Model Decoding via Bayesian Inference
    gpt-5.2 · 5/12/2026

    Paper 2 likely has higher impact due to broader cross-field relevance (LLMs + RL/decision-making), clearer real-world application pathways (offline trajectories in high-stakes domains like healthcare), and stronger methodological rigor via both theoretical guarantees (suboptimality bound in linear MDPs) and empirical validation across MDP/POMDP/APOMDP regimes. It also aligns with a timely agenda: adapting foundation models for control and planning from offline data. Paper 1 is novel for diffusion LM decoding flexibility, but its impact may be narrower to DLM decoding and contingent on wider DLM adoption.

    vs. When Independent Sampling Outperforms Agentic Reasoning
    gemini-3.1 · 5/12/2026

    Paper 1 bridges LLMs and sequential decision-making with rigorous theoretical guarantees (suboptimality bounds) and empirical validation across complex environments (POMDPs). This foundational framework has broad applicability in critical domains like healthcare. While Paper 2 offers timely empirical insights on inference compute allocation for coding tasks, its scope is narrower and lacks the theoretical depth and broad cross-field potential of Paper 1's contribution to reinforcement learning and foundation models.