Characterizing Model-Native Skills
Feiyang Kang, Mahavir Dabas, Myeongseob Ko, Ruoxi Jia
Abstract
Skills are a natural unit for describing what a language model can do and how its behavior can be changed. However, existing characterizations rely on human-written taxonomies, textual descriptions, or manual profiling pipelines--all external hypotheses about what matters that need not align with the model's internal representations. We argue that when the goal is to intervene on model behavior, skill characterization should be *model-native*: grounded in the model's own representations rather than imposed through external ontologies. We instantiate this view by recovering a compact orthogonal basis from sequence-level activations. The resulting basis is semantically interpretable but need not correspond to any predefined human ontology; instead, it captures axes of behavioral variation that the model itself organizes around. We validate this characterization on reasoning post-training, using the recovered basis for both SFT data selection and inference-time steering. We develop lightweight proxy interventions to identify which directions are most useful for a given model. Across Llama3-8B and Qwen2.5-3B, selecting data along those directions improves Pass@1 by up to 20% on MATH and 41% on AMC, outperforming data selection based on human-characterized skills. Because the basis lives in activation space, the same directions also serve as steering vectors at inference time, improving Pass@8 by up to 4.8% on MATH--an intervention that human-characterized skills cannot support. We further validate the characterization on safety alignment, where selecting adversarial training data for model-native skill coverage rather than textual diversity yields more sample-efficient learning. These results suggest that recovering skills from the model's own representations, rather than imposing them externally, provides a more effective foundation for intervening on model behavior. The code is open-sourced.
AI Impact Assessments (1 model)
Scientific Impact Assessment: "Characterizing Model-Native Skills"
1. Core Contribution
The paper introduces AUTOSKILL, a framework that defines "skills" as principal component directions in a language model's activation space rather than through human-defined taxonomies. The central thesis is that when the goal is to *intervene* on model behavior (via data selection or inference-time steering), skill characterization should be grounded in the model's own representations. The method applies PCA to sequence-level activations (concatenated across all layers), recovers orthogonal directions, and uses these for: (1) SFT data selection, (2) inference-time steering, and (3) safety alignment data curation. Lightweight proxy interventions (pilot fine-tuning on 1k examples or steering) cheaply identify which directions matter before committing to full training.
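The PCA step and direction-based selection described above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the authors' implementation: the helper names `skill_basis` and `select_along_direction`, the array sizes, and the "top projection score" selection rule are all assumptions made for the example, and real activations would come from a model's forward pass rather than a random generator.

```python
import numpy as np

def skill_basis(activations, n_components):
    """Return the mean and the top principal directions (as rows)
    of an (n_samples, dim) activation matrix."""
    mean = activations.mean(axis=0)
    centered = activations - mean
    # Rows of vt are orthonormal directions sorted by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]

def select_along_direction(activations, mean, direction, budget):
    """Pick the `budget` examples with the largest projection onto `direction`
    (an assumed selection rule, used here only for illustration)."""
    scores = (activations - mean) @ direction
    return np.argsort(scores)[::-1][:budget]

rng = np.random.default_rng(0)
# Synthetic stand-in for sequence-level activations: 200 sequences,
# 3 layers with 16 pooled features each, concatenated across layers.
per_layer = [rng.normal(size=(200, 16)) for _ in range(3)]
acts = np.concatenate(per_layer, axis=1)          # shape (200, 48)

mean, basis = skill_basis(acts, n_components=5)   # basis has shape (5, 48)
selected = select_along_direction(acts, mean, basis[0], budget=20)
print(selected.shape)
```

Because the rows of `basis` are orthonormal, each example's coordinates along different directions can be read off independently, which is what makes the recovered basis compact and compositional in the sense the review describes.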
The conceptual reframing—skills as model-native rather than externally imposed—is the paper's most distinctive contribution. The observation that different models organize the same data along different axes (Table 1) provides concrete evidence that model-specific characterization matters.
2. Methodological Rigor
Strengths in experimental design:
Concerns:
3. Potential Impact
Practical utility: The framework provides a concrete, automated pipeline for data selection that outperforms manual curation in several settings. The ability to reuse the same basis for both data selection and inference-time steering is genuinely useful. The bias-parameter trick for vLLM-compatible steering (Section 3.3/B.2) is a nice engineering contribution.
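One plausible reading of the bias-parameter trick is that a constant steering vector added to a layer's output can be folded into that layer's bias, so no runtime hook is needed at inference. The sketch below only demonstrates that algebraic identity on a toy linear layer. The shapes, the scale `alpha`, and the choice of layer are assumptions, and the paper's actual construction in Section 3.3/B.2 may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 8, 6

# Toy linear layer standing in for one projection inside a transformer block.
W = rng.normal(size=(d_out, d_in))
b = rng.normal(size=d_out)

# Hypothetical steering direction (unit norm) and strength.
direction = rng.normal(size=d_out)
direction /= np.linalg.norm(direction)
alpha = 2.0

def linear(x, weight, bias):
    return x @ weight.T + bias

x = rng.normal(size=(4, d_in))

# Hook-style steering: add the vector to the output at run time.
hooked = linear(x, W, b) + alpha * direction
# Bias-style steering: bake the same vector into the bias once, ahead of time.
folded = linear(x, W, b + alpha * direction)

print(np.allclose(hooked, folded))
```

The two outputs coincide because `W x + b + alpha * v` equals `W x + (b + alpha * v)`; the practical appeal is that the modified weights can be served by an unmodified inference engine.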
Conceptual influence: The paper articulates a clean set of desiderata (compositional, compact, grounded) for model-native skill characterization. This framing could influence how the community thinks about data curation and model profiling, particularly as post-training pipelines become more sophisticated.
Limitations on impact: The method's reliance on PCA means it captures only linear structure. The safety experiment (Table 5) shows modest gains (e.g., +37 refusals out of 2000 at budget 100, i.e., 1.85 percentage points), and the reasoning improvements, while notable, are benchmarked against methods that weren't designed for these specific settings (s1.1 and LIMO were designed for different models/configurations).
4. Timeliness & Relevance
The paper addresses a genuine bottleneck in reasoning post-training: how to select data efficiently without expensive manual curation or reward model training. As open reasoning pipelines (OpenThoughts, s1, LIMO) proliferate, principled data selection becomes increasingly important. The connection between activation-space analysis and practical intervention (data selection + steering) is timely given growing interest in representation engineering.
However, the field is moving rapidly toward RL-based post-training, where SFT data selection may become less central. The paper acknowledges, but doesn't address, how model-native skills interact with RL training dynamics.
5. Strengths & Limitations
Key strengths:
Notable weaknesses:
Additional observations:
Summary
This paper makes a reasonable conceptual contribution by arguing for model-native skill characterization and demonstrating that even simple PCA-based decomposition outperforms human-defined skill taxonomies for data selection. However, the technical novelty is modest, the statistical rigor of the evaluation could be stronger, and some experimental results partially undermine the core thesis. The work is best understood as establishing a useful baseline and framing for model-native skill characterization, with significant room for more sophisticated instantiations.
Generated May 5, 2026
Comparison History (48)
Paper 1 exposes a critical, potentially life-threatening flaw in current AI safety paradigms (identity-contingent withholding of medical information). Its rigorous, pre-registered clinical validation challenges foundational assumptions in AI alignment, policy, and medical AI. While Paper 2 offers strong technical contributions to model steering, Paper 1's findings on 'iatrogenic harm' have broader, more urgent ethical and real-world implications that could force a paradigm shift in how frontier models are evaluated and aligned for safety.
Paper 2 addresses a critical, highly timely issue: the unintended real-world harms (iatrogenic harm) caused by current AI safety alignment. By providing pre-registered, rigorously validated empirical evidence of identity-contingent withholding in medical contexts, it is poised to significantly impact both technical AI alignment research and broader AI policy. While Paper 1 offers a strong methodological advance in representation learning and model steering, Paper 2's focus on high-stakes, real-world consequences and its exposure of fundamental flaws in current safety paradigms give it broader potential scientific and societal impact.
Paper 1 fundamentally challenges the highly hyped paradigm of autonomous AI scientists, demonstrating through extensive empirical analysis that current systems lack true scientific reasoning. This critical finding will likely shift the field's focus from scaffold engineering toward fundamentally improving model reasoning and evaluation, giving it a broader and more profound impact across AI and scientific disciplines compared to the technical improvements in Paper 2.
Paper 1 presents a fundamentally new paradigm (machine collective intelligence) for autonomous scientific discovery that bridges symbolism and metaheuristics, demonstrating extraordinary results across diverse scientific domains with up to six orders of magnitude improvement in extrapolation. Its breadth of impact spans all empirical sciences, addressing a core bottleneck in AI-driven discovery. Paper 2 makes a solid contribution to model interpretability and steering but is more narrowly focused on LLM behavior characterization. Paper 1's potential to transform how scientific equations are discovered gives it substantially broader and deeper impact.
Paper 2 addresses a fundamental challenge in AI-driven scientific discovery—deriving interpretable governing equations from data—with a novel paradigm combining symbolism and metaheuristics via multi-agent collective intelligence. Its demonstrated results (up to 6 orders of magnitude improvement in extrapolation, massive parameter reduction) across diverse scientific systems suggest broad cross-disciplinary impact in physics, biology, and engineering. While Paper 1 offers valuable contributions to LLM interpretability and steering, its scope is narrower (LLM behavior intervention). Paper 2's potential to transform scientific discovery methodology gives it higher estimated impact.
Paper 1 offers a more methodologically innovative, model-internal approach (recovering an orthogonal activation-space basis for “skills”) with direct, demonstrated capability to both improve training (data selection) and enable inference-time steering—broadly applicable across LLM development, interpretability, and alignment. The empirical gains on standard math benchmarks and safety efficiency, plus generality across models, suggest wide uptake and follow-on work. Paper 2 is timely and practically valuable for auditing prior contamination, but is a simpler protocol-level contribution with narrower core methodological novelty and likely more domain/workflow-specific impact.
Paper 2 addresses a highly timely and widely debated topic—the autonomy and validity of AI scientists. Its large-scale evaluation (25,000 runs) exposes critical flaws in how current LLMs approach scientific reasoning, fundamentally challenging outcome-based evaluations. This paper is likely to have a profound, cross-disciplinary impact, influencing AI development, philosophy of science, and how the broader scientific community adopts AI tools. While Paper 1 offers a strong methodological advance for LLM steering, Paper 2's epistemological critique of AI systems carries broader, paradigm-shifting implications.
Paper 2 likely has higher scientific impact due to a more novel, general, and mechanistic contribution: recovering a compact orthogonal basis from activation space to define “model-native” skills that directly support intervention (data selection and steering). It demonstrates sizable, benchmarked gains across multiple models and tasks (math reasoning, safety alignment) with clear methodological framing and broader applicability to interpretability, training, and alignment. Paper 1 addresses an important practical auditability issue, but the protocol is simpler, more application-specific, and offers measurement rather than new capability, limiting breadth and long-term cross-field impact.
Paper 1 has higher potential impact due to its strong novelty (end-to-end autonomous discovery on real hardware with experimental validation) and broad cross-field implications spanning AI agents, experimental optics, and potential neuromorphic/optical computing. It reports a previously unreported physical mechanism analogous to Transformer attention, suggesting tangible real-world hardware applications. While Paper 2 is methodologically solid and timely for improving LLM training/steering, its contributions are more incremental within ML and likely narrower in broader scientific reach than a validated autonomous discovery milestone.
Paper 1 likely has higher impact due to its strong novelty (end-to-end autonomous discovery with closed-loop real-world experimentation), methodological ambition (long-horizon agentic system validated on a physical optical platform), and broad cross-field implications spanning AI agents, experimental physics, and potential optical computing hardware. It claims discovery and experimental validation of a previously unreported physical mechanism, which—if reproducible—can influence both scientific automation and hardware acceleration. Paper 2 is rigorous and useful for ML practice, but its advances are more incremental and primarily confined to model interpretability/steering within ML.
HealthFormer represents a paradigm-shifting approach to personalized medicine by creating a generative 'health world model' trained on deeply phenotyped longitudinal data from 15,000+ individuals across 667 measurements. Its ability to simulate clinical interventions in silico, validated against published RCTs (41/41 correct direction, 30/41 within CIs), and transfer to independent cohorts for disease/mortality prediction has enormous real-world clinical impact potential as a foundation for clinical digital twins. Paper 2 offers valuable methodological contributions to LLM interpretability and steering, but its scope and real-world impact are narrower, focused on improving model training/inference rather than transforming healthcare.
HealthFormer addresses a fundamental challenge in medicine—personalized health modeling and intervention simulation—with broad clinical applications including disease prediction, risk stratification, and digital twins. Its validation across independent cohorts, 30 disease endpoints, and 41 randomized trial comparisons demonstrates strong methodological rigor and generalizability. The potential to simulate clinical interventions in silico could transform drug development and personalized medicine. While Paper 1 offers valuable ML methodology for understanding model representations, Paper 2's direct medical applications, multimodal integration across 667 measurements, and paradigm-shifting 'health world model' concept give it substantially broader and deeper scientific impact.
Paper 1 likely has higher scientific impact due to its broader real-world applications (biomolecular prediction and design with clinical and therapeutic relevance), strong novelty in aligned multimodal generative modeling across sequence/structure/regulation/context, and potential to influence multiple life-science domains (genomics, transcriptomics, proteomics, drug/design). Paper 2 is innovative and timely for LLM interpretability and intervention, but its impact is more field-contained (LLM training/steering) and may face faster iteration/obsolescence. Both seem rigorous; Paper 1’s cross-modal foundation plus experimental/clinical demos increases impact potential.
Paper 2 likely has higher impact due to major real-world applications (genomics, splicing, clinical variants, protein/RNA design) and broad cross-field relevance (ML, genomics, structural biology, drug discovery). Its multimodal generative framework plus a newly curated aligned dataset enables conditional inference/design and state-of-the-art downstream performance, which can translate into tangible biomedical advances. Paper 1 is novel and timely for model interpretability/control and shows strong gains, but its immediate societal/economic impact is narrower and depends on broader adoption in LLM tooling.
Paper 1 addresses specification gaming in RL-trained reasoning models—a timely safety-critical problem as reasoning models (o1, Grok, etc.) are rapidly deployed. It provides systematic empirical evidence linking RL reasoning training to specification gaming, offers an open-source evaluation suite, and produces actionable findings about a fundamental failure mode. Its breadth (8 settings, multiple frontier models) and direct safety implications give it high impact. Paper 2 presents a novel interpretability method with solid results, but its scope is narrower (data selection/steering) and builds on existing mechanistic interpretability ideas, limiting its broader transformative potential.
Paper 1 has higher potential impact due to a more novel, model-internals-based framework (model-native skill bases from activations) that unifies training-data selection and inference-time steering, and extends to safety alignment—broadly relevant across LLM capability/safety, interpretability, and control. Its intervention mechanism (activation-space directions) is more fundamental and likely reusable across models and tasks. Paper 2 is timely and application-oriented for forecasting agents, but leans more toward systems engineering (decomposition + grounding + robust aggregation) with narrower cross-field novelty and potentially more domain-specific evaluation.
Paper 1 addresses specification gaming in reasoning models—a critical and timely AI safety problem as RL-trained reasoning models become widely deployed. Its systematic evaluation across multiple models, identification of RL training as a key driver, and open-sourced benchmark make it highly impactful for the safety community. Paper 2 presents a novel model-native skill characterization approach with solid results, but addresses a more niche technical problem. Paper 1's findings have broader implications for AI alignment, policy, and deployment practices, giving it higher potential impact across the field.
Paper 2 is more novel and broadly impactful: it proposes a general, model-native framework for discovering skill axes directly from activations and demonstrates bidirectional utility (training data selection and inference-time steering) across multiple models and domains (math reasoning, safety). The method is likely to generalize to many intervention settings in interpretability, alignment, and optimization, with clear methodological contributions and open-sourced code. Paper 1 has high immediate real-world relevance, but its impact is more application/deployment-specific and may be less transferable scientifically than a new representation-grounded intervention paradigm.
Paper 1 offers a novel, generalizable methodology (model-native skill bases from activations) with clear, quantified gains and dual use for both training data selection and inference-time steering—likely to influence multiple LLM intervention areas (capabilities, alignment, interpretability). Its approach is broadly applicable across models/tasks and advances mechanistic control. Paper 2 is timely and high-profile with strong real-world relevance, but its impact may be narrower (peer-review operations/policy), more deployment- and survey-dependent, and potentially harder to generalize scientifically beyond the specific system and setting.
Paper 1 is more novel and broadly impactful: it proposes a model-native, activation-space basis for “skills,” enabling both training-data selection and inference-time steering from the same discovered directions—an approach that generalizes across capabilities (reasoning) and alignment (safety) and connects to mechanistic interpretability and controllable learning. It reports large gains on standard benchmarks and offers a reusable intervention primitive. Paper 2 is timely and application-oriented, but its SPR/agent architecture resembles existing structured deliberation + tool-grounding + ensembling paradigms, with impact likely more domain-specific and dependent on engineering choices.