Instruction Finetuning DeepSeek-R1-8B Model Using LoRA and NEFTune

Wu Yuerong, Mingni Luo

cs.AI
Rank: #3324 of 3489 · Artificial Intelligence
Tournament Score: 1213 ± 45
Win Rate: 12% (3 wins, 23 losses, 26 matches)
Rating: 2.5 / 10
Significance 2 · Rigor 2 · Novelty 1.5 · Clarity 4.5

Abstract

Financial named-entity recognition (NER) is essential for translating unstructured financial reports and news into structured knowledge graphs. However, general-purpose large language models (LLMs) often misclassify financial entities or ignore domain-specific patterns. This paper investigates the use of DeepSeek-R1-8B, a recent open-source large language model, combined with Low-Rank Adaptation (LoRA) and Noisy Embedding Fine-Tuning (NEFTune) for financial NER. Each annotated sentence in our corpus of 1693 samples is converted into an instruction-input-output triple. We insert lightweight LoRA matrices into the Transformer layers and apply NEFTune to improve generalisation by adding uniform noise to embedding vectors during training. Experiments show that the LoRA-adapted DeepSeek-R1-8B achieves a micro-F1 of 0.901 on seven entity types (Company, Date, Location, Money, Person, Product and Quantity), and adding NEFTune further boosts the micro-F1 to 0.912, outperforming Llama3-8B, Qwen3-8B, Baichuan2-7B, T5 and BERT-Base baselines.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

1. Core Contribution

This paper applies DeepSeek-R1-8B with LoRA and NEFTune to financial named-entity recognition (NER) across seven entity types. The approach converts annotated sentences into instruction-input-output triples and fine-tunes only low-rank adapter matrices while injecting uniform noise into embeddings during training. The reported micro-F1 of 0.912 outperforms several baselines including Llama3-8B, Qwen3-8B, Baichuan2-7B, T5, and BERT-Base.
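To make the two data-side steps in this summary concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The prompt wording, the output format, and the helper names `to_instruction_triple` and `neftune_noise` are assumptions made for illustration. The noise scaling `alpha / sqrt(L * d)` follows the NEFTune paper; the value `alpha = 5.0` is an arbitrary example, since the reviewed paper's setting is not given here.

```python
import math
import random


def to_instruction_triple(sentence, entities):
    """Turn one annotated sentence into an instruction-input-output record.

    `entities` is a list of (span_text, entity_type) pairs.
    The instruction wording and output format are hypothetical.
    """
    instruction = (
        "Extract all financial entities from the sentence "
        "and label each with its type."
    )
    output = "; ".join(f"{text} [{label}]" for text, label in entities)
    return {"instruction": instruction, "input": sentence, "output": output}


def neftune_noise(embeddings, alpha, rng):
    """Add uniform noise to a (seq_len x dim) list of embedding rows.

    Each entry receives noise drawn from Uniform(-1, 1) and scaled by
    alpha / sqrt(seq_len * dim). Applied during training only.
    """
    seq_len = len(embeddings)
    dim = len(embeddings[0])
    scale = alpha / math.sqrt(seq_len * dim)
    return [
        [value + scale * rng.uniform(-1.0, 1.0) for value in row]
        for row in embeddings
    ]


if __name__ == "__main__":
    record = to_instruction_triple(
        "Apple reported revenue in 2023.",
        [("Apple", "Company"), ("2023", "Date")],
    )
    print(record["output"])

    toy_embeddings = [[0.0] * 8 for _ in range(4)]  # 4 tokens, 8 dimensions
    noisy = neftune_noise(toy_embeddings, alpha=5.0, rng=random.Random(0))
    print(len(noisy), len(noisy[0]))
```

Because the scale shrinks with sequence length and embedding width, the per-entry perturbation stays small for long inputs, which is why the noise magnitude `alpha` is a natural candidate for the ablations discussed later in this assessment.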

The core contribution is essentially an engineering application — combining three existing techniques (instruction tuning, LoRA, NEFTune) on a specific model (DeepSeek-R1-8B) for a specific task (financial NER). There is no new method, architecture, or theoretical insight introduced. The novelty lies solely in the particular combination and the choice of base model.

2. Methodological Rigor

The experimental methodology has several significant weaknesses:

Dataset concerns: The corpus contains only 1,693 sentences — extremely small by modern NER standards. The paper does not specify the source or provenance of this dataset, whether it is publicly available, or how annotations were validated. There is no inter-annotator agreement reported. The discrepancy between Table 2 (which lists entity counts like Company=1,033) and Table 3 (Company=1,789) is unexplained and raises questions about data handling.

Baseline fairness: The baselines (BERT-Base, T5, Baichuan2-7B, Qwen3-8B, Llama3-8B) are listed without any detail about how they were trained. Were they all instruction-tuned with the same template? Were they fine-tuned with LoRA as well, fully fine-tuned, or evaluated zero-shot? Without controlled comparisons, the results are uninterpretable. A fair comparison would apply LoRA+NEFTune to all models, or at minimum describe the training setup for each baseline.

Statistical validity: No confidence intervals, variance across runs, or significance tests are reported. With such a small test set (~169 samples), performance differences of 1-2 F1 points could easily be within noise. The paper reports results to three decimal places without any uncertainty quantification.
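One common way to supply the missing uncertainty estimate is a percentile bootstrap over test sentences. The sketch below is an illustration only: the per-sentence counts are synthetic, chosen to give a test set of 169 sentences, and the function names `micro_f1` and `bootstrap_ci` are hypothetical. It is not derived from the paper's data.

```python
import random


def micro_f1(counts):
    """Micro-F1 from a list of per-sentence (tp, fp, fn) entity counts."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(counts, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for micro-F1, resampling sentences."""
    rng = random.Random(seed)
    n = len(counts)
    scores = sorted(
        micro_f1([counts[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return scores[lo_idx], scores[hi_idx]


if __name__ == "__main__":
    # Synthetic stand-in for a 169-sentence test set (NOT the paper's data).
    counts = [(3, 0, 0)] * 120 + [(2, 1, 1)] * 40 + [(1, 1, 1)] * 9
    point = micro_f1(counts)
    low, high = bootstrap_ci(counts)
    print(f"micro-F1 = {point:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

If the interval obtained for one system overlaps heavily with that of another, a gap of one or two F1 points between them should not be read as a real difference without a paired significance test.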

Missing ablations: While LoRA vs. LoRA+NEFTune is compared, there is no ablation on LoRA rank, NEFTune noise magnitude, number of epochs, or instruction template design. The authors acknowledge these as limitations but do not address them experimentally.

Entity-level results: The right panel of Figure 3 apparently shows entity-level F1, but no numerical table is provided for per-entity performance, making it impossible to verify claims about improvements on "difficult categories like Location, Person and Product."
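A per-entity table of the kind requested here is cheap to produce from gold and predicted spans. The following sketch assumes exact-match scoring on hypothetical `(sentence_id, span_text, entity_type)` tuples; the paper's actual matching criterion is not specified, and the toy data below is invented for illustration.

```python
from collections import Counter


def per_type_f1(gold, pred):
    """Per-entity-type F1 under exact match.

    `gold` and `pred` are sets of (sentence_id, span_text, entity_type).
    A prediction counts as correct only if the full tuple is in `gold`.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for item in pred:
        if item in gold:
            tp[item[2]] += 1
        else:
            fp[item[2]] += 1
    for item in gold - pred:
        fn[item[2]] += 1

    table = {}
    for entity_type in sorted(set(tp) | set(fp) | set(fn)):
        denom = 2 * tp[entity_type] + fp[entity_type] + fn[entity_type]
        table[entity_type] = 2 * tp[entity_type] / denom if denom else 0.0
    return table


def macro_f1(table):
    """Unweighted mean of the per-type F1 scores."""
    return sum(table.values()) / len(table) if table else 0.0


if __name__ == "__main__":
    # Toy example (invented): one Location is mislabelled as Company,
    # and one Company is missed entirely.
    gold = {
        (0, "Apple", "Company"),
        (0, "2023", "Date"),
        (1, "Paris", "Location"),
        (1, "Tesla", "Company"),
    }
    pred = {
        (0, "Apple", "Company"),
        (0, "2023", "Date"),
        (1, "Paris", "Company"),
    }
    table = per_type_f1(gold, pred)
    for entity_type, score in table.items():
        print(f"{entity_type:10s} F1 = {score:.3f}")
    print(f"macro-F1 = {macro_f1(table):.3f}")
```

Reporting such a table alongside the aggregate score would also yield macro-F1, which the paper's own limitations section notes is absent.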

3. Potential Impact

The practical impact is limited. Financial NER is a well-studied problem with established solutions (BiLSTM-CRF, BERT-based sequence labelers, etc.). The paper does not compare against any domain-specific financial NER model (e.g., FinBERT, models from shared tasks like FiNER). The dataset is not released, and the experimental setup lacks sufficient detail for meaningful reproducibility.

The broader methodological contribution — that LoRA+NEFTune helps with instruction-tuned LLMs — is already established in the cited literature (FinLoRA, NEFTune papers). This paper merely confirms these findings in a narrow setting.

4. Timeliness & Relevance

The paper is timely in the sense that DeepSeek-R1 is a recent model and parameter-efficient fine-tuning remains relevant. However, applying existing PEFT methods to new models as they appear is incremental work. The paper does not address current bottlenecks in financial NER (e.g., nested entities, cross-document coreference, temporal reasoning, multilingual financial text) in any meaningful way.

5. Strengths & Limitations

Strengths:

  • Clear presentation of the pipeline from data formatting to evaluation
  • The combination of LoRA and NEFTune is practical and accessible
  • Honest limitations section acknowledging dataset size, lack of macro-F1, and hyperparameter sensitivity
  • Multiple baselines spanning different model families

Limitations:

  • No methodological novelty — purely combinatorial application of existing techniques
  • Very small dataset with no provenance documentation or public release
  • Unfair or underspecified baseline comparisons
  • No statistical significance testing on a small test set
  • Several references ([9], [10], [12], [13]) appear to be self-citations or citations of collaborators' work in completely unrelated domains (gesture generation, beamforming, medical imaging), included seemingly to pad the reference list — this is a concerning practice
  • The Related Work section includes irrelevant paragraphs ("Other relevant work" on emoji features, organ-disease graphs, gesture generation, beamforming) that have no connection to the paper's contributions
  • Missing comparison with specialized financial NER systems
  • No error analysis beyond aggregate metrics
  • Inconsistent data statistics between tables

Additional Observations

The paper reads as an undergraduate or early-graduate course project rather than a research contribution. The writing follows a formulaic structure, and sections like 3.3 (Transformer Architecture) merely summarize the original Transformer paper without connecting it meaningfully to the contribution. The DeepSeek-R1-8B architecture description (Section 3.4) similarly reads as background exposition rather than methodological insight.

The inclusion of clearly irrelevant references is a red flag for publication integrity. References [9], [10], [12], and [13] cover medical imaging, gesture generation, sentiment analysis with emojis, and beamforming, none of which relate to financial NER, LLM fine-tuning, or any aspect of this work.

Summary

This is a straightforward application paper that combines existing techniques (LoRA, NEFTune, instruction tuning) on a recent model for a well-studied task. The experimental evaluation is insufficiently rigorous to support the claims made, the dataset is too small and undocumented, baselines are not fairly compared, and there is no novel contribution to methods or understanding. The paper's impact on the field is likely minimal.

Rating: 2.5 / 10
Significance 2 · Rigor 2 · Novelty 1.5 · Clarity 4.5

Generated Jun 10, 2026

Comparison History (26)

    Lost vs. When Do Data-Driven Systems Exhibit the Capability to Infer?

    Paper 1 addresses a novel and timely regulatory challenge—defining 'inference' under the EU AI Act—with a principled framework grounded in statistical learning theory. Its interdisciplinary contribution (law, ML, policy) has broad impact on AI governance affecting all EU-regulated AI systems. Paper 2 is an incremental engineering contribution applying existing techniques (LoRA, NEFTune) to a specific NER task with a small dataset, offering limited novelty beyond combining known methods on a new model. Paper 1's policy relevance and conceptual framework give it significantly greater potential impact.

    claude-opus-4-6 · Jun 11, 2026

    Lost vs. Skill-Augmented AI Agents for Medical Research Analysis: An Exploratory Multi-Model Human Evaluation in an NSCLC Transcriptomic Biomarker Task

    Paper 1 has higher potential scientific impact due to its timeliness and cross-field relevance: evaluating skill-augmented autonomous agents for biomedical research is broadly applicable to scientific discovery workflows and impacts reliability/safety of AI-assisted science. Its focus on medical research analysis and human/expert evaluation targets a high-stakes domain with real-world translational implications. While methodological rigor is limited (small sample, low inter-rater reliability, non-significant effects), the work is more novel than applying standard LoRA+NEFTune finetuning for domain NER (Paper 2), which is incremental and narrower in scope despite clearer quantitative gains.

    gpt-5.2 · Jun 11, 2026

    Won vs. A Reliable Fault Diagnosis Method Based on Belief Rule Base Consider Robustness Analysis

    Paper 2 addresses the timely and broadly relevant topic of adapting large language models for domain-specific NLP tasks using parameter-efficient fine-tuning (LoRA) and NEFTune. The combination of a recent open-source model (DeepSeek-R1-8B) with practical techniques for financial NER has wider applicability across NLP and finance. Paper 1 addresses a more niche topic in fault diagnosis using belief rule bases with robustness analysis, which has a narrower audience. Paper 2's methodology is more transferable to other domains and aligns with the current surge of interest in LLM adaptation, giving it higher citation and impact potential.

    claude-opus-4-6 · Jun 10, 2026

    Lost vs. ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics

    Paper 2 likely has higher scientific impact: it introduces a new, broadly useful benchmark (ComBench) targeting a timely, unsolved weakness in LLMs—rigorous proof reasoning and constructive combinatorics—with a careful evaluation protocol (rubric-guided grading + deterministic verification). Benchmarks often catalyze progress across many labs and subfields (LLM evaluation, reasoning, formal methods, education), and the results indicate it is not saturated. Paper 1 is a solid domain adaptation study but is more incremental (standard LoRA/NEFTune) with limited dataset size and narrower applicability (financial NER).

    gpt-5.2 · Jun 10, 2026

    Lost vs. What Fits (Into Few Tokens) Doesn't Overfit: Compression and Generalization in ML Research Agents

    Paper 1 offers a novel, falsifiable framework linking benchmark overfitting to description length via two explicit information bottlenecks in LLM research agents, tested across diverse modalities and tasks. This has broad implications for ML evaluation, agent design, and generalization theory, making it timely and potentially field-shaping. Paper 2 is a solid applied study (LoRA+NEFTune) for financial NER with incremental methodological novelty and narrower domain impact; results are useful but less likely to generalize broadly or redefine understanding.

    gpt-5.2 · Jun 10, 2026

    Lost vs. Reasoning or Memorization? Direction-Aware Diversity Exploration in LLM Reinforcement Learning

    Paper 1 proposes a novel algorithmic framework (DiRL) to address a fundamental challenge in LLM reinforcement learning: distinguishing genuine reasoning from memorization during exploration. This addresses a critical bottleneck in advancing LLM reasoning capabilities, offering broad applicability across various models and tasks. In contrast, Paper 2 is an application-focused study that applies existing techniques (LoRA, NEFTune) to a specific domain (Financial NER). Therefore, Paper 1 has significantly higher methodological innovation and potential for widespread impact across the broader AI research community.

    gemini-3.1-pro-preview · Jun 10, 2026

    Lost vs. From Context-Aware to Conflict-Aware: Generalizing Contrastive Decoding for Knowledge Conflict in LLMs

    Paper 1 offers a more novel, general contribution: a conflict-aware decoding framework for RAG-style generation, theoretical analysis of a logit “power family” and its regime asymmetry, a new evaluation protocol (TriState-Bench), and an adaptive routing method improving resistance without harming other regimes. This targets a core reliability problem in LLM deployment and is broadly applicable across tasks and domains. Paper 2 is a solid applied study (LoRA+NEFTune for financial NER) but is mainly incremental and domain-specific, with limited methodological novelty and narrower cross-field impact.

    gpt-5.2 · Jun 10, 2026

    Lost vs. WISE-HAR: A Generalizable Ensemble Deep Learning Framework for WiFi-Based Human Activity Recognition

    Paper 2 addresses a broader and more impactful problem (WiFi-based HAR) with wider real-world applications across smart homes, healthcare, and security. It proposes a more comprehensive framework with multiple novel contributions (ensemble learning, augmentation strategies, cross-scenario/cross-antenna generalization evaluation). Paper 1 applies existing techniques (LoRA, NEFTune) to a narrow financial NER task on a small dataset (1693 samples) with incremental improvements, offering limited novelty beyond combining known methods. Paper 2's cross-domain generalization analysis and privacy-preserving sensing paradigm have broader interdisciplinary impact.

    claude-opus-4-6 · Jun 10, 2026

    Lost vs. Structure from Reasoning, Numbers from Search: On-Premise Open LLMs as Structural Priors for Coupled MIMO Controller Tuning

    Paper 2 presents a highly novel intersection of LLMs and control theory, demonstrating that LLMs can act as structural priors for complex MIMO systems. This cross-disciplinary approach offers significant methodological insights and broad impact potential in engineering. In contrast, Paper 1 describes a straightforward application of existing NLP fine-tuning techniques (LoRA, NEFTune) to a specific domain (financial NER), which, while useful, provides less fundamental scientific innovation.

    gemini-3.1-pro-preview · Jun 10, 2026

    Lost vs. Trace2Policy: From Expert Behavior Traces to Self-Evolving Decision Agents

    Trace2Policy introduces a novel, generalizable framework (EISR) for extracting and iteratively refining expert decision rules into deterministic, interpretable policies—validated through real-world deployment (3,349 cases over 22 days) and across multiple benchmarks. It addresses a fundamental question about rule quality vs. model capability, offers practical cost savings, and has broad applicability across compliance-sensitive domains. Paper 2 applies existing techniques (LoRA, NEFTune) to a single financial NER task with a small dataset, representing incremental engineering rather than conceptual innovation.

    claude-opus-4-6 · Jun 10, 2026