
AI Native Developer News

AI development tools, research, and industry news — clustered and ranked by importance.

An Empirical Recipe for Universal Phone Recognition

The research paper presents PhoneticXEUS, a phone recognition model trained on extensive multilingual data, achieving state-of-the-art performance with 17.7% PFER on multilingual tasks and 10.6% PFER on accented English speech. The study identifies key factors that affect performance in multilingual phone recognition, including data scale, architecture, and training objectives. By conducting controlled ablations across 100+ languages, the paper quantifies the effects of SSL representations and analyzes error patterns related to language families and articulatory features. The authors have made all data and code openly accessible for further research.

arXiv CS.CL·2h ago
ai-research
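The PFER figures above are phone-level error rates. As a rough, hypothetical illustration (a plain edit-distance-based phone error rate over phone sequences, not necessarily PhoneticXEUS's exact PFER definition), such a metric can be computed as:

```python
def edit_distance(ref, hyp):
    # Classic Levenshtein distance via dynamic programming.
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[m][n]

def phone_error_rate(ref_phones, hyp_phones):
    # Errors (substitutions + insertions + deletions) over reference length.
    return edit_distance(ref_phones, hyp_phones) / max(len(ref_phones), 1)

ref = ["k", "ae", "t"]        # reference phones for "cat"
hyp = ["k", "ah", "t", "s"]   # one substitution, one insertion
print(phone_error_rate(ref, hyp))  # 2 errors / 3 phones ≈ 0.667
```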
Towards Explainable Stakeholder-Aware Requirements Prioritisation in Aged-Care Digital Health

This paper explores the human aspects shaping requirement prioritization in aged-care digital health through a mixed-methods study involving 103 older adults, 105 developers, and 41 caregivers. By employing explainable machine learning, the study identified key human factors related to requirement priorities across eight themes, revealing significant misalignment among stakeholder groups. The research contributes an explainable, human-centric requirements engineering framework, enhancing the inclusiveness of requirements analysis by explicitly engaging various stakeholder perspectives.

arXiv CS.SE·2h ago
ai-research
Knowledge database development by large language models for countermeasures against viruses and marine toxins

This research paper presents the development of comprehensive databases for therapeutic countermeasures against five viruses (Lassa, Marburg, Ebola, Nipah, and Venezuelan equine encephalitis) and marine toxins by utilizing two large language models (LLMs), ChatGPT and Grok. The LLMs were used to identify relevant public databases, collect pertinent information, and cross-validate this data to create user-friendly interactive webpages. The study emphasizes the effectiveness of LLMs in building scalable and updatable knowledge databases that facilitate evidence-based decision-making in medical research.

arXiv CS.AI·2h ago
ai-research
GISTBench: Evaluating LLM User Understanding via Evidence-Based Interest Verification

The article introduces GISTBench, a benchmark designed for evaluating Large Language Models' (LLMs) capabilities in understanding user interests within recommendation systems through their interaction histories. It features two new metric families: Interest Groundedness (IG), which includes precision and recall components to penalize hallucination while rewarding coverage, and Interest Specificity (IS), which assesses distinctiveness of LLM-generated user profiles. A synthetic dataset based on real user interactions is released, containing implicit and explicit engagement signals, with validation against user surveys. The evaluation of eight open-weight LLMs, ranging from 7B to 120B parameters, uncovers significant performance bottlenecks, particularly in counting and attributing engagement signals.

arXiv CS.AI·2h ago
ai-researchai-models
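Groundedness metrics of this precision/recall flavor can be sketched as set overlap between the interests a model asserts and those supported by evidence; the set representation and function names below are illustrative, not GISTBench's actual formulation:

```python
def interest_groundedness(predicted, evidenced):
    """Precision penalizes hallucinated interests; recall rewards coverage.
    `predicted`: interests the LLM asserts about a user.
    `evidenced`: interests supported by the user's interaction history.
    (Illustrative set-based sketch, not GISTBench's specification.)"""
    predicted, evidenced = set(predicted), set(evidenced)
    hits = predicted & evidenced
    precision = len(hits) / len(predicted) if predicted else 0.0
    recall = len(hits) / len(evidenced) if evidenced else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f1 = interest_groundedness(
    {"hiking", "jazz", "quantum computing"},   # model claims three interests
    {"hiking", "jazz", "cooking"})             # history supports these three
print(p, r, f1)  # 2/3 precision, 2/3 recall
```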
SciVisAgentBench: A Benchmark for Evaluating Scientific Data Analysis and Visualization Agents

The article presents SciVisAgentBench, a new benchmark for evaluating scientific data analysis and visualization agents, developed in response to the need for a principled evaluation framework in this rapidly evolving field. This benchmark includes 108 expert-crafted cases covering multiple scenarios and is structured around four dimensions: application domain, data type, complexity level, and visualization operation. It introduces a multimodal evaluation pipeline that combines human judgment with various deterministic evaluation methods. A validity study involving 12 SciVis experts was conducted to explore the agreement between human and LLM judges, establishing initial baselines and identifying capability gaps in current SciVis agents.

arXiv CS.AI·2h ago
ai-researchai-models
ChartDiff: A Large-Scale Benchmark for Comprehending Pairs of Charts

ChartDiff presents a large-scale benchmark for cross-chart comparative summarization, comprising 8,541 chart pairs sourced from diverse datasets and visual styles. The benchmark includes LLM-generated and human-verified summaries that assess differences in trends, fluctuations, and anomalies. Evaluation results indicate that general-purpose models achieve the highest GPT-based quality, while specialized models perform better in ROUGE scores but struggle with human-aligned evaluation. The study finds that multi-series charts pose challenges regardless of model type, highlighting the difficulties in comparative chart reasoning for current vision-language models.

arXiv CS.AI·2h ago
ai-research
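The ROUGE scores mentioned above are n-gram overlap metrics; a minimal ROUGE-1 (unigram) precision/recall/F sketch looks like:

```python
from collections import Counter

def rouge_1(reference, candidate):
    # ROUGE-1: clipped unigram overlap between reference and candidate text.
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # per-word matches, clipped by count
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f

p, r, f = rouge_1("series a rises while series b falls",
                  "series a rises and series b drops")
print(p, r, f)  # 5 of 7 unigrams match in both directions
```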
Compiling Code LLMs into Lightweight Executables

The paper introduces Ditto, a method that optimizes Code LLMs for local execution by reducing model size and enhancing inference performance for statically-typed languages like C. Ditto employs a model compression technique inspired by product quantization and integrates a compilation pass into LLVM that replaces unoptimized GEMV operations with implementations from BLAS libraries. The method achieves 10.5x faster inference and 6.4x lower memory usage compared to the original inference pipelines, with only a 0.27% drop in pass@1 accuracy when evaluated on three popular Code LLMs.

arXiv CS.SE·2h ago
ai-research
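Ditto's compression is described only as product-quantization-inspired; the core PQ idea, shown here with hand-picked toy codebooks (real systems learn them, typically with k-means), is to store small codeword indices instead of raw weight sub-vectors:

```python
def pq_encode(vector, codebooks, sub_dim):
    # Split the vector into sub-vectors; for each, store the index of the
    # nearest codeword in that subspace's codebook (squared-L2 distance).
    codes = []
    for i, start in enumerate(range(0, len(vector), sub_dim)):
        sub = vector[start:start + sub_dim]
        dists = [sum((a - b) ** 2 for a, b in zip(sub, cw))
                 for cw in codebooks[i]]
        codes.append(dists.index(min(dists)))
    return codes

def pq_decode(codes, codebooks):
    # Reconstruct an approximation by concatenating the chosen codewords.
    out = []
    for i, code in enumerate(codes):
        out.extend(codebooks[i][code])
    return out

# Toy example: a 4-dim weight row, two 2-dim subspaces, 2 codewords each,
# so each sub-vector is replaced by a single 1-bit index.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],    # codebook for dims 0-1
    [[0.5, -0.5], [-1.0, 2.0]],  # codebook for dims 2-3
]
w = [0.9, 1.1, -0.8, 1.7]
codes = pq_encode(w, codebooks, sub_dim=2)
print(codes, pq_decode(codes, codebooks))  # [1, 1] → [1.0, 1.0, -1.0, 2.0]
```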
How and Why Agents Can Identify Bug-Introducing Commits

The paper discusses how LLM-based agents improve the identification of bug-introducing commits from fix commits in software repositories, raising the F1-score from 0.64 to 0.81 on the Linux kernel dataset. This 17-percentage-point gain comes two decades after the original method proposed by Śliwerski, Zimmermann, and Zeller in 2005. The study reveals that agents can efficiently search large candidate sets for patterns derived from fix commit diffs and messages, potentially paving the way for advancements in bug detection and root cause analysis.

arXiv CS.SE·2h ago
ai-research
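The classic heuristic this line of work builds on (SZZ) maps each line deleted by a fix commit back to the commit that last modified it, typically via `git blame`. A toy version over a pre-computed blame table (the `blame` dict is a stand-in for real `git blame` output):

```python
def szz_candidates(fix_deleted_lines, blame):
    # SZZ heuristic: lines a fix commit deletes were presumably buggy, so the
    # commits that last touched those lines are bug-introducing candidates.
    # `fix_deleted_lines`: {file path: [line numbers the fix deleted]}
    # `blame`: {(file path, line number): commit that last modified the line}
    candidates = set()
    for path, lines in fix_deleted_lines.items():
        for line in lines:
            commit = blame.get((path, line))
            if commit is not None:
                candidates.add(commit)
    return candidates

blame = {("parser.c", 10): "abc123", ("parser.c", 11): "abc123",
         ("lexer.c", 42): "def456"}
fix = {"parser.c": [10, 11], "lexer.c": [42]}
print(sorted(szz_candidates(fix, blame)))  # ['abc123', 'def456']
```

The agents described above go beyond this lookup by reasoning over diffs and commit messages, but the candidate set they search is of this shape.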
Self-Improving Code Generation via Semantic Entropy and Behavioral Consensus

The paper introduces ConSelf, a novel approach for self-improving code generation without external supervision, particularly focusing on large language models. It leverages code semantic entropy as a new metric to gauge functional diversity of program behaviors, allowing for the construction of a curriculum based on the most learnable problems. Additionally, it employs consensus-driven direct preference optimization (Con-DPO) to improve fine-tuning by mitigating noise from self-generated supervision. Empirical results show that ConSelf significantly surpasses baseline methods, demonstrating its effectiveness in enhancing code generation capabilities across various benchmarks.

arXiv CS.SE·2h ago
ai-research
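A semantic-entropy signal of this flavor can be sketched by clustering sampled programs by their observable input/output behavior and taking the Shannon entropy of the cluster distribution; the exact ConSelf formulation may differ:

```python
import math
from collections import Counter

def behavioral_entropy(programs, test_inputs):
    # Group sampled programs by behavioral signature (tuple of outputs on
    # shared test inputs), then compute Shannon entropy over the groups.
    # Higher entropy = the model's samples disagree more on behavior.
    signatures = []
    for prog in programs:
        outputs = []
        for x in test_inputs:
            try:
                outputs.append(prog(x))
            except Exception:
                outputs.append("<error>")  # crashes form their own behavior
        signatures.append(tuple(outputs))
    counts = Counter(signatures)
    total = len(programs)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Three sampled "solutions" to one task; the first two behave identically.
progs = [lambda x: x * 2, lambda x: x + x, lambda x: x ** 2]
print(behavioral_entropy(progs, [1, 2, 3]))  # ≈ 0.918 bits (2-vs-1 split)
```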
Sustainable AI Assistance Through Digital Sobriety

This research article investigates the sustainability of AI assistants by analyzing a sample of user prompts in software development, revealing that nearly 50% of queries are unnecessary relative to their expected benefit. It identifies factoid-style questions as the largest portion of these unnecessary requests, suggesting that many current uses of AI could be served by lower-cost alternatives such as traditional search. The study advocates for further investigation into user behavior and interface nudges that promote more energy-efficient AI usage, and the authors call for additional research to replicate and expand their preliminary findings on AI sustainability from a social perspective.

arXiv CS.SE·2h ago
ai-research
Logging Like Humans for LLMs: Rethinking Logging via Execution and Runtime Feedback

The study introduces ReLog, an iterative logging generation framework that uses runtime feedback to enhance the utility of logs in software debugging. Unlike traditional methods that evaluate logs by their similarity to developer-written logs, ReLog assesses logs through downstream debugging tasks, including defect localization and repair. Experimental results indicate that ReLog outperforms all baselines, achieving an F1 score of 0.520 in direct debugging settings and 0.408 in indirect settings where source code is unavailable. The framework also generalizes across multiple LLMs and highlights the importance of iterative refinement in the logging process.

arXiv CS.SE·2h ago
ai-research
Kwame 2.0: Human-in-the-Loop Generative AI Teaching Assistant for Large Scale Online Coding Education in Africa

Kwame 2.0 is a bilingual generative AI teaching assistant deployed in a human-in-the-loop forum within SuaCode, a coding education initiative for learners across Africa, with a 15-month longitudinal study involving 3,717 enrollments across 35 countries. The system retrieves relevant materials and generates accurate responses, with evaluations showing it provided high-quality support for curriculum-related questions. Community feedback highlighted the effectiveness of human facilitators in mitigating errors, particularly in administrative queries, showcasing the scalability of AI combined with reliable human oversight.

arXiv CS.CL·2h ago
ai-research
MemRerank: Preference Memory for Personalized Product Reranking

The paper introduces MemRerank, a preference memory framework designed to enhance personalized product reranking for LLM-based shopping agents. By distilling user purchase history into concise, query-independent signals, MemRerank improves memory quality and reranking utility. Using reinforcement learning for training, the framework shows a performance increase of up to +10.61 absolute points in 1-in-5 accuracy when compared to traditional methods. An end-to-end benchmark and evaluation framework was developed to assess the system's effectiveness.

arXiv CS.CL·2h ago
ai-research
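A preference-memory reranker of this general shape can be sketched as scoring candidates by overlap with distilled, query-independent preference signals; the signal format and scoring below are illustrative, not MemRerank's actual design:

```python
def rerank(candidates, memory):
    # `memory`: preference signals distilled from purchase history,
    # e.g. weighted attribute tags (hypothetical format).
    # Each candidate carries attribute tags; score = summed signal weights.
    def score(item):
        return sum(memory.get(tag, 0.0) for tag in item["tags"])
    return sorted(candidates, key=score, reverse=True)

memory = {"brand:acme": 0.9, "feature:wireless": 0.6, "color:red": 0.2}
candidates = [
    {"id": "A", "tags": {"brand:other", "feature:wireless"}},
    {"id": "B", "tags": {"brand:acme", "feature:wireless"}},
    {"id": "C", "tags": {"color:red"}},
]
print([item["id"] for item in rerank(candidates, memory)])  # ['B', 'A', 'C']
```

The paper's contribution is learning such signals (via reinforcement learning) rather than hand-specifying them; this sketch only shows where distilled memory plugs into reranking.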
Latest

  • Visual Studio Code 1.114
    VS Code Blog·657m ago
  • Improve coding agents’ performance with Gemini API Docs MCP and Agent Skills.
    Google Developers Blog·411m ago
  • Wherefore Art Thou? Provenance-Guided Automatic Online Debugging with Lumos
    arXiv CS.SE·2h ago
  • Webscraper: Leverage Multimodal Large Language Models for Index-Content Web Scraping
    arXiv CS.AI·2h ago
  • GISTBench: Evaluating LLM User Understanding via Evidence-Based Interest Verification
    arXiv CS.AI·2h ago
  • SciVisAgentBench: A Benchmark for Evaluating Scientific Data Analysis and Visualization Agents
    arXiv CS.AI·2h ago
  • SyriSign: A Parallel Corpus for Arabic Text to Syrian Arabic Sign Language Translation
    arXiv CS.CL·2h ago
  • Compiling Code LLMs into Lightweight Executables
    arXiv CS.SE·2h ago
  • HackRep: A Large-Scale Dataset of GitHub Hackathon Projects
    arXiv CS.SE·2h ago
  • Dual Perspectives in Emotion Attribution: A Generator-Interpreter Framework for Cross-Cultural Analysis of Emotion in LLMs
    arXiv CS.CL·2h ago
  • From Consensus to Split Decisions: ABC-Stratified Sentiment in Holocaust Oral Histories
    arXiv CS.CL·2h ago
  • Practical Feasibility of Sustainable Software Engineering Tools and Techniques
    arXiv CS.SE·2h ago
  • ChartDiff: A Large-Scale Benchmark for Comprehending Pairs of Charts
    arXiv CS.AI·2h ago
  • Long-Document QA with Chain-of-Structured-Thought and Fine-Tuned SLMs
    arXiv CS.CL·2h ago
  • Concept Training for Human-Aligned Language Models
    arXiv CS.CL·2h ago
  • BayesInsights: Modelling Software Delivery and Developer Experience with Bayesian Networks at Bloomberg
    arXiv CS.SE·2h ago
  • SkillReducer: Optimizing LLM Agent Skills for Token Efficiency
    arXiv CS.SE·2h ago
  • Machine Learning in the Wild: Early Evidence of Non-Compliant ML-Automation in Open-Source Software
    arXiv CS.SE·2h ago
  • EcoScratch: Cost-Effective Multimodal Repair for Scratch Using Execution Feedback
    arXiv CS.SE·2h ago
  • How and Why Agents Can Identify Bug-Introducing Commits
    arXiv CS.SE·2h ago
  • Self-Improving Code Generation via Semantic Entropy and Behavioral Consensus
    arXiv CS.SE·2h ago
  • Sustainable AI Assistance Through Digital Sobriety
    arXiv CS.SE·2h ago
  • Software Vulnerability Detection Using a Lightweight Graph Neural Network
    arXiv CS.SE·2h ago
  • Designing FSMs Specifications from Requirements with GPT 4.0
    arXiv CS.SE·2h ago
  • Logging Like Humans for LLMs: Rethinking Logging via Execution and Runtime Feedback
    arXiv CS.SE·2h ago
  • Kwame 2.0: Human-in-the-Loop Generative AI Teaching Assistant for Large Scale Online Coding Education in Africa
    arXiv CS.CL·2h ago
  • CADEL: A Corpus of Administrative Web Documents for Japanese Entity Linking
    arXiv CS.CL·2h ago
  • SiPaKosa: A Comprehensive Corpus of Canonical and Classical Buddhist Texts in Sinhala and Pali
    arXiv CS.CL·2h ago
  • MemRerank: Preference Memory for Personalized Product Reranking
    arXiv CS.CL·2h ago
  • The Thiomi Dataset: A Large-Scale Multimodal Corpus for Low-Resource African Languages
    arXiv CS.CL·2h ago