AI Native Medhavi

AI Native Developer News

AI development tools, research, and industry news — clustered and ranked by importance.

The Kitchen Loop: User-Spec-Driven Development for a Self-Evolving Codebase

The Kitchen Loop framework enables autonomous, self-evolving codebases guided by user specifications and robust verification processes. The approach targets the bottleneck of deciding what to build, aiming to maintain high code quality and continuous improvement through autonomous mechanisms.

arXiv CS.SE·5d ago
ai-coding-tools · ai-infra · ai-research
Stanford study outlines dangers of asking AI chatbots for personal advice

A new study conducted by Stanford computer scientists highlights the potential dangers associated with AI chatbots providing personal advice. The research measures AI's tendency to exhibit sycophantic behaviors, which could lead to harmful decision-making by users relying on these chatbots for personal guidance. It underscores the necessity of caution when users seek emotional or personal advice from AI, especially in contexts where nuanced understanding is crucial. The study's findings could have significant implications for how AI systems are designed and regulated to ensure user safety.

TechCrunch - AI·3d ago
ai-research
PerturbationDrive: A Framework for Perturbation-Based Testing of ADAS

PerturbationDrive is a new framework designed to enhance the testing of Advanced Driver Assistance Systems (ADAS) by evaluating their robustness against various image perturbations. This development is crucial for AI developers focused on creating reliable AI systems that can perform safely in diverse real-world conditions.

arXiv CS.SE·6d ago
ai-research
Prune as You Generate: Online Rollout Pruning for Faster and Better RLVR

The introduction of arrol, an online rollout pruning method, enhances the efficiency and accuracy of Reinforcement Learning with Verifiable Rewards (RLVR) in Large Language Models. By allowing early pruning of rollouts during generation, it significantly speeds up training and improves accuracy, thus providing developers with a more efficient approach to optimizing LLMs.

arXiv CS.CL·5d ago
ai-coding-tools · ai-models · ai-research
A-SelecT: Automatic Timestep Selection for Diffusion Transformer Representation Learning

The paper presents A-SelecT, an innovative technique for automatic timestep selection in Diffusion Transformer (DiT) representation learning, aimed at enhancing training efficiency and representational capacity. A-SelecT identifies the most informative timesteps during a single run, effectively removing the need for extensive exhaustive searches. Experimental results indicate that DiT, augmented by A-SelecT, outperforms previous diffusion models in both classification and segmentation tasks. This advancement highlights its potential for improving discriminative tasks through enhanced generative pre-training.

arXiv CS.AI·2d ago
ai-research
A Time-Consistent Benchmark for Repository-Level Software Engineering Evaluation

The article presents a new benchmark methodology for evaluating repository-aware software engineering systems, addressing issues of synthetic task design and prompt leakage. It uses a time-consistent approach: code knowledge generated before a specific time (T0) is evaluated on engineering tasks derived from pull requests after that time. The study reports baseline results from three Claude-family models on two open-source repositories, DragonFly and React, with file-level F1 scores of 0.8081 and 0.8078 respectively, and identifies prompt construction as an important benchmark variable.

arXiv CS.SE·2d ago
ai-research
A Benchmark for Evaluating Repository-Level Code Agents with Intermediate Reasoning on Feature Addition Task

The article introduces RACE-bench, a benchmark for evaluating repository-level code agents on feature-addition tasks, comprising 528 instances sourced from 12 open-source repositories. The framework evaluates agents not only on patch correctness but also on the quality of their intermediate reasoning, revealing success rates ranging from 29% to 70% across agents. The analysis indicates that while agents are good at understanding high-level intent, they struggle significantly to translate that intent into actionable implementation steps, showing a 35.7% decrease in reasoning recall and a 94.1% increase in over-prediction in cases where the patch applied successfully but tests failed. These findings emphasize the need to evaluate code agents beyond the correctness of final outputs.

arXiv CS.SE·2d ago
ai-research
SWE-PRBench: Benchmarking AI Code Review Quality Against Pull Request Feedback

The article introduces SWE-PRBench, a benchmark of 350 pull requests for evaluating AI code review quality. The assessments reveal that eight advanced models detect only 15-31% of issues flagged by human reviewers, indicating that AI code review remains far less effective than human review. The study examined three configurations for providing context, finding that models consistently underperformed as context increased. Notably, the best-performing models achieved mean scores between 0.147 and 0.153, with a clear gap to the remaining models, which scored 0.113 or lower. The dataset and evaluation framework are publicly accessible.

arXiv CS.SE·2d ago
ai-research · ai-models
Understanding NPM Malicious Package Detection: A Benchmark-Driven Empirical Analysis

This empirical analysis investigates the detection of malicious packages in the NPM ecosystem, which has become increasingly targeted by software supply chain attacks. A dataset of 6,420 malicious and 7,288 benign packages was created, categorizing behaviors across 11 types and evaluating eight detection tools. Key findings include that GuardDog achieved the highest balance in performance with a 93.32% F1 score, and effective tool combination strategies can reach accuracies of 96.08%. The analysis also highlights that the lack of mandatory scanning allows most malware to operate without evasion techniques.

arXiv CS.SE·1d ago
ai-research
The Specification as Quality Gate: Three Hypotheses on AI-Assisted Code Review

The paper presents three hypotheses on the efficacy of AI-assisted code review, arguing against relying on AI reviewers without external executable specifications. It suggests that correlated errors in homogeneous large language model (LLM) pipelines reinforce rather than mitigate issues, supported by experiments involving models such as Claude and a planted-bug corpus. It further posits that executable specifications can change the complexity of the problem domain, leaving AI review as a tool primarily for residual defect classes. The argument advocates a layered approach to code review: specifications first, then verification, with AI review as the final step.

arXiv CS.SE·2d ago
ai-research
DesignWeaver: Dimensional Scaffolding for Text-to-Image Product Design

The paper presents DesignWeaver, a novel interface designed to assist novice product designers in crafting effective prompts for text-to-image models. A study with 52 novice participants demonstrated that using DesignWeaver allowed them to create longer and more domain-specific prompts, leading to a generation of diverse and innovative product designs. However, the increased complexity of the prompts also raised participants' expectations, which current models struggled to meet. The findings highlight the importance of visual references in design discussions and the potential of AI-based tools to support novice designers.

arXiv CS.AI·2d ago
ai-research
The least surprising chapter of the Manus story is what’s happening right now

The article discusses the ongoing implications of a recent tie-up in the AI landscape, hinting at a reckoning for developers and businesses involved. This is significant for AI developers as it may influence regulatory actions and industry standards moving forward.

TechCrunch - AI·6d ago
ai-news · ai-research
We Rewrote JSONata with AI in a Day, Saved $500K/Year

The article discusses how AI was used to rapidly rewrite JSONata, resulting in significant cost savings of $500k per year. This demonstrates the potential of AI-driven development tools to enhance productivity and reduce operational expenses for software projects.

Simon Willison·5d ago
ai-coding-tools · ai-research · open-source

Latest

  • Visual Studio Code 1.114
    VS Code Blog · 569m ago
  • Improve coding agents’ performance with Gemini API Docs MCP and Agent Skills.
    Google Developers Blog · 323m ago
  • Wherefore Art Thou? Provenance-Guided Automatic Online Debugging with Lumos
    arXiv CS.SE · 3h ago
  • Webscraper: Leverage Multimodal Large Language Models for Index-Content Web Scraping
    arXiv CS.AI · 3h ago
  • GISTBench: Evaluating LLM User Understanding via Evidence-Based Interest Verification
    arXiv CS.AI · 3h ago

  • SciVisAgentBench: A Benchmark for Evaluating Scientific Data Analysis and Visualization Agents
    arXiv CS.AI · 3h ago
  • SyriSign: A Parallel Corpus for Arabic Text to Syrian Arabic Sign Language Translation
    arXiv CS.CL · 3h ago
  • Compiling Code LLMs into Lightweight Executables
    arXiv CS.SE · 3h ago
  • HackRep: A Large-Scale Dataset of GitHub Hackathon Projects
    arXiv CS.SE · 3h ago
  • Dual Perspectives in Emotion Attribution: A Generator-Interpreter Framework for Cross-Cultural Analysis of Emotion in LLMs
    arXiv CS.CL · 3h ago
  • From Consensus to Split Decisions: ABC-Stratified Sentiment in Holocaust Oral Histories
    arXiv CS.CL · 3h ago
  • Practical Feasibility of Sustainable Software Engineering Tools and Techniques
    arXiv CS.SE · 3h ago
  • ChartDiff: A Large-Scale Benchmark for Comprehending Pairs of Charts
    arXiv CS.AI · 3h ago
  • Long-Document QA with Chain-of-Structured-Thought and Fine-Tuned SLMs
    arXiv CS.CL · 3h ago
  • Concept Training for Human-Aligned Language Models
    arXiv CS.CL · 3h ago
  • BayesInsights: Modelling Software Delivery and Developer Experience with Bayesian Networks at Bloomberg
    arXiv CS.SE · 3h ago
  • SkillReducer: Optimizing LLM Agent Skills for Token Efficiency
    arXiv CS.SE · 3h ago
  • Machine Learning in the Wild: Early Evidence of Non-Compliant ML-Automation in Open-Source Software
    arXiv CS.SE · 3h ago
  • EcoScratch: Cost-Effective Multimodal Repair for Scratch Using Execution Feedback
    arXiv CS.SE · 3h ago
  • How and Why Agents Can Identify Bug-Introducing Commits
    arXiv CS.SE · 3h ago
  • Self-Improving Code Generation via Semantic Entropy and Behavioral Consensus
    arXiv CS.SE · 3h ago
  • Sustainable AI Assistance Through Digital Sobriety
    arXiv CS.SE · 3h ago
  • Software Vulnerability Detection Using a Lightweight Graph Neural Network
    arXiv CS.SE · 3h ago
  • Designing FSMs Specifications from Requirements with GPT 4.0
    arXiv CS.SE · 3h ago
  • Logging Like Humans for LLMs: Rethinking Logging via Execution and Runtime Feedback
    arXiv CS.SE · 3h ago
  • Kwame 2.0: Human-in-the-Loop Generative AI Teaching Assistant for Large Scale Online Coding Education in Africa
    arXiv CS.CL · 3h ago
  • CADEL: A Corpus of Administrative Web Documents for Japanese Entity Linking
    arXiv CS.CL · 3h ago
  • SiPaKosa: A Comprehensive Corpus of Canonical and Classical Buddhist Texts in Sinhala and Pali
    arXiv CS.CL · 3h ago
  • MemRerank: Preference Memory for Personalized Product Reranking
    arXiv CS.CL · 3h ago
  • The Thiomi Dataset: A Large-Scale Multimodal Corpus for Low-Resource African Languages
    arXiv CS.CL · 3h ago