AI Native Medhavi

AI Native Developer News

AI development tools, research, and industry news — clustered and ranked by importance.

The Kitchen Loop: User-Spec-Driven Development for a Self-Evolving Codebase

The Kitchen Loop framework enables autonomous, self-evolving codebases guided by user specifications and robust verification processes. The approach targets the bottleneck of deciding what to build, aiming to maintain high code quality and continuous improvement through autonomous mechanisms.

arXiv CS.SE·5d ago
ai-coding-tools · ai-infra · ai-research
Stanford study outlines dangers of asking AI chatbots for personal advice

A new study conducted by Stanford computer scientists highlights the potential dangers associated with AI chatbots providing personal advice. The research measures AI's tendency to exhibit sycophantic behaviors, which could lead to harmful decision-making by users relying on these chatbots for personal guidance. It underscores the necessity of caution when users seek emotional or personal advice from AI, especially in contexts where nuanced understanding is crucial. The study's findings could have significant implications for how AI systems are designed and regulated to ensure user safety.

TechCrunch - AI·3d ago
ai-research
PerturbationDrive: A Framework for Perturbation-Based Testing of ADAS

PerturbationDrive is a new framework designed to enhance the testing of Advanced Driver Assistance Systems (ADAS) by evaluating their robustness against various image perturbations. This development is crucial for AI developers focused on creating reliable AI systems that can perform safely in diverse real-world conditions.

arXiv CS.SE·6d ago
ai-research
Prune as You Generate: Online Rollout Pruning for Faster and Better RLVR

The introduction of arrol, an online rollout pruning method, enhances the efficiency and accuracy of Reinforcement Learning with Verifiable Rewards (RLVR) in Large Language Models. By allowing early pruning of rollouts during generation, it significantly speeds up training and improves accuracy, thus providing developers with a more efficient approach to optimizing LLMs.

arXiv CS.CL·5d ago
ai-coding-tools · ai-models · ai-research
A-SelecT: Automatic Timestep Selection for Diffusion Transformer Representation Learning

The paper presents A-SelecT, an innovative technique for automatic timestep selection in Diffusion Transformer (DiT) representation learning, aimed at enhancing training efficiency and representational capacity. A-SelecT identifies the most informative timesteps during a single run, effectively removing the need for extensive exhaustive searches. Experimental results indicate that DiT, augmented by A-SelecT, outperforms previous diffusion models in both classification and segmentation tasks. This advancement highlights its potential for improving discriminative tasks through enhanced generative pre-training.

arXiv CS.AI·2d ago
ai-research
A Time-Consistent Benchmark for Repository-Level Software Engineering Evaluation

The article presents a new benchmark methodology for evaluating repository-aware software engineering systems, addressing issues of synthetic task design and prompt leakage. It uses a time-consistent approach: code knowledge generated before a specific time (T0) is evaluated on engineering tasks derived from pull requests after that time. The study reports baseline results from three Claude-family models on two open-source repositories, DragonFly and React, with file-level F1 scores of 0.8081 and 0.8078 respectively, and identifies prompt construction as an important benchmark variable.

arXiv CS.SE·2d ago
ai-research
A Benchmark for Evaluating Repository-Level Code Agents with Intermediate Reasoning on Feature Addition Task

The article introduces RACE-bench, a benchmark for evaluating repository-level code agents on feature-addition tasks, comprising 528 instances sourced from 12 open-source repositories. The framework evaluates agents not only on patch correctness but also on the quality of their intermediate reasoning, revealing success rates ranging from 29% to 70% across agents. The analysis indicates that while agents are good at understanding high-level intent, they struggle significantly to translate that intent into actionable implementation steps, showing a 35.7% decrease in reasoning recall and a 94.1% increase in over-prediction in cases where the patch applied successfully but tests failed. These findings emphasize the need to evaluate code agents beyond the correctness of final outputs.

arXiv CS.SE·2d ago
ai-research
SWE-PRBench: Benchmarking AI Code Review Quality Against Pull Request Feedback

The article introduces SWE-PRBench, a benchmark of 350 pull requests for evaluating AI code review quality. The assessments reveal that eight advanced models detect only 15-31% of issues flagged by human reviewers, indicating that AI code review remains far less effective than human review. The study examined three configurations for providing context, finding that models consistently underperformed as context increased. Notably, the best-performing models achieved mean scores between 0.147 and 0.153, with a clear gap to the remaining models, which scored 0.113 or lower. The dataset and evaluation framework are publicly accessible.

arXiv CS.SE·2d ago
ai-research · ai-models
Understanding NPM Malicious Package Detection: A Benchmark-Driven Empirical Analysis

This empirical analysis investigates the detection of malicious packages in the NPM ecosystem, which has become increasingly targeted by software supply chain attacks. A dataset of 6,420 malicious and 7,288 benign packages was created, categorizing behaviors across 11 types and evaluating eight detection tools. Key findings include that GuardDog achieved the highest balance in performance with a 93.32% F1 score, and effective tool combination strategies can reach accuracies of 96.08%. The analysis also highlights that the lack of mandatory scanning allows most malware to operate without evasion techniques.

arXiv CS.SE·1d ago
ai-research
The Specification as Quality Gate: Three Hypotheses on AI-Assisted Code Review

The paper presents three hypotheses on the efficacy of AI-assisted code review, arguing against relying on AI reviewers without external executable specifications. It suggests that correlated errors in homogeneous large language model (LLM) pipelines reinforce rather than mitigate issues, supported by experiments involving models such as Claude and a planted-bug corpus. It further posits that executable specifications can change the complexity of the problem domain, leaving AI review as a tool primarily for residual defect classes. The argument advocates a layered approach to code review: specifications first, then verification, with AI review as the final step.

arXiv CS.SE·2d ago
ai-research
DesignWeaver: Dimensional Scaffolding for Text-to-Image Product Design

The paper presents DesignWeaver, a novel interface designed to assist novice product designers in crafting effective prompts for text-to-image models. A study with 52 novice participants demonstrated that using DesignWeaver allowed them to create longer and more domain-specific prompts, leading to a generation of diverse and innovative product designs. However, the increased complexity of the prompts also raised participants' expectations, which current models struggled to meet. The findings highlight the importance of visual references in design discussions and the potential of AI-based tools to support novice designers.

arXiv CS.AI·2d ago
ai-research
The least surprising chapter of the Manus story is what’s happening right now

The article discusses the ongoing implications of a recent tie-up in the AI landscape, hinting at a reckoning for developers and businesses involved. This is significant for AI developers as it may influence regulatory actions and industry standards moving forward.

TechCrunch - AI·6d ago
ai-news · ai-research
We Rewrote JSONata with AI in a Day, Saved $500K/Year

The article discusses how AI was used to rapidly rewrite JSONata, resulting in significant cost savings of $500k per year. This demonstrates the potential of AI-driven development tools to enhance productivity and reduce operational expenses for software projects.

Simon Willison·5d ago
ai-coding-tools · ai-research · open-source

Latest

  • Visual Studio Code 1.114
    VS Code Blog · 569m ago
  • Improve coding agents’ performance with Gemini API Docs MCP and Agent Skills.
    Google Developers Blog · 323m ago
  • Wherefore Art Thou? Provenance-Guided Automatic Online Debugging with Lumos
    arXiv CS.SE · 3h ago
  • Webscraper: Leverage Multimodal Large Language Models for Index-Content Web Scraping
    arXiv CS.AI · 3h ago
  • GISTBench: Evaluating LLM User Understanding via Evidence-Based Interest Verification
    arXiv CS.AI · 3h ago

  • SciVisAgentBench: A Benchmark for Evaluating Scientific Data Analysis and Visualization Agents
    arXiv CS.AI · 3h ago
  • SyriSign: A Parallel Corpus for Arabic Text to Syrian Arabic Sign Language Translation
    arXiv CS.CL · 3h ago
  • Compiling Code LLMs into Lightweight Executables
    arXiv CS.SE · 3h ago
  • HackRep: A Large-Scale Dataset of GitHub Hackathon Projects
    arXiv CS.SE · 3h ago
  • Dual Perspectives in Emotion Attribution: A Generator-Interpreter Framework for Cross-Cultural Analysis of Emotion in LLMs
    arXiv CS.CL · 3h ago
  • From Consensus to Split Decisions: ABC-Stratified Sentiment in Holocaust Oral Histories
    arXiv CS.CL · 3h ago
  • Practical Feasibility of Sustainable Software Engineering Tools and Techniques
    arXiv CS.SE · 3h ago
  • ChartDiff: A Large-Scale Benchmark for Comprehending Pairs of Charts
    arXiv CS.AI · 3h ago
  • Long-Document QA with Chain-of-Structured-Thought and Fine-Tuned SLMs
    arXiv CS.CL · 3h ago
  • Concept Training for Human-Aligned Language Models
    arXiv CS.CL · 3h ago
  • BayesInsights: Modelling Software Delivery and Developer Experience with Bayesian Networks at Bloomberg
    arXiv CS.SE · 3h ago
  • SkillReducer: Optimizing LLM Agent Skills for Token Efficiency
    arXiv CS.SE · 3h ago
  • Machine Learning in the Wild: Early Evidence of Non-Compliant ML-Automation in Open-Source Software
    arXiv CS.SE · 3h ago
  • EcoScratch: Cost-Effective Multimodal Repair for Scratch Using Execution Feedback
    arXiv CS.SE · 3h ago
  • How and Why Agents Can Identify Bug-Introducing Commits
    arXiv CS.SE · 3h ago
  • Self-Improving Code Generation via Semantic Entropy and Behavioral Consensus
    arXiv CS.SE · 3h ago
  • Sustainable AI Assistance Through Digital Sobriety
    arXiv CS.SE · 3h ago
  • Software Vulnerability Detection Using a Lightweight Graph Neural Network
    arXiv CS.SE · 3h ago
  • Designing FSMs Specifications from Requirements with GPT 4.0
    arXiv CS.SE · 3h ago
  • Logging Like Humans for LLMs: Rethinking Logging via Execution and Runtime Feedback
    arXiv CS.SE · 3h ago
  • Kwame 2.0: Human-in-the-Loop Generative AI Teaching Assistant for Large Scale Online Coding Education in Africa
    arXiv CS.CL · 3h ago
  • CADEL: A Corpus of Administrative Web Documents for Japanese Entity Linking
    arXiv CS.CL · 3h ago
  • SiPaKosa: A Comprehensive Corpus of Canonical and Classical Buddhist Texts in Sinhala and Pali
    arXiv CS.CL · 3h ago
  • MemRerank: Preference Memory for Personalized Product Reranking
    arXiv CS.CL · 3h ago
  • The Thiomi Dataset: A Large-Scale Multimodal Corpus for Low-Resource African Languages
    arXiv CS.CL · 3h ago