AI Native Medhavi

AI Native Developer News

AI development tools, research, and industry news — clustered and ranked by importance.

Running local models on Macs gets faster with Ollama's MLX support

Ollama has added support for MLX, Apple's open-source machine-learning framework, improving its runtime for running large language models locally. The update improves caching performance and adds support for Nvidia's NVFP4 compression format, which reduces model memory usage. These changes are expected to significantly boost performance on Macs with Apple Silicon chips (M1 or later). The surge of interest in local models, exemplified by OpenClaw's rapid rise past 300,000 stars on GitHub, underscores the growing trend of running machine-learning workloads on local hardware.

Ars Technica - AI · 7h ago
ai-frameworks · ai-models
llm-all-models-async 0.1

The post announces the release of llm-all-models-async 0.1, a new version aimed at improving the performance and ease of use of language models in asynchronous environments. It introduces an improved API for smoother operation when working with multiple models at once. No specific performance figures are given; the focus is on helping developers make better use of language models in their applications and on streamlining AI-assisted development workflows.

Simon Willison · 9h ago
ai-models · ai-frameworks
GISTBench: Evaluating LLM User Understanding via Evidence-Based Interest Verification

The article introduces GISTBench, a benchmark for evaluating how well Large Language Models (LLMs) understand user interests in recommendation systems from interaction histories. It features two new metric families: Interest Groundedness (IG), which includes precision and recall components that penalize hallucination while rewarding coverage, and Interest Specificity (IS), which assesses the distinctiveness of LLM-generated user profiles. The authors release a synthetic dataset based on real user interactions, containing implicit and explicit engagement signals and validated against user surveys. An evaluation of eight open-weight LLMs, ranging from 7B to 120B parameters, uncovers significant performance bottlenecks, particularly in counting and attributing engagement signals.

arXiv CS.AI · 2h ago
ai-research · ai-models
SciVisAgentBench: A Benchmark for Evaluating Scientific Data Analysis and Visualization Agents

The article presents SciVisAgentBench, a new benchmark for evaluating scientific data analysis and visualization agents, developed in response to the need for a principled evaluation framework in this rapidly evolving field. This benchmark includes 108 expert-crafted cases covering multiple scenarios and is structured around four dimensions: application domain, data type, complexity level, and visualization operation. It introduces a multimodal evaluation pipeline that combines human judgment with various deterministic evaluation methods. A validity study involving 12 SciVis experts was conducted to explore the agreement between human and LLM judges, establishing initial baselines and identifying capability gaps in current SciVis agents.

arXiv CS.AI · 2h ago
ai-research · ai-models

Latest

  • Visual Studio Code 1.114
    VS Code Blog · 659m ago
  • Improve coding agents’ performance with Gemini API Docs MCP and Agent Skills.
    Google Developers Blog · 413m ago
  • Wherefore Art Thou? Provenance-Guided Automatic Online Debugging with Lumos
    arXiv CS.SE · 2h ago
  • Webscraper: Leverage Multimodal Large Language Models for Index-Content Web Scraping
    arXiv CS.AI · 2h ago
  • GISTBench: Evaluating LLM User Understanding via Evidence-Based Interest Verification
    arXiv CS.AI · 2h ago
  • SciVisAgentBench: A Benchmark for Evaluating Scientific Data Analysis and Visualization Agents
    arXiv CS.AI · 2h ago
  • SyriSign: A Parallel Corpus for Arabic Text to Syrian Arabic Sign Language Translation
    arXiv CS.CL · 2h ago
  • Compiling Code LLMs into Lightweight Executables
    arXiv CS.SE · 2h ago
  • HackRep: A Large-Scale Dataset of GitHub Hackathon Projects
    arXiv CS.SE · 2h ago
  • Dual Perspectives in Emotion Attribution: A Generator-Interpreter Framework for Cross-Cultural Analysis of Emotion in LLMs
    arXiv CS.CL · 2h ago
  • From Consensus to Split Decisions: ABC-Stratified Sentiment in Holocaust Oral Histories
    arXiv CS.CL · 2h ago
  • Practical Feasibility of Sustainable Software Engineering Tools and Techniques
    arXiv CS.SE · 2h ago
  • ChartDiff: A Large-Scale Benchmark for Comprehending Pairs of Charts
    arXiv CS.AI · 2h ago
  • Long-Document QA with Chain-of-Structured-Thought and Fine-Tuned SLMs
    arXiv CS.CL · 2h ago
  • Concept Training for Human-Aligned Language Models
    arXiv CS.CL · 2h ago
  • BayesInsights: Modelling Software Delivery and Developer Experience with Bayesian Networks at Bloomberg
    arXiv CS.SE · 2h ago
  • SkillReducer: Optimizing LLM Agent Skills for Token Efficiency
    arXiv CS.SE · 2h ago
  • Machine Learning in the Wild: Early Evidence of Non-Compliant ML-Automation in Open-Source Software
    arXiv CS.SE · 2h ago
  • EcoScratch: Cost-Effective Multimodal Repair for Scratch Using Execution Feedback
    arXiv CS.SE · 2h ago
  • How and Why Agents Can Identify Bug-Introducing Commits
    arXiv CS.SE · 2h ago
  • Self-Improving Code Generation via Semantic Entropy and Behavioral Consensus
    arXiv CS.SE · 2h ago
  • Sustainable AI Assistance Through Digital Sobriety
    arXiv CS.SE · 2h ago
  • Software Vulnerability Detection Using a Lightweight Graph Neural Network
    arXiv CS.SE · 2h ago
  • Designing FSMs Specifications from Requirements with GPT 4.0
    arXiv CS.SE · 2h ago
  • Logging Like Humans for LLMs: Rethinking Logging via Execution and Runtime Feedback
    arXiv CS.SE · 2h ago
  • Kwame 2.0: Human-in-the-Loop Generative AI Teaching Assistant for Large Scale Online Coding Education in Africa
    arXiv CS.CL · 2h ago
  • CADEL: A Corpus of Administrative Web Documents for Japanese Entity Linking
    arXiv CS.CL · 2h ago
  • SiPaKosa: A Comprehensive Corpus of Canonical and Classical Buddhist Texts in Sinhala and Pali
    arXiv CS.CL · 2h ago
  • MemRerank: Preference Memory for Personalized Product Reranking
    arXiv CS.CL · 2h ago
  • The Thiomi Dataset: A Large-Scale Multimodal Corpus for Low-Resource African Languages
    arXiv CS.CL · 2h ago