SWE-PRBench: Benchmarking AI Code Review Quality Against Pull Request Feedback

The article introduces SWE-PRBench, a benchmark of 350 pull requests for evaluating the quality of AI code review. Across eight advanced models, the evaluation shows that models detect only 15-31% of the issues flagged by human reviewers, indicating that AI code review remains substantially less effective than human review. The study also examined three configurations for providing context to the models and found that performance consistently degraded as the amount of context increased. The best-performing models achieved mean scores between 0.147 and 0.153, with a clear gap to the remaining models, which scored 0.113 or lower. The dataset and evaluation framework are publicly available.
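The 15-31% detection figure can be read as a recall-style metric: the fraction of human-flagged issues that a model's review also surfaces. Below is a minimal illustrative sketch; the issue identifiers and set-based matching are hypothetical, since the benchmark's actual matching procedure is not described here.

```python
# Hedged sketch: SWE-PRBench's exact scoring is not specified in this
# summary. This illustrates a recall-style detection rate -- the share
# of human-reviewer issues also flagged by the model.

def detection_rate(human_issues, model_issues):
    """Fraction of human-flagged issues that the model also flags."""
    if not human_issues:
        return 0.0
    matched = sum(1 for issue in human_issues if issue in model_issues)
    return matched / len(human_issues)

# Hypothetical example: 8 human-flagged issues, the model surfaces 2
# of them, giving 0.25 -- inside the 15-31% range reported.
human = {"sql-injection", "missing-null-check", "race-condition",
         "off-by-one", "unclosed-file", "dead-code",
         "inconsistent-naming", "missing-test"}
model = {"sql-injection", "off-by-one", "unrelated-style-nit"}
print(detection_rate(human, model))  # -> 0.25
```

Note that extra model findings not in the human set (false positives) do not lower this recall-style number; a full evaluation would also need a precision-like component, which the 0.147-0.153 mean scores may fold in.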