PhiloLabs benchmark shows AI agents at 31% on video post-production
PhiloLabs has introduced AgenticVBench, a benchmark for evaluating AI agents on real-world video post-production tasks. Developed with 20 industry experts, it assesses agents across four task families: assembly, repair, sequencing, and repurpose. The best AI agent reached only 31% accuracy, compared with roughly 89% for human experts. The study also finds that the "harness" (the scaffolding around the model) significantly affects agent performance, sometimes as much as the model itself.
Key Takeaways
- AgenticVBench evaluates 7 frontier models on 100 expert-authored tasks across 4 video post-production task families.
- The top score came from GPT-5.5 · Codex at 31.0% ± 4.0, far below human experts at 88.5% average.
- The four task families are assembly, repair, sequencing, and repurpose, with human scores ranging from 81% to 95%.
- A harness swap changed GPT-5.5’s Assembly score by 20 percentage points, from 18% with OpenClaw to 38% with Codex.
- The benchmark tasks were written by 20 industry experts averaging 6 years of post-production experience.
Why It Matters
AgenticVBench shows that today’s best AI agents still lag far behind human editors on real post-production work, with the top agent at 31% versus 88.5% on average for experts. It also complicates evaluation: the same GPT-5.5 model moved 20 percentage points on Assembly depending on the harness, so model-only leaderboards miss a major part of performance. For streaming video workflows, that means task-level benchmarking has to account for both model quality and scaffolding. The next signal to watch is whether future leaderboard updates narrow the 31% vs. 88.5% gap across the four task families.
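As a rough illustration of why a model-only leaderboard hides the harness effect, the sketch below keys each score by model, harness, and task family. The `Score` record and the `harness_spread` helper are hypothetical names invented for this example, not part of AgenticVBench. The only data used are the figures quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    """One leaderboard row (hypothetical layout, not the benchmark's schema)."""
    model: str
    harness: str
    family: str
    accuracy_pct: float


# The two Assembly results quoted in this summary; no other rows are assumed.
scores = [
    Score("GPT-5.5", "OpenClaw", "Assembly", 18.0),
    Score("GPT-5.5", "Codex", "Assembly", 38.0),
]


def harness_spread(rows, model, family):
    """Return max minus min accuracy across harnesses for one model and family."""
    vals = [r.accuracy_pct for r in rows if r.model == model and r.family == family]
    if len(vals) < 2:
        return 0.0  # fewer than two harnesses: nothing to compare
    return max(vals) - min(vals)


# Overall figures from the summary: top agent vs. human expert average.
TOP_AGENT_PCT = 31.0
HUMAN_AVG_PCT = 88.5

spread = harness_spread(scores, "GPT-5.5", "Assembly")
gap = HUMAN_AVG_PCT - TOP_AGENT_PCT

print(f"Harness spread on Assembly: {spread:.1f} percentage points")
print(f"Agent-to-human gap: {gap:.1f} percentage points")
```

Collapsing the two rows into a single "GPT-5.5" entry would force a choice between 18% and 38%, which is the information a model-only ranking loses.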
Read full article at agenticvbench.com
