PhiloLabs benchmark shows AI agents at 31% on video post-production

PhiloLabs introduced AgenticVBench, a new benchmark for evaluating AI agents on real-world video post-production tasks. Developed with 20 industry experts, it assesses AI models across four task families: assembly, repair, sequencing, and repurpose. The best AI agent scored only 31%, compared with about 89% for human experts. The study also finds that the "harness" (the scaffolding around the model) can affect agent performance as much as the model itself.

Key Takeaways

AgenticVBench evaluates 7 frontier models on 100 expert-authored tasks across 4 video post-production task families.
The top score came from GPT-5.5 · Codex at 31.0% ± 4.0, far below human experts at 88.5% average.
The four task families are assembly, repair, sequencing, and repurpose, with human scores ranging from 81% to 95%.
A harness swap changed GPT-5.5’s Assembly score by 20 percentage points, from 18% with OpenClaw to 38% with Codex.
The benchmark tasks were written by 20 industry experts averaging 6 years of post-production experience.

Why It Matters

AgenticVBench shows that today’s best AI agents still lag far behind human editors on real post-production work, with the top agent at 31% versus 88.5% for experts. It also complicates evaluation: the same GPT-5.5 model moved 20 percentage points on Assembly depending on the harness, so model-only leaderboards miss a major part of performance. For streaming video workflows, that means task-level benchmarking has to account for both model quality and scaffolding. The next signal to watch is whether future leaderboard updates narrow the 31% vs. 88.5% gap across the four task families.
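To make the harness point concrete, here is a minimal Python sketch of how scores could be keyed by (model, harness, task family) rather than by model alone. The `Score` record and the `harness_delta` helper are hypothetical names invented for illustration and are not part of AgenticVBench; the only figures used are the ones reported above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    """One reported result. This record layout is an assumption for illustration."""
    model: str
    harness: str
    family: str
    percent: float


def harness_delta(scores, model, family, harness_a, harness_b):
    """Percentage-point change when swapping harness_a for harness_b."""
    lookup = {(s.model, s.harness, s.family): s.percent for s in scores}
    return lookup[(model, harness_b, family)] - lookup[(model, harness_a, family)]


# Figures quoted in the article: GPT-5.5 on the Assembly family.
scores = [
    Score("GPT-5.5", "OpenClaw", "assembly", 18.0),
    Score("GPT-5.5", "Codex", "assembly", 38.0),
]

delta = harness_delta(scores, "GPT-5.5", "assembly", "OpenClaw", "Codex")

# Overall gap between the human expert average and the top agent.
human_avg = 88.5
top_agent = 31.0
gap = human_avg - top_agent

print(f"Harness swap on Assembly: {delta:+.1f} points")
print(f"Human vs. top agent gap: {gap:.1f} points")
```

Keying results on the harness as well as the model keeps the two Assembly scores from collapsing into a single "GPT-5.5" row, which is the distinction a model-only leaderboard loses.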

Read full article at agenticvbench.com