TikTok’s MLT-Dedup cuts repetition 91% with 5x larger index
TikTok researchers have developed MLT-Dedup, a framework for efficient large-scale online video deduplication. The system uses multi-level video representations (ML-VE) for scalable candidate retrieval and a differential feature-enhanced similarity module (DiF-SiM) for precise spatial-temporal matching. In online A/B tests, MLT-Dedup reduced repetition rates by 91% at 90% precision, and its sparse retrieval design made the retrieval index five times larger under the same resources.
Key Takeaways
- MLT-Dedup uses ML-VE to generate both clip-level embeddings for retrieval and frame-level embeddings for matching.
- Online A/B tests reported a 91% reduction in repetition rate at 90% precision for the full ML-VE + DiF-SiM stack.
- The sparse retrieval design increased retrieval index size by 5x, allowing broader candidate coverage under fixed resources.
- DiF-SiM adds differential features and learned similarity to localize duplicated temporal segments before making deduplication decisions.
- On the VCSL benchmark, DiF-SiM reached a 74.31 F-score, ahead of RTR + pre-training at 70.73.
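The retrieve-then-verify split described in the takeaways can be illustrated with a minimal sketch. It is not TikTok's implementation: plain cosine similarity stands in for both the sparse retrieval index and the learned DiF-SiM module, and every name, the `top_k` value, and the 0.9 frame threshold are assumptions chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_candidates(query_clip, index, top_k=2):
    """Stage 1 (hypothetical): rank indexed videos by clip-level similarity.

    `index` maps a video id to one clip-level embedding. A production
    system would use an approximate nearest-neighbor index instead of
    this exhaustive scan.
    """
    scored = [(cosine(query_clip, emb), vid) for vid, emb in index.items()]
    scored.sort(reverse=True)
    return [vid for _, vid in scored[:top_k]]


def frame_similarity_matrix(query_frames, cand_frames):
    """Stage 2 input: pairwise frame-to-frame similarity matrix."""
    return [[cosine(q, c) for c in cand_frames] for q in query_frames]


def matched_frame_fraction(sim, threshold=0.9):
    """Share of query frames with at least one candidate frame at or
    above `threshold`. A simple stand-in for a learned similarity module."""
    if not sim:
        return 0.0
    hits = sum(1 for row in sim if row and max(row) >= threshold)
    return hits / len(sim)


# Toy data: three indexed videos with 2-d clip embeddings.
index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1]}
candidates = retrieve_candidates([1.0, 0.0], index, top_k=2)

# Frame-level check of the query against one candidate.
sim = frame_similarity_matrix(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],  # query frames
    [[1.0, 0.0], [0.0, 1.0]],              # candidate frames
)
fraction = matched_frame_fraction(sim)
```

The point of the split is cost: the cheap clip-level pass narrows a large index to a few candidates, so the more expensive frame-level comparison runs only on those.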
Why It Matters
The immediate effect is practical: MLT-Dedup lowers duplicate-video repetition while storing more content in the retrieval index, which matters when dedup systems operate under tight memory budgets. The broader point is architectural: TikTok is not relying on denser embeddings alone, but splitting retrieval and verification across clip-level and frame-level representations, then using temporal overlap thresholds to avoid false matches on partial copies. For streaming platforms, that’s a concrete template for large-scale content filtering. What to watch next is whether the same ML-VE and DiF-SiM split holds up as index TTL grows and candidate pools get larger in production.
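A temporal overlap threshold of the kind mentioned above might look like the following sketch. The function names, the segment format (start and end times in seconds within the query video), and the 0.5 default threshold are illustrative assumptions, not values reported for MLT-Dedup.

```python
def merged_length(segments):
    """Total duration covered by (start, end) segments, counting
    overlapping stretches only once."""
    total = 0.0
    cur_start = cur_end = None
    for start, end in sorted(segments):
        if cur_end is None or start > cur_end:
            # Close the previous merged run before starting a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Segment overlaps the current run: extend it.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def is_duplicate(segments, query_duration, min_overlap=0.5):
    """Flag a query video as a duplicate only when the matched segments
    cover at least `min_overlap` of its duration (hypothetical rule)."""
    if query_duration <= 0:
        return False
    return merged_length(segments) / query_duration >= min_overlap


# A 60-second video with 20 seconds of matched footage is a partial copy
# and stays below the threshold; 40 seconds of matched footage crosses it.
partial = is_duplicate([(0, 10), (5, 15), (20, 25)], query_duration=60)
mostly_copied = is_duplicate([(0, 40)], query_duration=60)
```

Requiring a minimum covered fraction keeps a short shared clip, such as a reused intro, from marking an otherwise original video as a duplicate.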
Read full article at openreview.net