World models are eating the AI world. NVIDIA’s Cosmos, OpenAI’s Sora, Google’s Genie 2, and a dozen startups you’ve never heard of are all racing to build AI systems that understand physical reality. Not just language. Not just images. The actual world. Gravity, friction, human movement, object permanence, cause and effect.

The architecture papers get all the attention. Transformer variants, diffusion heads, video tokenizers, latent space representations. But here’s what the papers don’t emphasize: Cosmos was trained on 20 million hours of video. Sora reportedly used hundreds of millions of video clips. Google DeepMind describes Genie 2 as trained on a large-scale video dataset.

The model architecture is the sexy part. The data pipeline is the part that determines whether your project lives or dies.

And that pipeline has a problem that no amount of model innovation can solve: the platforms that host this video data do not want you downloading it at scale.

Why World Models Need Video Data at a Different Scale

Large language models train on text. Billions of tokens, but text is compact. A book is a few megabytes. The entire Wikipedia is under 100GB compressed.

World models train on video. A single hour of 720p video is roughly 1GB. Ten thousand hours is 10TB. A million hours is a petabyte. Twenty million hours, like Cosmos, is 20 petabytes of raw video before any preprocessing, filtering, or encoding.
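The arithmetic above is easy to rerun for your own target. The sketch below assumes the same rough figure of 1 GB per hour of 720p video and decimal units (1 TB = 1000 GB); the function name is illustrative, and real bitrates vary with codec and content.

```python
# Back-of-the-envelope storage estimate.
# Assumption (from the text above): ~1 GB per hour of 720p video.
GB_PER_HOUR_720P = 1.0


def raw_storage_tb(hours: float, gb_per_hour: float = GB_PER_HOUR_720P) -> float:
    """Return raw storage in terabytes, using decimal units (1 TB = 1000 GB)."""
    return hours * gb_per_hour / 1000


for hours in (10_000, 1_000_000, 20_000_000):
    tb = raw_storage_tb(hours)
    print(f"{hours:>12,} hours -> {tb:>10,.0f} TB ({tb / 1000:g} PB)")
```

Changing `gb_per_hour` shows how quickly the total moves: at 1080p or with a less efficient codec, the same hour count can cost several times as much.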

But it’s not just about volume. It’s about diversity:

Data Dimension	Why It Matters	Source Examples
Geography	A world model trained only on US suburban footage fails in Mumbai traffic or Tokyo crowds	YouTube vlogs by country, local news broadcasts, regional social platforms
Camera perspective	Egocentric (first-person) vs. exocentric (third-person) capture completely different physical relationships	GoPro footage, smartphone videos, security cameras, drone footage
Temporal dynamics	Physics understanding requires seeing cause and effect over time	Action sequences, sports, industrial processes, cooking
Domain coverage	Indoor vs. outdoor, natural vs. synthetic, human vs. machine interaction	Home videos, construction sites, factory floors, driving footage
Multimodal grounding	Video must align with text descriptions, audio, or sensor data	Captioned videos, narrated tutorials, ASMR content

No single dataset provides this. ImageNet is static images. Kinetics is short action clips. Ego4D is egocentric but limited in scope. You need to build your own collection pipeline from the open web.

And the open web is protected by the most sophisticated anti-bot systems ever deployed.

The Anti-Bot Wall

YouTube, TikTok, Instagram, Twitter/X, Twitch, Bilibili, Reddit. These platforms collectively host billions of hours of video. They also collectively employ machine learning systems specifically designed to detect and block automated downloading.

The detection mechanisms operate across multiple layers:

Network layer: IP reputation scoring. Is this IP from a known datacenter range? Has it shown suspicious request patterns? Is it on a blocklist?

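Transport layer: TLS fingerprinting. HTTP client libraries produce characteristic TLS handshake signatures, commonly summarized as JA3-style fingerprints. Python requests does not look like Chrome at this layer. Headless Chrome largely does, which is why platforms catch it further up the stack, at the behavioral layer. And the platforms track the common signatures.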

Application layer: Request timing analysis. Humans don’t download videos at perfectly regular intervals. Humans don’t request 100 video metadata pages in 60 seconds. Humans scroll, pause, click, go back.

Behavioral layer: JavaScript execution, cookie persistence, session continuity, mouse movement patterns, scroll depth. The platforms run fingerprinting scripts that collect hundreds of signals and feed them into classifiers trained on billions of human sessions.

The result: a naive Python script using requests.get() from a single IP address tends to be blocked quickly, often within minutes. A headless browser on a cloud server usually lasts longer but gets flagged too. A datacenter proxy pool may last days before the entire IP range is flagged.

You need infrastructure that looks like the actual users who uploaded and watch this content. You need residential proxies.

What Residential Proxies Actually Do

A residential proxy routes your requests through IP addresses assigned to real households by real internet service providers. Verizon in Brooklyn. Comcast in Phoenix. BT in London. Deutsche Telekom in Berlin. NTT in Tokyo.

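To the platform’s IP reputation checks, this traffic looks like a household connection, because the exit IP really is one. A proxy changes only the network layer, though: your client’s TLS fingerprint, request timing, and browser behavior pass through unchanged and still have to look plausible.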

The key capabilities for world model data collection:

Rotation control. Per-request rotation for metadata scraping where you need maximum distribution. Sticky sessions for 10-30 minutes when you’re downloading a multi-part video or maintaining authentication state.

Geographic precision. City-level targeting to collect videos from specific regions. A world model that needs to understand driving in Bangalore should collect videos from Bangalore, not from a generic “Asia” IP pool.

Scale without patterns. A pool of 50 million IPs means millions of requests can be spread so that any single address appears rarely, which leaves per-IP rate limits and reputation scoring little to work with.

Session persistence. Some data sources require login, multi-step navigation, or maintaining state across requests. Sticky sessions let you keep the same IP for a logical session without losing the distribution benefits.

ThorData’s residential proxy network provides exactly this infrastructure. 50 million IPs across 195 countries, with rotation granularity down to individual requests and session stickiness up to 30 minutes. Sub-second latency for real-time pipeline integration.

Architecture: A Production World Model Data Pipeline

Here’s how a serious world model training pipeline looks in practice. Not a toy script. A system designed to collect and process petabytes of video.

Orchestrator (Kubernetes / Airflow)
    │
    ├── Discovery Workers (100-1000 concurrent)
    │       │
    │       ├── SERP API queries (Google Video, YouTube search)
    │       ├── Platform-specific API calls (where available)
    │       ├── Social media monitoring (trending hashtags, viral content)
    │       │
    │       └── ThorData Residential Proxy (per-request rotation)
    │               │
    │               └── Target Platforms (YouTube, TikTok, Instagram, etc.)
    │
    ├── Metadata Store (PostgreSQL / ClickHouse)
    │       │
    │       └── Deduplication, quality scoring, licensing checks
    │
    ├── Download Workers (50-200 concurrent)
    │       │
    │       ├── yt-dlp / ffmpeg pipelines
    │       ├── Format normalization (MP4, resolution standardization)
    │       └── ThorData Residential Proxy (sticky sessions for large files)
    │               │
    │               └── Video CDN endpoints
    │
    ├── Preprocessing Cluster (Spark / Ray)
    │       │
    │       ├── Scene detection and segmentation
    │       ├── Optical flow extraction
    │       ├── Audio separation and transcription
    │       └── Text alignment (captions, comments, descriptions)
    │
    └── Training Store (S3 / GCS with lifecycle policies)
            │
            ├── Hot: Recent data for active training
            ├── Warm: Processed datasets for experiments
            └── Cold: Raw archives for compliance and reprocessing

The proxy layer isn’t an afterthought. It’s the critical path. Every request to every platform goes through it. If the proxy layer fails, the entire pipeline stalls.
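The Metadata Store stage in the diagram is the cheapest place to avoid wasted downloads, since every duplicate caught there is a download that never happens. Here is a minimal sketch of deduplication by video ID, using Python's built-in sqlite3 as a stand-in for PostgreSQL or ClickHouse. The table schema and function names are assumptions for illustration, not part of any particular product.

```python
import sqlite3


def open_metadata_store(path=":memory:"):
    """Create a tiny metadata table keyed by video_id (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS videos (
               video_id TEXT PRIMARY KEY,
               url TEXT NOT NULL,
               source_country TEXT,
               score INTEGER DEFAULT 0
           )"""
    )
    return conn


def add_discovered(conn, videos):
    """Insert discovered videos, skipping IDs already stored.

    Returns the number of rows that were actually new.
    """
    before = conn.total_changes
    conn.executemany(
        "INSERT OR IGNORE INTO videos (video_id, url, source_country, score) "
        "VALUES (:video_id, :url, :source_country, :score)",
        videos,
    )
    conn.commit()
    return conn.total_changes - before


def next_batch(conn, limit=10):
    """Return the highest-scoring video IDs for the download workers."""
    rows = conn.execute(
        "SELECT video_id FROM videos ORDER BY score DESC, video_id LIMIT ?",
        (limit,),
    )
    return [row[0] for row in rows]
```

The primary key does the deduplication, so discovery workers can write concurrently without coordinating, and the download workers only ever see each video once.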

Code: Discovery with Geographic Targeting

import requests
import json
from datetime import datetime
from urllib.parse import quote_plus

THORDATA_PROXY = "http://user:pass@gate.thordata.com:10000"

class WorldModelDataDiscovery:
    """
    Discover training videos with geographic and demographic targeting.
    Critical for world models that need diverse physical environments.
    """
    
    GEO_TARGETS = {
        "urban_driving": ["us", "de", "jp", "cn"],
        "rural_environments": ["br", "in", "ke", "au"],
        "dense_crowds": ["in", "bd", "cn", "id"],
        "industrial_processes": ["de", "cn", "us", "kr"],
        "domestic_indoor": ["us", "gb", "jp", "fr"]
    }
    
    def __init__(self):
        self.session = requests.Session()
        self.discovered = []
    
    def search_with_geo(self, query, scenario_type, max_results=50):
        """
        Execute the same search from multiple geographic perspectives.
        YouTube's search results are personalized by region.
        """
        countries = self.GEO_TARGETS.get(scenario_type, ["us"])
        all_results = []
        
        for country in countries:
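            # NOTE: geo-targeting syntax is provider-specific; check your provider's docs.
            # Appending "&country=..." to a proxy URL does not yield a valid URL, so this
            # placeholder ASSUMES (hypothetically) a country suffix in the proxy username.
            proxy = THORDATA_PROXY.replace("//user:", f"//user-country-{country}:", 1)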
            
            # YouTube search via Invidious or similar API
            # Or Google Video SERP API
            results = self._execute_search(query, proxy, country, max_results)
            all_results.extend(results)
            
            print(f"Collected {len(results)} from {country}")
        
        # Deduplicate by video ID
        unique = {r["video_id"]: r for r in all_results}
        return list(unique.values())
    
    def _execute_search(self, query, proxy, country, max_results):
        """
        Execute search through residential proxy from specific country.
        """
        self.session.proxies = {"http": proxy, "https": proxy}
        
        # Using a SERP API or direct platform API
        # Example with SerpApi Google Video search
        params = {
            "engine": "google",
            "q": f"{query} video",
            "tbm": "vid",
            "num": max_results,
            "gl": country,  # Geographic location parameter
            "api_key": "your_serp_api_key"
        }
        
        try:
            response = self.session.get(
                "https://serpapi.com/search",
                params=params,
                timeout=30
            )
            response.raise_for_status()
            
            return self._parse_video_results(response.json(), country)
            
        except requests.exceptions.RequestException as e:
            print(f"Search failed for {country}: {e}")
            return []
    
    def _parse_video_results(self, data, country):
        """Extract structured metadata from search results."""
        videos = []
        
        for result in data.get("video_results", []):
            video = {
                "video_id": self._extract_id(result.get("link", "")),
                "title": result.get("title", ""),
                "url": result.get("link", ""),
                "duration": self._parse_duration(result.get("duration", "0:00")),
                "thumbnail": result.get("thumbnail", ""),
                "source_country": country,
                "platform": self._detect_platform(result.get("link", "")),
                "discovered_at": datetime.utcnow().isoformat(),
                "query_match_score": self._relevance_score(result)
            }
            videos.append(video)
        
        return videos
    
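    def _extract_id(self, url):
        """Extract platform-specific video ID; fall back to the URL itself."""
        # Local import keeps this method self-contained
        from urllib.parse import urlparse, parse_qs

        parsed = urlparse(url)
        host = parsed.netloc.lower()
        if host.endswith("youtu.be"):
            # Short links carry the ID as the path: https://youtu.be/<id>
            return parsed.path.lstrip("/") or url
        if host.endswith("youtube.com"):
            # Watch links carry the ID in the "v" query parameter
            ids = parse_qs(parsed.query).get("v")
            if ids:
                return ids[0]
        return url  # Fallback: the full URL still works as a dedup key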
    
    def _parse_duration(self, duration_str):
        """Convert duration to seconds."""
        parts = duration_str.split(":")
        if len(parts) == 2:
            return int(parts[0]) * 60 + int(parts[1])
        elif len(parts) == 3:
            return int(parts[0]) * 3600 + int(parts[1]) * 60 + int(parts[2])
        return 0
    
    def _detect_platform(self, url):
        url_lower = url.lower()
        if "youtube" in url_lower or "youtu.be" in url_lower:
            return "youtube"
        elif "tiktok" in url_lower:
            return "tiktok"
        elif "instagram" in url_lower:
            return "instagram"
        elif "twitter" in url_lower or "x.com" in url_lower:
            return "twitter"
        return "unknown"
    
    def _relevance_score(self, result):
        """
        Score how likely this video is useful for world model training.
        Uses title keywords only: boosts physical-action and first-person terms.
        Duration is not considered here.
        """
        score = 0
        title = result.get("title", "").lower()
        
        # Boost for action-oriented content
        action_keywords = ["driving", "walking", "cooking", "playing", 
                          "construction", "factory", "sports", "tutorial"]
        for kw in action_keywords:
            if kw in title:
                score += 10
        
        # Boost for first-person indicators
        if any(x in title for x in ["pov", "gopro", "first person", "walking around"]):
            score += 15
        
        return score

# Usage: Collect diverse urban driving footage
discovery = WorldModelDataDiscovery()
videos = discovery.search_with_geo(
    query="driving through city traffic",
    scenario_type="urban_driving",
    max_results=100
)
print(f"Discovered {len(videos)} unique videos across {len(set(v['source_country'] for v in videos))} countries")

Code: Download with Session Management

import yt_dlp
import os
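from datetime import datetime

# Same gateway placeholder as in the discovery code; repeated so this block runs on its own
THORDATA_PROXY = "http://user:pass@gate.thordata.com:10000"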

class WorldModelVideoDownloader:
    """
    Download videos with sticky sessions for large files
    and automatic quality selection for training efficiency.
    """
    
    def __init__(self, output_base="./training_data"):
        self.output_base = output_base
        os.makedirs(output_base, exist_ok=True)
        self.download_stats = {
            "attempted": 0,
            "successful": 0,
            "failed": 0,
            "bytes_downloaded": 0
        }
    
    def download_for_training(self, video_metadata, quality="720"):
        """
        Download video optimized for world model training.
        720p provides sufficient detail without excessive storage.
        """
        video_id = video_metadata["video_id"]
        scenario = video_metadata.get("scenario_type", "general")
        country = video_metadata.get("source_country", "unknown")
        
        # Organize by scenario and geography for balanced sampling
        output_dir = os.path.join(
            self.output_base,
            scenario,
            country,
            datetime.now().strftime("%Y-%m")
        )
        os.makedirs(output_dir, exist_ok=True)
        
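        # Sticky session for this download.
        # Large files may take minutes; session persistence prevents mid-download blocks.
        # NOTE: session syntax is provider-specific; check your provider's docs.
        # Appending "&session=..." to a proxy URL does not yield a valid URL, so this
        # placeholder ASSUMES (hypothetically) a session suffix in the proxy username.
        session_tag = "".join(c for c in video_id if c.isalnum())[:8]
        sticky_proxy = THORDATA_PROXY.replace("//user:", f"//user-session-dl{session_tag}:", 1)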
        
        ydl_opts = {
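            # Prefer separate video+audio streams (merged by ffmpeg); fall back to a
            # single pre-merged file. 'best' alone only considers pre-merged formats.
            'format': f'bestvideo[height<={quality}]+bestaudio/best[height<={quality}]',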
            'proxy': sticky_proxy,
            'outtmpl': os.path.join(output_dir, '%(id)s_%(title)s.%(ext)s'),
            
            # Metadata extraction for training alignment
            'writethumbnail': True,
            'writeinfojson': True,
            'writesubtitles': True,
            'writeautomaticsub': True,
            
            # Post-processing for training format
            'postprocessors': [
                {
                    'key': 'FFmpegVideoConvertor',
                    'preferedformat': 'mp4',
                }
            ],
            
            # Reliability
            'retries': 5,
            'fragment_retries': 5,
            'skip_unavailable_fragments': True,
            
            'quiet': True,
        }
        
        self.download_stats["attempted"] += 1
        
        try:
            with yt_dlp.YoutubeDL(ydl_opts) as ydl:
                info = ydl.extract_info(video_metadata["url"], download=True)
                
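                filepath = ydl.prepare_filename(info)
                # prepare_filename() reports the pre-conversion name; once
                # FFmpegVideoConvertor has run, the file on disk ends in .mp4
                if not os.path.exists(filepath):
                    filepath = os.path.splitext(filepath)[0] + ".mp4"
                file_size = os.path.getsize(filepath) if os.path.exists(filepath) else 0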
                
                self.download_stats["successful"] += 1
                self.download_stats["bytes_downloaded"] += file_size
                
                return {
                    "success": True,
                    "filepath": filepath,
                    "metadata": {
                        "duration": info.get("duration"),
                        "resolution": info.get("resolution"),
                        "fps": info.get("fps"),
                        "file_size": file_size,
                        "source": video_metadata
                    }
                }
                
        except Exception as e:
            self.download_stats["failed"] += 1
            return {
                "success": False,
                "error": str(e),
                "video_id": video_id
            }
    
    def get_stats(self):
        """Return download statistics for pipeline monitoring."""
        total = self.download_stats["attempted"]
        if total == 0:
            return self.download_stats
        
        return {
            **self.download_stats,
            "success_rate": self.download_stats["successful"] / total,
            "average_size": (self.download_stats["bytes_downloaded"] / 
                           max(self.download_stats["successful"], 1))
        }

The Geographic Diversity Problem

Here’s a concrete example of why this matters. NVIDIA’s Cosmos paper emphasizes that their training data includes “real-world environments, human interactions, and physical dynamics.”

But consider: a world model trained primarily on US and European video will struggle with:

Driving on the left side of the road (UK, Japan, Australia, India)
Dense motorcycle traffic in Hanoi or Jakarta
Unpaved roads in rural Africa or South America
Snow and ice driving in Scandinavia vs. monsoon conditions in Mumbai
Indoor layouts common in Japanese apartments vs. American suburban homes

Without geographic targeting in your data collection, you’re building a world model that understands a narrow slice of the world. Residential proxies with country and city-level targeting help because search results and recommendations are localized: querying from an IP in the target region surfaces the content that viewers there actually see.

ThorData’s geographic targeting goes down to the city level across 195 countries. Need videos from São Paulo specifically, not just “Brazil”? That’s possible. Need first-person footage from Tokyo’s Shibuya crossing? Targetable.
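Targeting only helps if you measure what you actually collected. The sketch below computes each country's share of total footage from the metadata dictionaries produced during discovery (the `source_country` and `duration` fields) and flags target countries that fall short. The 10% default threshold and the function names are assumptions chosen for illustration.

```python
from collections import Counter


def country_shares(videos):
    """Map each source_country to its share of total duration (0..1)."""
    seconds = Counter()
    for video in videos:
        seconds[video.get("source_country", "unknown")] += video.get("duration", 0)
    total = sum(seconds.values())
    if total == 0:
        return {}
    return {country: secs / total for country, secs in seconds.items()}


def underrepresented(videos, targets, min_share=0.10):
    """Return target countries whose duration share is below min_share, sorted."""
    shares = country_shares(videos)
    return sorted(c for c in targets if shares.get(c, 0.0) < min_share)


# Usage with toy metadata
collected = [
    {"source_country": "us", "duration": 900},
    {"source_country": "jp", "duration": 90},
    {"source_country": "de", "duration": 10},
]
print(underrepresented(collected, ["us", "jp", "de", "in"]))
```

Weighting by duration rather than video count matters here: a thousand ten-second clips from one country contribute less training signal than a hundred ten-minute drives from another.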

Quality Control: Not All Video Is Equal

Raw collection is just the start. For world model training, you need filtering:

Filter	Purpose	Implementation
Motion blur detection	Blurry frames teach nothing about physics	Laplacian variance threshold
Static scene removal	No temporal dynamics = no causal learning	Frame difference analysis
Text overlay detection	Watermarks and subtitles contaminate visual learning	OCR-based filtering
Resolution minimum	Too low-res loses fine physical details	480p floor, 720p preferred
Duration sweet spot	Too short lacks context; too long is inefficient	30 seconds to 5 minutes
Audio-visual alignment	Mismatched audio and video breaks multimodal learning	Sync detection algorithms
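The resolution and duration rows need no pixel access at all, so they can run first and cheaply. This sketch checks the `height` and `duration` fields that yt-dlp writes into its info JSON; the thresholds come from the table above, while the function name and return shape are illustrative.

```python
MIN_HEIGHT = 480       # resolution floor from the table
MIN_SECONDS = 30       # duration window from the table
MAX_SECONDS = 5 * 60


def passes_metadata_filter(info):
    """Check the height and duration fields of a yt-dlp-style info dict.

    Returns (keep, reason). Missing fields are treated as failing.
    """
    height = info.get("height") or 0
    duration = info.get("duration") or 0
    if height < MIN_HEIGHT:
        return False, "below resolution floor"
    if not (MIN_SECONDS <= duration <= MAX_SECONDS):
        return False, "outside duration window"
    return True, "ok"
```

Returning the reason alongside the verdict makes it easy to log why footage was dropped, which is useful when tuning the thresholds later.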

These filters run after download but before entering your training store. The pipeline architecture matters because you’re processing 100x the data you’ll actually keep.
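The blur and static-scene rows from the table can be sketched in a few lines. The version below is pure Python on small grayscale frames (2D lists of intensities) so it runs without dependencies; a production pipeline would more likely use NumPy or OpenCV on decoded frames. Both thresholds are placeholder assumptions that need tuning against your own footage.

```python
from statistics import pvariance


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over interior pixels.

    `gray` is a 2D list of grayscale intensities (at least 3x3).
    Low variance means few sharp edges, which suggests blur.
    """
    height, width = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, height - 1)
        for x in range(1, width - 1)
    ]
    return pvariance(responses)


def mean_abs_diff(frame_a, frame_b):
    """Mean absolute per-pixel difference between two equally sized frames."""
    total = sum(
        abs(a - b)
        for row_a, row_b in zip(frame_a, frame_b)
        for a, b in zip(row_a, row_b)
    )
    return total / (len(frame_a) * len(frame_a[0]))


def is_blurry(gray, threshold=100.0):
    """True if the Laplacian variance falls below the (placeholder) threshold."""
    return laplacian_variance(gray) < threshold


def is_static(frames, threshold=2.0):
    """True if no consecutive frame pair differs by more than `threshold` on average."""
    return all(mean_abs_diff(a, b) <= threshold for a, b in zip(frames, frames[1:]))
```

In practice you would sample a handful of frames per clip rather than scoring every frame, since both checks only need a coarse answer.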

Cost Modeling: From Experiment to Production

Stage	Videos/Month	Proxy Traffic	Storage	Total Monthly Cost	Timeline
Research prototype	10K	500 GB	2 TB	$800	Months 1-3
Pilot model	100K	5 TB	20 TB	$3,500	Months 4-6
Production v1	1M	50 TB	200 TB	$15,000	Months 7-12
Scale model	10M	500 TB	2 PB	$80,000	Year 2

Proxy costs are typically 15-20% of total infrastructure. The alternative—getting blocked, missing data, building biased models—is far more expensive in lost training time and model performance.
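Dividing the table through by video count makes the scaling story easier to see. The sketch below uses only the figures from the table above, in decimal units (1 TB = 1000 GB); the function name is illustrative.

```python
STAGES = [
    # (stage, videos_per_month, proxy_traffic_gb, monthly_cost_usd)
    ("Research prototype", 10_000, 500, 800),
    ("Pilot model", 100_000, 5_000, 3_500),
    ("Production v1", 1_000_000, 50_000, 15_000),
    ("Scale model", 10_000_000, 500_000, 80_000),
]


def unit_economics(videos, traffic_gb, cost_usd):
    """Return (proxy MB per video, total USD per video), decimal units."""
    return traffic_gb * 1000 / videos, cost_usd / videos


for name, videos, traffic_gb, cost in STAGES:
    mb, usd = unit_economics(videos, traffic_gb, cost)
    print(f"{name:<20} {mb:>5.0f} MB/video  ${usd:.3f}/video")
```

Proxy traffic stays flat at 50 MB per video across all four stages, while total cost per video falls from $0.08 to $0.008, a tenfold drop as volume grows a thousandfold.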

Conclusion

World model training is the next frontier of AI. The architectures are public. The compute is available. The differentiator is data. Not just volume, but diversity, quality, and geographic coverage.

The platforms that host this data are not your friends. They are businesses with their own priorities, and they will block you without hesitation. Residential proxies are not a workaround. They are the infrastructure layer that makes ethical, large-scale data collection possible.

Your world model is only as good as the world it has seen. Make sure it sees the whole world.

Start building your world model data pipeline with ThorData residential proxies