Training World Models at Scale: How Residential Proxies Enable Petabyte-Scale Video Data Collection

World models are eating the AI world. NVIDIA’s Cosmos, OpenAI’s Sora, Google’s Genie 2, and a dozen startups you’ve never heard of are all racing to build AI systems that understand physical reality. Not just language. Not just images. The actual world. Gravity, friction, human movement, object permanence, cause and effect.
The architecture papers get all the attention. Transformer variants, diffusion heads, video tokenizers, latent space representations. But here’s what the papers don’t emphasize: Cosmos was trained on 20 million hours of video. Sora reportedly used hundreds of millions of video clips. Genie 2 reportedly drew on gameplay footage from thousands of sources.
The model architecture is the sexy part. The data pipeline is the part that determines whether your project lives or dies.
And that pipeline has a problem that no amount of model innovation can solve: the platforms that host this video data do not want you downloading it at scale.
Large language models train on text. Billions of tokens, but text is compact. A book is a few megabytes. The entire Wikipedia is under 100GB compressed.
World models train on video. A single hour of 720p video is roughly 1GB. Ten thousand hours is 10TB. A million hours is a petabyte. Twenty million hours, like Cosmos, is 20 petabytes of raw video before any preprocessing, filtering, or encoding.
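The arithmetic behind those figures is easy to reproduce. The sketch below assumes the same rough rate used in the text (about 1 GB per hour of 720p video) and decimal units; real bitrates vary with codec and content, so treat the output as an order-of-magnitude estimate. The function name is illustrative.

```python
# Back-of-the-envelope storage estimate, assuming ~1 GB per hour of 720p video
# (the figure used in the text; real bitrates vary with codec and content).
GB_PER_HOUR_720P = 1.0


def raw_storage_tb(hours, gb_per_hour=GB_PER_HOUR_720P):
    """Return raw storage in terabytes, using decimal units (1 TB = 1000 GB)."""
    return hours * gb_per_hour / 1000


for hours in (10_000, 1_000_000, 20_000_000):
    tb = raw_storage_tb(hours)
    print(f"{hours:>12,} hours -> {tb:>10,.0f} TB ({tb / 1000:,.2f} PB)")
```

Changing `gb_per_hour` shows how quickly the totals move if you collect at a higher resolution.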
But it’s not just about volume. It’s about diversity:
| Data Dimension | Why It Matters | Source Examples |
| --- | --- | --- |
| Geography | A world model trained only on US suburban footage fails in Mumbai traffic or Tokyo crowds | YouTube vlogs by country, local news broadcasts, regional social platforms |
| Camera perspective | Egocentric (first-person) vs. exocentric (third-person) capture completely different physical relationships | GoPro footage, smartphone videos, security cameras, drone footage |
| Temporal dynamics | Physics understanding requires seeing cause and effect over time | Action sequences, sports, industrial processes, cooking |
| Domain coverage | Indoor vs. outdoor, natural vs. synthetic, human vs. machine interaction | Home videos, construction sites, factory floors, driving footage |
| Multimodal grounding | Video must align with text descriptions, audio, or sensor data | Captioned videos, narrated tutorials, ASMR content |
No single dataset provides this. ImageNet is static images. Kinetics is short action clips. Ego4D is egocentric but limited in scope. You need to build your own collection pipeline from the open web.
And the open web is protected by the most sophisticated anti-bot systems ever deployed.
YouTube, TikTok, Instagram, Twitter/X, Twitch, Bilibili, Reddit. These platforms collectively host billions of hours of video. They also collectively employ machine learning systems specifically designed to detect and block automated downloading.
The detection mechanisms operate across multiple layers:
Network layer: IP reputation scoring. Is this IP from a known datacenter range? Has it made suspicious request patterns? Is it on a blocklist?
Transport layer: TLS fingerprinting. Most HTTP client libraries produce a distinctive TLS handshake signature. Python requests looks different from a real browser, and automated browsers such as headless Chrome expose their own telltale signals. The platforms catalog these signatures.
Application layer: Request timing analysis. Humans don’t download videos at perfectly regular intervals. Humans don’t request 100 video metadata pages in 60 seconds. Humans scroll, pause, click, go back.
Behavioral layer: JavaScript execution, cookie persistence, session continuity, mouse movement patterns, scroll depth. The platforms run fingerprinting scripts that collect hundreds of signals and feed them into classifiers trained on billions of human sessions.
The result: A naive Python script using requests.get() with a single IP address will be blocked within minutes. A headless browser from a cloud server will be detected within hours. A datacenter proxy pool will last days before the entire IP range is flagged.
You need infrastructure that looks like the actual users who uploaded and watch this content. You need residential proxies.
A residential proxy routes your requests through IP addresses assigned to real households by real internet service providers. Verizon in Brooklyn. Comcast in Phoenix. BT in London. Deutsche Telekom in Berlin. NTT in Tokyo.
To the platform’s anti-bot systems, this traffic looks at the network layer like a person watching videos on home WiFi, because the IP address really is assigned to a household connection.
The key capabilities for world model data collection:
Rotation control. Per-request rotation for metadata scraping where you need maximum distribution. Sticky sessions for 10-30 minutes when you’re downloading a multi-part video or maintaining authentication state.
Geographic precision. City-level targeting to collect videos from specific regions. A world model that needs to understand driving in Bangalore should collect videos from Bangalore, not from a generic “Asia” IP pool.
Scale without patterns. With 50 million IPs, you can spread millions of requests so thinly that any single address is rarely reused. That leaves little per-IP pattern to detect and little history from which to build a fingerprint.
Session persistence. Some data sources require login, multi-step navigation, or maintaining state across requests. Sticky sessions let you keep the same IP for a logical session without losing the distribution benefits.
ThorData’s residential proxy network provides exactly this infrastructure. 50 million IPs across 195 countries, with rotation granularity down to individual requests and session stickiness up to 30 minutes. Sub-second latency for real-time pipeline integration.
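How thinly requests spread across a pool can be estimated rather than guessed. The sketch below assumes each request draws an IP uniformly at random from the pool, which is a simplification (real rotation also depends on geography, availability, and session settings). The function name is illustrative; the pool size is the 50 million figure quoted above.

```python
import math


def expected_reuse_fraction(pool_size, requests):
    """Fraction of requests expected to land on an already-used IP,
    assuming each request draws an IP uniformly at random from the pool."""
    # Expected number of distinct IPs after `requests` uniform draws:
    #   pool_size * (1 - (1 - 1/pool_size) ** requests)
    # expm1/log1p keep the result accurate when 1/pool_size is tiny.
    distinct = -pool_size * math.expm1(requests * math.log1p(-1.0 / pool_size))
    return 1.0 - distinct / requests


POOL_SIZE = 50_000_000  # pool size quoted in the text

for n in (10_000, 1_000_000, 10_000_000):
    share = expected_reuse_fraction(POOL_SIZE, n)
    print(f"{n:>10,} requests: ~{share:.2%} expected to reuse an IP")
```

Under this model, about 1% of one million requests would hit an address that was already used. That is a low rate of reuse, though not zero, and it rises as request volume approaches the pool size.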
Here’s how a serious world model training pipeline looks in practice. Not a toy script. A system designed to collect and process petabytes of video.
Orchestrator (Kubernetes / Airflow)
│
├── Discovery Workers (100-1000 concurrent)
│   │
│   ├── SERP API queries (Google Video, YouTube search)
│   ├── Platform-specific API calls (where available)
│   └── Social media monitoring (trending hashtags, viral content)
│       │
│       └── ThorData Residential Proxy (per-request rotation)
│           │
│           └── Target Platforms (YouTube, TikTok, Instagram, etc.)
│
├── Metadata Store (PostgreSQL / ClickHouse)
│   │
│   └── Deduplication, quality scoring, licensing checks
│
├── Download Workers (50-200 concurrent)
│   │
│   ├── yt-dlp / ffmpeg pipelines
│   ├── Format normalization (MP4, resolution standardization)
│   └── ThorData Residential Proxy (sticky sessions for large files)
│       │
│       └── Video CDN endpoints
│
├── Preprocessing Cluster (Spark / Ray)
│   │
│   ├── Scene detection and segmentation
│   ├── Optical flow extraction
│   ├── Audio separation and transcription
│   └── Text alignment (captions, comments, descriptions)
│
└── Training Store (S3 / GCS with lifecycle policies)
    │
    ├── Hot: Recent data for active training
    ├── Warm: Processed datasets for experiments
    └── Cold: Raw archives for compliance and reprocessing
The proxy layer isn’t an afterthought. It’s the critical path. Every request to every platform goes through it. If the proxy layer fails, the entire pipeline stalls.
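The handoff between discovery and download workers runs through the metadata store, which deduplicates and scores candidates before anything is downloaded. The sketch below is an in-memory stand-in for that stage, not the PostgreSQL/ClickHouse schema itself. The class name, the `min_score` threshold, and the method names are assumptions made for illustration; the `video_id` and `query_match_score` fields mirror the discovery code that follows.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataStore:
    """In-memory stand-in for the PostgreSQL/ClickHouse metadata store."""
    min_score: int = 10  # hypothetical quality floor for download candidates
    _records: dict = field(default_factory=dict)
    _queued: set = field(default_factory=set)

    def add(self, video):
        """Insert a discovered video; return True if its ID was new."""
        vid = video["video_id"]
        existing = self._records.get(vid)
        if existing is None:
            self._records[vid] = video
            return True
        # Duplicate ID: keep whichever record scored higher.
        if video.get("query_match_score", 0) > existing.get("query_match_score", 0):
            self._records[vid] = video
        return False

    def next_batch(self, size):
        """Hand the highest-scoring, not-yet-queued videos to download workers."""
        pending = [
            v for k, v in self._records.items()
            if k not in self._queued and v.get("query_match_score", 0) >= self.min_score
        ]
        pending.sort(key=lambda v: v.get("query_match_score", 0), reverse=True)
        batch = pending[:size]
        self._queued.update(v["video_id"] for v in batch)
        return batch


store = MetadataStore()
store.add({"video_id": "a1", "query_match_score": 25})
store.add({"video_id": "a1", "query_match_score": 10})  # duplicate, lower score
store.add({"video_id": "b2", "query_match_score": 5})   # below min_score
store.add({"video_id": "c3", "query_match_score": 15})
print([v["video_id"] for v in store.next_batch(10)])  # ['a1', 'c3']
```

A production store would also record licensing status and persist the queue, but the same two operations (deduplicate on insert, score-ordered dequeue) carry over.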
import re
import requests
from datetime import datetime, timezone
from urllib.parse import urlsplit, parse_qs

# Placeholder credentials; replace with your own gateway details.
THORDATA_PROXY = "http://user:pass@gate.thordata.com:10000"


def proxy_for(country=None, session_id=None):
    """
    Return the proxy URL with targeting options added.
    ASSUMPTION: options are passed as suffixes on the proxy username, a common
    provider convention. The suffix syntax here is illustrative only; check
    your provider's documentation for the real format.
    """
    parts = urlsplit(THORDATA_PROXY)
    username = parts.username
    if country:
        username += f"-country-{country}"
    if session_id:
        username += f"-session-{session_id}"
    return f"{parts.scheme}://{username}:{parts.password}@{parts.hostname}:{parts.port}"


class WorldModelDataDiscovery:
    """
    Discover training videos with geographic and demographic targeting.
    Critical for world models that need diverse physical environments.
    """
    GEO_TARGETS = {
        "urban_driving": ["us", "de", "jp", "cn"],
        "rural_environments": ["br", "in", "ke", "au"],
        "dense_crowds": ["in", "bd", "cn", "id"],
        "industrial_processes": ["de", "cn", "us", "kr"],
        "domestic_indoor": ["us", "gb", "jp", "fr"]
    }

    def __init__(self):
        self.session = requests.Session()
        self.discovered = []

    def search_with_geo(self, query, scenario_type, max_results=50):
        """
        Execute the same search from multiple geographic perspectives.
        YouTube's search results are personalized by region.
        """
        countries = self.GEO_TARGETS.get(scenario_type, ["us"])
        all_results = []
        for country in countries:
            proxy = proxy_for(country=country)
            # YouTube search via Invidious or similar API
            # Or Google Video SERP API
            results = self._execute_search(query, proxy, country, max_results)
            all_results.extend(results)
            print(f"Collected {len(results)} from {country}")
        # Deduplicate by video ID (the last country to return an ID wins)
        unique = {r["video_id"]: r for r in all_results}
        for video in unique.values():
            video["scenario_type"] = scenario_type  # read later when organizing downloads
        return list(unique.values())

    def _execute_search(self, query, proxy, country, max_results):
        """
        Execute search through residential proxy from specific country.
        """
        # Using a SERP API or direct platform API
        # Example with SerpApi Google Video search
        params = {
            "engine": "google",
            "q": f"{query} video",
            "tbm": "vid",
            "num": max_results,
            "gl": country,  # Geographic location parameter
            "api_key": "your_serp_api_key"
        }
        try:
            # Pass proxies per request so one country's proxy never leaks into the next
            response = self.session.get(
                "https://serpapi.com/search",
                params=params,
                proxies={"http": proxy, "https": proxy},
                timeout=30
            )
            response.raise_for_status()
            return self._parse_video_results(response.json(), country)
        except requests.exceptions.RequestException as e:
            print(f"Search failed for {country}: {e}")
            return []

    def _parse_video_results(self, data, country):
        """Extract structured metadata from search results."""
        videos = []
        for result in data.get("video_results", []):
            video = {
                "video_id": self._extract_id(result.get("link", "")),
                "title": result.get("title", ""),
                "url": result.get("link", ""),
                "duration": self._parse_duration(result.get("duration", "0:00")),
                "thumbnail": result.get("thumbnail", ""),
                "source_country": country,
                "platform": self._detect_platform(result.get("link", "")),
                "discovered_at": datetime.now(timezone.utc).isoformat(),
                "query_match_score": self._relevance_score(result)
            }
            videos.append(video)
        return videos

    def _extract_id(self, url):
        """Extract the YouTube video ID; fall back to the full URL elsewhere."""
        parts = urlsplit(url)
        host = parts.netloc.lower()
        if host.endswith("youtu.be"):
            return parts.path.lstrip("/") or url
        if "youtube.com" in host:
            video_id = parse_qs(parts.query).get("v", [""])[0]
            if video_id:
                return video_id
            match = re.match(r"^/(?:shorts|embed)/([^/?]+)", parts.path)
            if match:
                return match.group(1)
        return url  # Fallback: the URL itself serves as the unique key

    def _parse_duration(self, duration_str):
        """Convert an "MM:SS" or "HH:MM:SS" duration to seconds (0 if unparseable)."""
        try:
            parts = [int(p) for p in duration_str.split(":")]
        except (ValueError, AttributeError):
            return 0
        if len(parts) == 2:
            return parts[0] * 60 + parts[1]
        elif len(parts) == 3:
            return parts[0] * 3600 + parts[1] * 60 + parts[2]
        return 0

    def _detect_platform(self, url):
        url_lower = url.lower()
        if "youtube" in url_lower or "youtu.be" in url_lower:
            return "youtube"
        elif "tiktok" in url_lower:
            return "tiktok"
        elif "instagram" in url_lower:
            return "instagram"
        elif "twitter" in url_lower or "x.com" in url_lower:
            return "twitter"
        return "unknown"

    def _relevance_score(self, result):
        """
        Score how likely this video is useful for world model training.
        Prefer titles that suggest clear physical action.
        """
        score = 0
        title = result.get("title", "").lower()
        # Boost for action-oriented content
        action_keywords = ["driving", "walking", "cooking", "playing",
                           "construction", "factory", "sports", "tutorial"]
        for kw in action_keywords:
            if kw in title:
                score += 10
        # Boost for first-person indicators
        if any(x in title for x in ["pov", "gopro", "first person", "walking around"]):
            score += 15
        return score


# Usage: Collect diverse urban driving footage
discovery = WorldModelDataDiscovery()
videos = discovery.search_with_geo(
    query="driving through city traffic",
    scenario_type="urban_driving",
    max_results=100
)
print(f"Discovered {len(videos)} unique videos across {len(set(v['source_country'] for v in videos))} countries")
import hashlib
import os
from datetime import datetime
from urllib.parse import urlsplit

import yt_dlp

# Placeholder credentials; replace with your own gateway details.
THORDATA_PROXY = "http://user:pass@gate.thordata.com:10000"


def sticky_proxy_for(session_id):
    """
    Return a proxy URL pinned to one session.
    ASSUMPTION: the session is selected with a suffix on the proxy username.
    The suffix syntax is illustrative only; check your provider's documentation.
    """
    parts = urlsplit(THORDATA_PROXY)
    username = f"{parts.username}-session-{session_id}"
    return f"{parts.scheme}://{username}:{parts.password}@{parts.hostname}:{parts.port}"


class WorldModelVideoDownloader:
    """
    Download videos with sticky sessions for large files
    and automatic quality selection for training efficiency.
    """

    def __init__(self, output_base="./training_data"):
        self.output_base = output_base
        os.makedirs(output_base, exist_ok=True)
        self.download_stats = {
            "attempted": 0,
            "successful": 0,
            "failed": 0,
            "bytes_downloaded": 0
        }

    def download_for_training(self, video_metadata, quality="720"):
        """
        Download video optimized for world model training.
        720p provides sufficient detail without excessive storage.
        """
        video_id = video_metadata["video_id"]
        scenario = video_metadata.get("scenario_type", "general")
        country = video_metadata.get("source_country", "unknown")
        # Organize by scenario and geography for balanced sampling
        output_dir = os.path.join(
            self.output_base,
            scenario,
            country,
            datetime.now().strftime("%Y-%m")
        )
        os.makedirs(output_dir, exist_ok=True)
        # Sticky session for this download
        # Large files may take minutes; session persistence prevents mid-download blocks
        # Hash the ID so the session name is safe even when video_id is a full URL
        session_id = "dl_" + hashlib.md5(video_id.encode("utf-8")).hexdigest()[:8]
        sticky_proxy = sticky_proxy_for(session_id)
        ydl_opts = {
            # Prefer separate video+audio streams, fall back to a pre-merged file
            'format': f'bestvideo[height<={quality}]+bestaudio/best[height<={quality}]',
            'proxy': sticky_proxy,
            'outtmpl': os.path.join(output_dir, '%(id)s_%(title)s.%(ext)s'),
            # Metadata extraction for training alignment
            'writethumbnail': True,
            'writeinfojson': True,
            'writesubtitles': True,
            'writeautomaticsub': True,
            # Post-processing for training format
            'postprocessors': [
                {
                    'key': 'FFmpegVideoConvertor',
                    'preferedformat': 'mp4',  # yt-dlp spells this option with one "r"
                }
            ],
            # Reliability
            'retries': 5,
            'fragment_retries': 5,
            'skip_unavailable_fragments': True,
            'quiet': True,
        }
        self.download_stats["attempted"] += 1
        try:
            with yt_dlp.YoutubeDL(ydl_opts) as ydl:
                info = ydl.extract_info(video_metadata["url"], download=True)
                filepath = ydl.prepare_filename(info)
                # The convertor may have changed the extension to .mp4
                if not os.path.exists(filepath):
                    filepath = os.path.splitext(filepath)[0] + ".mp4"
                file_size = os.path.getsize(filepath) if os.path.exists(filepath) else 0
                self.download_stats["successful"] += 1
                self.download_stats["bytes_downloaded"] += file_size
                return {
                    "success": True,
                    "filepath": filepath,
                    "metadata": {
                        "duration": info.get("duration"),
                        "resolution": info.get("resolution"),
                        "fps": info.get("fps"),
                        "file_size": file_size,
                        "source": video_metadata
                    }
                }
        except Exception as e:
            self.download_stats["failed"] += 1
            return {
                "success": False,
                "error": str(e),
                "video_id": video_id
            }

    def get_stats(self):
        """Return download statistics for pipeline monitoring."""
        total = self.download_stats["attempted"]
        if total == 0:
            return self.download_stats
        return {
            **self.download_stats,
            "success_rate": self.download_stats["successful"] / total,
            "average_size": (self.download_stats["bytes_downloaded"] /
                             max(self.download_stats["successful"], 1))
        }
Here’s a concrete example of why this matters. NVIDIA’s Cosmos paper emphasizes that their training data includes “real-world environments, human interactions, and physical dynamics.”
But consider: a world model trained primarily on US and European video will struggle with environments it has rarely seen, such as the Mumbai traffic and Tokyo crowds mentioned earlier.
Without geographic targeting in your data collection, you’re building a world model that understands a narrow slice of the world. Residential proxies with country and city-level targeting solve this by letting you collect from the actual regions you need to represent.
ThorData’s geographic targeting goes down to the city level across 195 countries. Need videos from São Paulo specifically, not just “Brazil”? That’s possible. Need first-person footage from Tokyo’s Shibuya crossing? Targetable.
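Geographic targeting only helps if you check what the collection actually contains. The sketch below measures each country's share of a metadata list and caps over-represented countries. It assumes records carry the `source_country` field used in the discovery code; the function names and the cap value are illustrative, and a real pipeline would likely balance on scenario and duration as well.

```python
from collections import Counter, defaultdict


def country_share(videos):
    """Return each source country's share of the collection (0.0 to 1.0)."""
    counts = Counter(v.get("source_country", "unknown") for v in videos)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()} if total else {}


def cap_per_country(videos, max_per_country):
    """Keep at most `max_per_country` videos from each country, preserving order."""
    kept, seen = [], defaultdict(int)
    for v in videos:
        country = v.get("source_country", "unknown")
        if seen[country] < max_per_country:
            seen[country] += 1
            kept.append(v)
    return kept


sample = ([{"source_country": "us"}] * 6 +
          [{"source_country": "in"}] * 3 +
          [{"source_country": "br"}] * 1)
print(country_share(sample))                      # us dominates at 60%
print(country_share(cap_per_country(sample, 3)))  # us and in now tie
```

Capping discards data, so the other half of the fix is to go back and collect more from the under-represented countries.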
Raw collection is just the start. For world model training, you need filtering:
| Filter | Purpose | Implementation |
| --- | --- | --- |
| Motion blur detection | Blurry frames teach nothing about physics | Laplacian variance threshold |
| Static scene removal | No temporal dynamics = no causal learning | Frame difference analysis |
| Text overlay detection | Watermarks and subtitles contaminate visual learning | OCR-based filtering |
| Resolution minimum | Too low-res loses fine physical details | 480p floor, 720p preferred |
| Duration sweet spot | Too short lacks context; too long is inefficient | 30 seconds to 5 minutes |
| Audio-visual alignment | Mismatched audio and video breaks multimodal learning | Sync detection algorithms |
These filters run after download but before entering your training store. The pipeline architecture matters because you’re processing 100x the data you’ll actually keep.
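Several of the filters in the table reduce to a few lines of arithmetic on pixel values. The sketch below implements the Laplacian variance, frame difference, resolution, and duration checks in pure Python on tiny synthetic frames so it runs without dependencies. The thresholds are illustrative assumptions that need tuning on real footage, and a production pipeline would use an image library on sampled frames, not nested lists.

```python
from statistics import pvariance


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over a 2D grayscale frame.
    Low values suggest a blurry (or featureless) frame."""
    h, w = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1] - 4 * gray[y][x]
        for y in range(1, h - 1) for x in range(1, w - 1)
    ]
    return pvariance(responses)


def mean_abs_diff(frame_a, frame_b):
    """Mean absolute pixel difference between two equally sized frames."""
    total = sum(abs(a - b) for row_a, row_b in zip(frame_a, frame_b)
                for a, b in zip(row_a, row_b))
    return total / (len(frame_a) * len(frame_a[0]))


def passes_filters(frames, duration_s, height,
                   blur_threshold=100.0, motion_threshold=2.0):
    """Apply the resolution, duration, blur, and static-scene checks.
    Threshold values are illustrative assumptions, not tuned settings."""
    if height < 480 or not (30 <= duration_s <= 300):
        return False
    # Strict policy for the sketch: one blurry sampled frame rejects the clip.
    if any(laplacian_variance(f) < blur_threshold for f in frames):
        return False
    motion = [mean_abs_diff(a, b) for a, b in zip(frames, frames[1:])]
    return bool(motion) and sum(motion) / len(motion) >= motion_threshold


# Synthetic frames: a sharp checkerboard, its inverse (motion), and a flat gray frame.
sharp = [[255 * ((x + y) % 2) for x in range(8)] for y in range(8)]
inverse = [[255 - p for p in row] for row in sharp]
flat = [[128] * 8 for _ in range(8)]

print(passes_filters([sharp, inverse], duration_s=60, height=720))  # True
print(passes_filters([flat, flat], duration_s=60, height=720))      # False: no detail
print(passes_filters([sharp, sharp], duration_s=60, height=720))    # False: static scene
```

Running the cheap checks (resolution, duration) first keeps the per-frame work off clips that would be rejected anyway.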
| Stage | Videos/Month | Proxy Traffic | Storage | Total Monthly Cost | Timeline |
| --- | --- | --- | --- | --- | --- |
| Research prototype | 10K | 500 GB | 2 TB | $800 | Months 1-3 |
| Pilot model | 100K | 5 TB | 20 TB | $3,500 | Months 4-6 |
| Production v1 | 1M | 50 TB | 200 TB | $15,000 | Months 7-12 |
| Scale model | 10M | 500 TB | 2 PB | $80,000 | Year 2 |
Proxy costs are typically 15-20% of total infrastructure. The alternative (getting blocked, missing data, building biased models) is far more expensive in lost training time and model performance.
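The table and the 15-20% figure can be turned into per-video and proxy-budget numbers with simple arithmetic. The sketch below uses only the values stated in the table and the text; the function names are illustrative, and the results are planning estimates, not quoted prices.

```python
# Figures copied from the cost table: (stage, videos per month, total monthly cost in USD).
STAGES = [
    ("Research prototype", 10_000, 800),
    ("Pilot model", 100_000, 3_500),
    ("Production v1", 1_000_000, 15_000),
    ("Scale model", 10_000_000, 80_000),
]
PROXY_SHARE = (0.15, 0.20)  # "15-20% of total infrastructure" from the text


def cost_per_video(total_cost, videos):
    """Total monthly cost divided by videos collected that month."""
    return total_cost / videos


def proxy_cost_range(total_cost, share=PROXY_SHARE):
    """Low and high estimates of the monthly proxy spend."""
    return total_cost * share[0], total_cost * share[1]


for stage, videos, total in STAGES:
    low, high = proxy_cost_range(total)
    print(f"{stage:<20} ${cost_per_video(total, videos):.4f}/video  "
          f"proxy ~${low:,.0f}-${high:,.0f}/month")
```

The per-video cost falls by roughly a factor of ten between the first and last rows, which is the economy of scale the table implies.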
World model training is the next frontier of AI. The architectures are public. The compute is available. The differentiator is data. Not just volume, but diversity, quality, and geographic coverage.
The platforms that host this data are not your friends. They are businesses with their own priorities, and they will block you without hesitation. Residential proxies are not a workaround. They are the infrastructure layer that makes ethical, large-scale data collection possible.
Your world model is only as good as the world it has seen. Make sure it sees the whole world.
Start building your world model data pipeline with ThorData residential proxies