BeautifulSoup Tutorial 2026: Parse HTML Data With Python

Kael Odin
Last updated on February 28, 2026
12 min read

BeautifulSoup Tutorial: How to Parse Web Data With Python (2026)

HTML pages are everywhere in 2026: product catalogs, job boards, pricing tables, documentation, news sites, and more. If you work with Python, BeautifulSoup is still one of the fastest ways to turn that raw HTML into structured data you can search, analyze, and feed into downstream systems.

This tutorial walks through a complete, copy-paste ready workflow: you’ll start with a small sample HTML file, learn how to parse it with BeautifulSoup, then move on to real HTTP responses, CSS selectors, and exporting data to CSV. Along the way, you’ll see how the same patterns scale to larger web scraping projects powered by managed infrastructure, so you don’t have to maintain brittle scrapers yourself—and how to combine this parser with solid Python basics like those covered in our syntax error and debugging guides.

Key Takeaways:
• Install and configure beautifulsoup4 and requests in a clean Python environment
• Parse a local HTML file and learn the core BeautifulSoup APIs: find, find_all, and select
• Build a practical parser that extracts product-like data and exports it to CSV
• Understand the limits of BeautifulSoup on JavaScript-heavy pages and how to combine it with managed scraping solutions
• Get a quick-reference table of the most common parsing patterns you’ll use daily

1. Setup: Install BeautifulSoup and Requests

We’ll assume you already have Python 3.10+ installed. If you’re on Windows, make sure you checked the “Add Python to PATH” box during installation so commands like python and pip work in your terminal.

Create and activate a virtual environment

python -m venv .venv

# Windows PowerShell
.\.venv\Scripts\Activate.ps1

# macOS / Linux
source .venv/bin/activate

Install BeautifulSoup and Requests

We’ll use beautifulsoup4 for parsing and requests for making HTTP calls. Optionally, you can install lxml for faster parsing:

pip install beautifulsoup4 requests lxml
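
You can confirm the installation worked by importing each package and printing its version:

```python
# Quick sanity check: if these imports succeed, the environment is ready.
import bs4
import requests

print("beautifulsoup4:", bs4.__version__)
print("requests:", requests.__version__)
```

If either import raises ImportError, double-check that your virtual environment is activated before running pip.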

2. Create a Sample HTML File

To understand the basics of BeautifulSoup, we’ll start with a simple, static HTML snippet representing a product list. Save the following content as sample_products.html in your project directory:

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>Sample Product List</title>
  </head>
  <body>
    <h1>Top Selling Products</h1>

    <ul id="products">
      <li class="product" data-sku="A100">
        <span class="name">Data Center Proxy Plan</span>
        <span class="price">49.00</span>
        <span class="currency">USD</span>
      </li>
      <li class="product" data-sku="A200">
        <span class="name">Residential Proxy Plan</span>
        <span class="price">99.00</span>
        <span class="currency">USD</span>
      </li>
      <li class="product featured" data-sku="A300">
        <span class="name">Web Scraper API Bundle</span>
        <span class="price">199.00</span>
        <span class="currency">USD</span>
      </li>
    </ul>
  </body>
</html>

This HTML is much simpler than a real e-commerce page, but it’s perfect for learning the core BeautifulSoup patterns.

3. First Steps With BeautifulSoup: Load and Inspect

Create a Python file named beautifulsoup_intro.py and paste the following code. This loads your local HTML file and prints out the top-level tags:

from bs4 import BeautifulSoup

HTML_FILE = "sample_products.html"

with open(HTML_FILE, "r", encoding="utf-8") as f:
    html = f.read()

soup = BeautifulSoup(html, "html.parser")

print("Document title:", soup.title.string)
print("Main heading:", soup.h1.string)

print("\nAll direct children of <body>:")
for child in soup.body.children:
    if getattr(child, "name", None):
        print(" -", child.name)

Run it:

python beautifulsoup_intro.py

You should see output similar to:

Document title: Sample Product List
Main heading: Top Selling Products

All direct children of <body>:
 - h1
 - ul

4. Finding Elements: find, find_all, and select

BeautifulSoup provides several powerful methods for locating elements:

Method | Use Case | Example
------ | -------- | -------
find() | First match | soup.find("ul", id="products")
find_all() | All matches | soup.find_all("li", class_="product")
select() | CSS selectors | soup.select("ul#products li.product")
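
The three methods can be compared side by side on a small inline HTML string (a minimal sketch, not the sample file itself):

```python
from bs4 import BeautifulSoup

html = """
<ul id="products">
  <li class="product">A</li>
  <li class="product featured">B</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching Tag (or None if nothing matches).
ul = soup.find("ul", id="products")
print(ul.name)  # ul

# find_all() returns a list of all matching Tags.
items = soup.find_all("li", class_="product")
print(len(items))  # 2

# select() accepts CSS selectors and also returns a list.
featured = soup.select("ul#products li.featured")
print(featured[0].get_text(strip=True))  # B
```

Note that class_="product" also matches the second <li>, because class matching checks membership in the element's class list, not equality.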

Extract product data into Python dicts

Let’s parse all products into a list of dictionaries. Create parse_products.py:

from bs4 import BeautifulSoup
from pathlib import Path

HTML_FILE = "sample_products.html"

def parse_products(html: str):
  soup = BeautifulSoup(html, "html.parser")
  items = []

  for li in soup.select("ul#products li.product"):
    name_el = li.select_one(".name")
    price_el = li.select_one(".price")
    currency_el = li.select_one(".currency")

    items.append(
      {
        "sku": li.get("data-sku"),
        "name": name_el.get_text(strip=True) if name_el else "",
        "price": float(price_el.get_text(strip=True)) if price_el else None,
        "currency": currency_el.get_text(strip=True) if currency_el else "",
        "featured": "featured" in li.get("class", []),
      }
    )

  return items


def main() -> None:
  html = Path(HTML_FILE).read_text(encoding="utf-8")
  products = parse_products(html)

  print(f"Found {len(products)} products:")
  for p in products:
    print(f" - {p['sku']}: {p['name']} ({p['price']} {p['currency']})"
          + (" [FEATURED]" if p["featured"] else ""))


if __name__ == "__main__":
  main()

Run it and you should get a neatly formatted list of products extracted from your HTML file.

5. Parsing Real HTTP Responses With BeautifulSoup

So far we’ve worked with a local file. In real web scraping projects, you’ll usually fetch HTML over HTTP (using requests or a managed scraper) and then pass the response text to BeautifulSoup.

Here’s a minimal example that fetches https://httpbin.org/html and prints the main heading text:

import requests
from bs4 import BeautifulSoup

URL = "https://httpbin.org/html"

resp = requests.get(URL, timeout=10)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")

title = soup.find("h1")
print("Page heading:", title.get_text(strip=True) if title else "(not found)")

Important: Always review a website’s terms of service and robots.txt before scraping it. For production workloads, you should handle rate limits, retries, and IP rotation responsibly, or use a managed scraping platform to minimize operational and legal risk.
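
Python’s standard library includes urllib.robotparser for checking robots.txt rules programmatically. This sketch parses an illustrative ruleset inline; in practice you would point the parser at the site’s real robots.txt:

```python
from urllib.robotparser import RobotFileParser

# Parse an illustrative robots.txt. For a live site you would instead call
# rp.set_url("https://example.com/robots.txt") followed by rp.read().
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "https://example.com/public/page"))   # True
print(rp.can_fetch("*", "https://example.com/private/page"))  # False
```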

6. Export Parsed Data to CSV

Once you’ve parsed HTML into Python objects, you’ll often want to export that data into CSV for analysis in tools like Excel, Google Sheets, or a data warehouse.

Let’s extend our product parser to write a CSV file using pandas:

from bs4 import BeautifulSoup
from pathlib import Path
import pandas as pd

HTML_FILE = "sample_products.html"

def parse_products(html: str):
  soup = BeautifulSoup(html, "html.parser")
  items = []

  for li in soup.select("ul#products li.product"):
    name_el = li.select_one(".name")
    price_el = li.select_one(".price")
    currency_el = li.select_one(".currency")

    items.append(
      {
        "sku": li.get("data-sku"),
        "name": name_el.get_text(strip=True) if name_el else "",
        "price": float(price_el.get_text(strip=True)) if price_el else None,
        "currency": currency_el.get_text(strip=True) if currency_el else "",
      }
    )

  return items


def main() -> None:
  html = Path(HTML_FILE).read_text(encoding="utf-8")
  products = parse_products(html)

  df = pd.DataFrame(products)
  df.to_csv("products.csv", index=False, encoding="utf-8")
  print("Exported products.csv with", len(df), "rows")


if __name__ == "__main__":
  main()

After running this script, you should see a new products.csv file in your project directory containing the parsed data.
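
If you would rather not add pandas as a dependency, the standard library’s csv module can write the same file. This is a sketch assuming the list-of-dicts shape that parse_products() returns:

```python
import csv

# Example rows in the same shape parse_products() produces.
products = [
    {"sku": "A100", "name": "Data Center Proxy Plan", "price": 49.0, "currency": "USD"},
    {"sku": "A200", "name": "Residential Proxy Plan", "price": 99.0, "currency": "USD"},
]

# newline="" prevents blank lines between rows on Windows.
with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["sku", "name", "price", "currency"])
    writer.writeheader()
    writer.writerows(products)
```

pandas is still the better choice once you need filtering, joins, or other analysis before export.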

7. CSS Selectors and Advanced Queries

BeautifulSoup’s select() and select_one() methods support a useful subset of CSS selectors. Here are a few patterns you’ll use frequently:

Pattern | Selector | Description
------- | -------- | -----------
By ID | soup.select_one("#products") | Element with id="products"
By class | soup.select(".product.featured") | All elements with both classes "product" and "featured"
Tag + class | soup.select("li.product .price") | All elements with class "price" inside <li class="product">
Attribute | soup.select('li[data-sku="A200"]') | Product with SKU A200
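
Each of these patterns can be tried against a trimmed-down, inline copy of the sample product list:

```python
from bs4 import BeautifulSoup

html = """
<ul id="products">
  <li class="product" data-sku="A100"><span class="price">49.00</span></li>
  <li class="product featured" data-sku="A300"><span class="price">199.00</span></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.select_one("#products").name)                         # ul
print(len(soup.select(".product.featured")))                     # 1
print([p.get_text() for p in soup.select("li.product .price")])  # ['49.00', '199.00']
print(soup.select_one('li[data-sku="A300"]')["class"])           # ['product', 'featured']
```

The last line shows that the class attribute comes back as a list of individual class names, which is why the "featured" check in parse_products() uses membership rather than string equality.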

8. Dynamic Pages and Managed Scraping

BeautifulSoup is perfect for parsing HTML you already have, but it doesn’t execute JavaScript. If your target pages are heavily dynamic (client-side rendering, infinite scroll, complex anti-bot protections), you’ll need an additional layer to render or fetch HTML reliably.

Many teams choose a hybrid approach: use a managed scraping platform to handle JavaScript rendering, IP rotation, and anti-bot logic, then feed the resulting HTML into BeautifulSoup. This separation lets your Python code stay small and focused on parsing and business logic, while the infrastructure concerns are handled elsewhere.
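
One way to structure this hybrid approach is to keep all fetching behind a single function, so swapping requests for a headless browser or a managed API later only touches one place. Here fetch_html is a hypothetical placeholder returning canned HTML, not a real API call:

```python
from bs4 import BeautifulSoup

def fetch_html(url: str) -> str:
    """Placeholder fetch layer: swap in requests, a headless browser,
    or a managed scraping API without touching the parsing code."""
    return '<ul id="products"><li class="product">Demo</li></ul>'

def parse_names(html: str) -> list[str]:
    """Pure parsing logic: takes HTML in, returns structured data out."""
    soup = BeautifulSoup(html, "html.parser")
    return [li.get_text(strip=True) for li in soup.select("ul#products li.product")]

print(parse_names(fetch_html("https://example.com/products")))  # ['Demo']
```

Because parse_names() only depends on an HTML string, it is also trivial to unit test with fixture files, independent of any network access.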

For example, Thordata provides scraping APIs and tools designed to return clean, structured results from complex targets. You can manage your API tokens, monitor usage, and configure scraping jobs in the Thordata Dashboard, while keeping your parsing logic in Python with BeautifulSoup. To see how Thordata’s Python SDK works in practice, check out the open source repository here: Thordata Python SDK.

9. Common BeautifulSoup Mistakes (and How to Avoid Them)

  • Not checking for missing elements: Directly calling .get_text() on None will raise an exception. Always guard with if el or use helper functions.
  • Using the wrong parser: If you see odd parsing behavior, try switching from "html.parser" to "lxml" (after installing lxml).
  • Ignoring encoding: When reading local files or HTTP responses, make sure to use the correct encoding (often UTF-8) to avoid garbled characters.
  • Scraping dynamic content directly: If elements are rendered by JavaScript, the raw HTML fetched with requests may not include them. Use a headless browser or a managed scraper to get the final HTML.
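
The first mistake is worth a small helper. This sketch wraps the None check once so every call site stays short:

```python
from bs4 import BeautifulSoup

def safe_text(parent, selector: str, default: str = "") -> str:
    """Return the stripped text of the first match for the selector,
    or a default value if nothing matches (instead of raising on None)."""
    el = parent.select_one(selector)
    return el.get_text(strip=True) if el else default

soup = BeautifulSoup('<li><span class="name">Widget</span></li>', "html.parser")
print(safe_text(soup, ".name"))                    # Widget
print(safe_text(soup, ".missing", default="n/a"))  # n/a
```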

10. Quick Reference

Goal | BeautifulSoup Pattern
---- | ---------------------
Create soup | soup = BeautifulSoup(html, "html.parser")
Find first tag | soup.find("h1")
Find all tags | soup.find_all("li")
Find by id | soup.find("ul", id="products")
Find by class | soup.find_all("li", class_="product")
CSS selector | soup.select("ul#products li.product .price")
Get text | element.get_text(strip=True)
Get attribute | element["data-sku"] or element.get("href")
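
One subtlety in the last row: element["attr"] raises KeyError when the attribute is missing, while element.get("attr") returns None, so get() is the safer choice for optional attributes:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="/docs">Docs</a>', "html.parser")
link = soup.find("a")

print(link["href"])       # /docs
print(link.get("title"))  # None (missing attribute, no exception)

try:
    link["title"]         # missing attribute: raises KeyError
except KeyError:
    print("KeyError for missing attribute")
```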


Frequently asked questions

Is BeautifulSoup easy to learn?

Yes. BeautifulSoup has a relatively low learning curve. If you understand basic Python and HTML, you can start extracting data quickly using methods like find, find_all, and select. The official documentation also includes many examples to help you progress.

Is BeautifulSoup enough for production web scraping?

BeautifulSoup is excellent for parsing HTML and XML, but it doesn’t handle JavaScript rendering, IP rotation, or large-scale crawling. For serious production workloads, you typically combine BeautifulSoup with robust HTTP clients, headless browsers, or managed scraping platforms that provide anti-bot handling and infrastructure.

Should I use BeautifulSoup or Scrapy?

Use BeautifulSoup for smaller tasks and focused HTML parsing when you already have the page content. Scrapy is a full web scraping framework that adds built-in crawling, concurrency, and pipeline features. Many teams start with BeautifulSoup and later adopt Scrapy or managed scraping solutions as their projects grow.

Can I use BeautifulSoup with other data tools?

Absolutely. BeautifulSoup works well with libraries like pandas and SQLAlchemy, and with cloud storage or data warehouses. It’s common to parse HTML with BeautifulSoup, turn the results into a pandas DataFrame, and then export to CSV, Parquet, or a database for downstream analysis.

About the author

Kael is a Senior Technical Copywriter at Thordata. He works closely with data engineers to document best practices for web scraping, HTML parsing, and API integrations. His focus is on creating hands-on tutorials that can be copied, run, and adapted to real-world projects.

The thordata Blog offers all its content in its original form and solely for informational purposes. We make no guarantees regarding the information found on the thordata Blog or any external sites it may direct you to. Always seek legal counsel and thoroughly review the specific terms of service of any website before engaging in any scraping endeavors, or obtain a scraping permit if required.