Building an Internal Linking Opportunity Finder
Internal linking is one of the highest-impact, most under-executed areas of technical SEO — and manually identifying opportunities across large sites is tedious and inconsistent. In this tutorial you'll build a tool that reads a sitemap and a set of target keywords, fetches each page, and produces a prioritised report of internal linking opportunities. It's a genuinely useful tool your team will use on real client work.
By the end of this tutorial, you will:
- Understand how to plan a multi-step tool before writing any code
- Learn how to work with XML sitemaps programmatically
- Understand basic keyword-in-content matching techniques
- Build a tool that fetches live pages and analyses their content
- Produce a formatted HTML report suitable for client sharing
- Practice breaking complex tools into phases with Claude Code
1. What the Tool Does
Before writing a single line of code, it's worth being precise about what we're building. A vague brief leads to a vague tool. Here's the exact specification:
For each page on the site, the tool asks: "Does this page mention any of our target keywords — but doesn't already link to the target page for that keyword?" If yes, it's an internal linking opportunity. The report shows which page, which keyword was found, the suggested anchor text, and the target URL to link to.
The two input files you'll need
You'll need to prepare two files before building the tool. Here's what they look like:
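Here's a small keywords.csv to illustrate. The exact column names are an assumption for this tutorial; what matters is that each row pairs a keyword phrase with the page it should link to, plus an optional priority:

```csv
keyword,target_url,priority
core web vitals,https://example.com/blog/core-web-vitals-guide,high
technical seo audit,https://example.com/services/technical-seo-audit,high
crawl budget,https://example.com/blog/crawl-budget-optimisation,medium
page speed optimisation,https://example.com/services/page-speed,low
```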
What is a sitemap? An XML sitemap is a file, usually at yourdomain.com/sitemap.xml, that lists all the pages on a website. It's primarily for search engines, but it's also a convenient way for our tool to know which pages to check. View one by opening it in your browser.
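If you haven't seen one before, a trimmed-down sitemap looks like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/blog/site-speed-guide</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <url>
    <loc>https://example.com/services/seo-consultancy</loc>
  </url>
</urlset>
```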
2. Key Concepts Before We Build
Sitemaps are XML files. Python can read them using the built-in xml.etree.ElementTree library to extract all the URLs listed inside.
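A minimal sketch of that step, roughly what Claude Code is likely to write; the only non-obvious part is the sitemaps.org namespace, which the tag search has to include:

```python
import xml.etree.ElementTree as ET
import requests

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def get_sitemap_urls(sitemap: str) -> list[str]:
    """Return every <loc> URL from a sitemap (local file path or live URL)."""
    if sitemap.startswith("http"):
        xml_bytes = requests.get(sitemap, timeout=30).content
    else:
        with open(sitemap, "rb") as f:
            xml_bytes = f.read()
    root = ET.fromstring(xml_bytes)  # pass bytes so the XML encoding declaration is handled for us
    # Every page is a <url><loc>...</loc></url> entry in the sitemaps.org namespace.
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]
```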
We fetch each page with requests and use BeautifulSoup to extract the visible text and existing links from the page's HTML.
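In code, that step looks roughly like this (a sketch; the User-Agent string is arbitrary, and the real script should add error handling for pages that time out or return errors):

```python
import requests
from bs4 import BeautifulSoup

def fetch_page(url: str) -> tuple[str, set[str]]:
    """Return (visible text, set of outbound link hrefs) for one page."""
    response = requests.get(url, timeout=30, headers={"User-Agent": "internal-link-finder"})
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Drop script/style blocks so their contents don't count as "visible text".
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    text = soup.get_text(separator=" ")
    links = {a["href"] for a in soup.find_all("a", href=True)}
    return text, links
```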
We check if a keyword phrase appears in a page's text content (case-insensitive). If it does, and the page doesn't already link to the target URL, we flag it.
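The check itself is only a few lines. Normalising URLs down to their path is an assumption added here (it isn't spelled out above), but without it relative links like /blog/... would never match absolute target URLs:

```python
from urllib.parse import urlparse

def normalise(url: str) -> str:
    """Reduce a URL to its path so relative and absolute links compare equally."""
    return urlparse(url).path.rstrip("/") or "/"

def find_opportunities(page_url, page_text, page_links, keywords):
    """keywords is a list of dicts with 'keyword' and 'target_url' keys (format assumed)."""
    existing = {normalise(link) for link in page_links}
    opportunities = []
    for kw in keywords:
        mentions_keyword = kw["keyword"].lower() in page_text.lower()
        already_linked = normalise(kw["target_url"]) in existing
        is_target_itself = normalise(page_url) == normalise(kw["target_url"])
        if mentions_keyword and not already_linked and not is_target_itself:
            opportunities.append({"page": page_url, **kw})
    return opportunities
```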
Instead of a plain CSV, we'll generate a styled HTML file — more readable for sharing with clients or colleagues, and openable in any browser.
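Generating the report can be as simple as building one HTML table row per opportunity. A stripped-down sketch (the real report would carry more styling and the suggested anchor text column):

```python
import os
from html import escape

def write_report(opportunities, path="output/internal_links_report.html"):
    """Write the opportunities to a simple HTML table with a summary line."""
    if os.path.dirname(path):
        os.makedirs(os.path.dirname(path), exist_ok=True)
    rows = "".join(
        f"<tr><td>{escape(o['page'])}</td><td>{escape(o['keyword'])}</td>"
        f"<td>{escape(o['target_url'])}</td><td>{escape(o.get('priority', ''))}</td></tr>"
        for o in opportunities
    )
    html = (
        "<html><head><style>"
        "body{font-family:sans-serif} table{border-collapse:collapse}"
        "td,th{border:1px solid #ccc;padding:6px 10px;text-align:left}"
        "</style></head><body>"
        f"<p><strong>{len(opportunities)}</strong> internal linking opportunities found.</p>"
        "<table><tr><th>Source Page</th><th>Keyword Found</th><th>Link To</th><th>Priority</th></tr>"
        f"{rows}</table></body></html>"
    )
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
```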
Why BeautifulSoup? It's the standard Python library for parsing HTML. It handles messy, real-world HTML gracefully. Claude Code will install it automatically when it builds the script. You don't need to do anything manually.
3. Planning the Build in Phases
Complex tools are much easier to build — and debug — when you break them into phases. Rather than asking Claude Code to build everything in one go, we'll work through four phases. Each phase produces something testable before we move on.
Phase 1: Build a function that reads a sitemap XML (either a local file or a live URL), extracts all page URLs, and prints them. Verify it works before continuing.
Phase 2: Build a function that fetches a single URL, strips HTML tags, and returns: (a) the visible text content, (b) all outbound links already on the page. Test on one URL before scaling.
Phase 3: Build the matching logic: for each page, for each keyword, check whether the keyword appears in the page text but the target URL is not in the existing links. Collect all matches.
Phase 4: Take the collected matches and write them to a styled HTML file, sorted by priority. Include a summary at the top showing total opportunities found.
4. Setting Up Your Project Files
Add two new files to your existing seo-tools folder from Tutorial 3:
Create your keywords.csv now using the format shown in Section 1. Use real keywords and target URLs from one of your clients, or create a fictional example to test with. Aim for 5–15 keywords to start.
5. Building the Tool — Phase by Phase
Opening Claude Code
Phase 1 prompt — Sitemap parser
Large sitemaps: Some sites have sitemaps with tens of thousands of URLs. For testing, it's fine — but when building Phase 3, we'll add a limit parameter so you don't accidentally fetch thousands of live pages in one run.
Once Phase 1 runs successfully and you see a URL count printed, move to Phase 2:
Phase 2 prompt — Page fetcher
Phase 3 prompt — Keyword matching
Phase 4 prompt — HTML report
6. What the Finished Script Looks Like
After all four phases, Claude Code will have built a script structured roughly like this. You don't need to type this — it's here for reference so you can understand what was built:
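Here's a rough sketch of that structure, assuming the four-phase breakdown above; the function names are illustrative and the bodies correspond to the snippets sketched in Section 2:

```python
"""scripts/internal_links.py - find internal linking opportunities (sketch of the structure)."""
import csv
import time

def get_sitemap_urls(sitemap):
    """Phase 1: parse the sitemap (local file or URL) and return all page URLs."""
    ...

def fetch_page(url):
    """Phase 2: return (visible text, set of existing outbound links) for one page."""
    ...

def find_opportunities(page_url, page_text, page_links, keywords):
    """Phase 3: return keyword matches where the target URL is not already linked."""
    ...

def write_report(opportunities, path):
    """Phase 4: write a styled HTML report, sorted by priority, with a summary line."""
    ...

def load_keywords(path="keywords.csv"):
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def main(sitemap, max_pages=50):
    keywords = load_keywords()
    urls = get_sitemap_urls(sitemap)[:max_pages]   # cap the run so you don't fetch thousands of pages
    opportunities = []
    for url in urls:
        text, links = fetch_page(url)
        opportunities.extend(find_opportunities(url, text, links, keywords))
        time.sleep(1)                              # be polite between requests
    write_report(opportunities, "output/internal_links_report.html")

if __name__ == "__main__":
    main("https://example.com/sitemap.xml", max_pages=50)
```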
7. Sample Output
The HTML report will open in any browser and look something like this:
| Source Page | Keyword Found | Suggested Anchor Text | Link To | Priority |
|---|---|---|---|---|
| /blog/site-speed-guide | core web vitals | core web vitals | /blog/core-web-vitals-guide | ● High |
| /services/seo-consultancy | technical seo audit | technical SEO audit | /services/technical-seo-audit | ● High |
| /blog/crawling-best-practices | crawl budget | crawl budget | /blog/crawl-budget-optimisation | ● Medium |
| /about | page speed optimisation | page speed optimisation | /services/page-speed | ● Low |
8. Useful Follow-Up Improvements
Once your base tool is working, here are valuable additions to ask Claude Code for in the same session:
| Improvement | What to say to Claude Code |
|---|---|
| Add context snippets to report | "Update the report to show a short text snippet (the sentence containing the keyword) in a tooltip or expandable row." |
| Skip non-content pages | "Add a filter to skip URLs containing /tag/, /category/, /author/, /page/, or /feed/ — these are archive pages we don't want to analyse." |
| Export to CSV as well | "In addition to the HTML report, also save the opportunities as a CSV at output/internal_links_report.csv." |
| Accept command-line arguments | "Add argparse support so I can run: python scripts/internal_links.py --sitemap https://example.com/sitemap.xml --max-pages 100" |
| Respect robots.txt | "Before fetching any pages, check the site's robots.txt and skip any URLs that are disallowed." |
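For example, if you ask for the command-line arguments improvement, the entry point will probably end up looking something like this (a sketch using the flags from the suggested prompt):

```python
import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Find internal linking opportunities from a sitemap.")
    parser.add_argument("--sitemap", required=True, help="Sitemap URL or local file path")
    parser.add_argument("--max-pages", type=int, default=50, help="Maximum number of pages to fetch")
    args = parser.parse_args()
    main(args.sitemap, max_pages=args.max_pages)
```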
9. Practice Exercises
Work through all four phases using the prompts in Section 5:
- Create your keywords.csv with 8–12 real or realistic keywords
- Run Phase 1 and confirm you can parse a live sitemap
- Run Phase 2 and confirm you can extract text and links from a page
- Run Phase 3 and confirm opportunities are being found
- Run Phase 4 and open the HTML report in your browser
- If any phase fails, paste the error message into Claude Code and let it fix it
Use the finished tool on a live client site (with permission):
- Update keywords.csv with real target keywords and pages for the client
- Run the script with max_pages=30 to start — check how long it takes
- Open the HTML report and review the findings manually — do they look accurate?
- Flag any false positives (keyword appears in an irrelevant context) to Claude Code and ask it to improve the matching logic
Pick any one improvement from the table above and add it:
- Choose the improvement most useful to your workflow
- Use the suggested prompt as a starting point, but adapt it if needed
- Test the updated script and confirm the improvement works
- If you added CLI arguments, practice running the script with different flags
10. Summary
Key takeaway: The phased approach is the right way to build any non-trivial tool with Claude Code. Each phase is testable and self-contained — if something breaks, you know exactly which phase to look at. Never ask Claude Code to build an entire complex tool in one shot.