Building a simple scraper in Go: concurrency, HTTP and parsing

Tutorial for building a web scraper in Go with net/http, goquery, rate limiting and controlled concurrency. Practical scraping.


BeautifulSoup + requests in Python is faster to write. You can have a working scraper in fifteen lines, with HTML parsing, session handling and CSV export, without breaking a sweat. For one-off scraping, I still use Python. But when I needed to scrape 50,000 pages concurrently, with fine-grained control over connections and retries, and without dragging a virtualenv into production, Go was the option that fit.

Go is not the most comfortable language for quick scraping. That’s a fact. It has neither the Scrapy ecosystem nor the community of extraction tools that Python has. But it has goroutines, compilation to a static binary, a solid standard HTTP library and a concurrency model that doesn’t need asyncio or event loops. For scrapers that will run as services, in containers, processing large volumes, that matters.

What we’re going to build here is a small but real scraper. It makes HTTP requests, parses HTML, extracts structured data, handles errors, respects rate limits and runs with controlled concurrency. If you’re coming from Python and exploring Go, this will give you a concrete example of how the scraping flow translates to this language. If you already know Go, you might find a useful pattern for your own scrapers. For a broader comparison between both languages, I have a dedicated article on Go vs Python.


The HTTP client in Go: net/http

Go has an HTTP client in the standard library that doesn’t need anything else. No external dependencies, no wrappers. net/http is what most HTTP tools in Go use under the hood, including frameworks like Gin or libraries like Resty.

The most basic way to make a GET request:

package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	resp, err := http.Get("https://example.com")
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("Error reading body:", err)
		return
	}

	fmt.Println(string(body))
}

It works, but it has a fundamental problem for scraping: it uses the default HTTP client (http.DefaultClient), which has no timeout. If a server takes ten minutes to respond, your program will wait ten minutes. In a concurrent scraper, that’s a disaster.

The first thing you need is to create your own client with explicit configuration:

client := &http.Client{
	Timeout: 10 * time.Second,
	Transport: &http.Transport{
		MaxIdleConns:        100,
		MaxIdleConnsPerHost: 10,
		IdleConnTimeout:     30 * time.Second,
	},
}

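Timeout is the total request timeout (including body reading). Transport controls the connection pool. MaxIdleConnsPerHost is important for scraping: if you’re making many requests to the same domain, you want to reuse TCP connections instead of opening a new one each time. Keep in mind that a Transport built from scratch like this one does not inherit the settings of http.DefaultTransport (proxy from environment variables, for example); if you need them, start from http.DefaultTransport.(*http.Transport).Clone() and override the fields you care about.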
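To see what Timeout buys you, here is a minimal sketch that points a client at a deliberately slow local server from net/http/httptest. The helper name timedOut and the millisecond values are assumptions made for the illustration, not part of the scraper.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
	"time"
)

// timedOut starts a local test server that waits delayMs milliseconds before
// answering, then requests it with a client whose Timeout is timeoutMs.
// It reports whether the request failed with a timeout error.
func timedOut(delayMs, timeoutMs int) bool {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(time.Duration(delayMs) * time.Millisecond)
		fmt.Fprintln(w, "ok")
	}))
	defer srv.Close()

	client := &http.Client{Timeout: time.Duration(timeoutMs) * time.Millisecond}

	resp, err := client.Get(srv.URL)
	if err != nil {
		// client.Get wraps failures in *url.Error, which exposes Timeout().
		var urlErr *url.Error
		return errors.As(err, &urlErr) && urlErr.Timeout()
	}
	resp.Body.Close()
	return false
}

func main() {
	fmt.Println("slow server, short timeout:", timedOut(300, 50))
	fmt.Println("fast server, long timeout: ", timedOut(0, 2000))
}
```

With http.DefaultClient the first call would simply wait for the server; with an explicit Timeout the worker gets an error it can log or retry.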

For more configurable requests, use http.NewRequest instead of http.Get:

req, err := http.NewRequest("GET", url, nil)
if err != nil {
	return fmt.Errorf("creating request for %s: %w", url, err)
}

req.Header.Set("User-Agent", "MyScraper/1.0 (+https://example.com/bot)")
req.Header.Set("Accept", "text/html")
req.Header.Set("Accept-Language", "en-US,en;q=0.9")

resp, err := client.Do(req)
if err != nil {
	return fmt.Errorf("doing GET %s: %w", url, err)
}
defer resp.Body.Close()

Notice the User-Agent. It’s not optional. If you don’t set one, Go sends its default (Go-http-client/1.1 over HTTP/1.1), and many servers block requests with no User-Agent or with generic ones like that. Identifying yourself is the minimum you should do as a responsible scraper.
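If you want to check what a server actually receives, a small echo server is enough. This is a sketch under assumptions: userAgentSeenByServer is a hypothetical helper, and the server is a local httptest instance speaking HTTP/1.1.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// userAgentSeenByServer sends one GET to a local echo server and returns the
// User-Agent header the server received. An empty custom value leaves the
// header unset, so Go's default is sent.
func userAgentSeenByServer(custom string) (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, r.UserAgent())
	}))
	defer srv.Close()

	req, err := http.NewRequest(http.MethodGet, srv.URL, nil)
	if err != nil {
		return "", fmt.Errorf("creating request: %w", err)
	}
	if custom != "" {
		req.Header.Set("User-Agent", custom)
	}

	resp, err := srv.Client().Do(req)
	if err != nil {
		return "", fmt.Errorf("doing GET %s: %w", srv.URL, err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", fmt.Errorf("reading body: %w", err)
	}
	return string(body), nil
}

func main() {
	for _, ua := range []string{"", "MyScraper/1.0 (+https://example.com/bot)"} {
		seen, err := userAgentSeenByServer(ua)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("set %q -> server saw %q\n", ua, seen)
	}
}
```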


HTML parsing with goquery

Go doesn’t have a BeautifulSoup equivalent in the standard library. It has golang.org/x/net/html for parsing HTML, but its API is low-level and working with it directly is tedious. The library most Go scrapers use is goquery. It’s the jQuery equivalent for Go: CSS selectors, DOM traversal, text and attribute extraction.

Install it with:

go get github.com/PuerkitoBio/goquery

Basic usage:

package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	resp, err := http.Get("https://example.com")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	// Extract the page title
	title := doc.Find("title").Text()
	fmt.Println("Title:", title)

	// Extract all links
	doc.Find("a").Each(func(i int, s *goquery.Selection) {
		href, exists := s.Attr("href")
		if exists {
			fmt.Printf("Link %d: %s -> %s\n", i, s.Text(), href)
		}
	})
}

goquery CSS selectors cover practically everything you need:

// By class
doc.Find(".article-title")

// By ID
doc.Find("#main-content")

// Compound selectors
doc.Find("div.product > h2.name")

// Attributes
doc.Find("a[href^='https']")

// Pseudo-selectors
doc.Find("tr:nth-child(even)")

To extract data, the most common methods are:

// Element text
text := s.Text()

// Attribute
href, exists := s.Attr("href")

// Inner HTML
html, err := s.Html()

// First matching element
first := doc.Find(".item").First()

// Iterate all elements
doc.Find(".item").Each(func(i int, s *goquery.Selection) {
	// ...
})

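If you’re coming from BeautifulSoup, the mental translation is direct. soup.select(".class") is doc.Find(".class"). tag.get_text() is s.Text(). tag["href"] is s.Attr("href"), with one difference: Attr returns two values (the attribute and whether it exists), where BeautifulSoup raises KeyError for a missing attribute.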


Building the scraper: extracting data from a page

Let’s build something concrete. Imagine we want to scrape a fictional news site and extract articles from the main page: title, link, summary and date.

First, we define the data structure:

type Article struct {
	Title   string `json:"title"`
	URL     string `json:"url"`
	Summary string `json:"summary"`
	Date    string `json:"date"`
}

Now, the function that parses a page and extracts articles:

func parseArticles(doc *goquery.Document, baseURL string) []Article {
	var articles []Article

	doc.Find("article.post").Each(func(i int, s *goquery.Selection) {
		title := strings.TrimSpace(s.Find("h2.post-title").Text())
		if title == "" {
			return // Skip elements without a title
		}

		href, exists := s.Find("h2.post-title a").Attr("href")
		if !exists {
			return
		}

		// Resolve relative URLs
		fullURL := resolveURL(baseURL, href)

		summary := strings.TrimSpace(s.Find("p.post-summary").Text())
		date := strings.TrimSpace(s.Find("time").AttrOr("datetime", ""))

		articles = append(articles, Article{
			Title:   title,
			URL:     fullURL,
			Summary: summary,
			Date:    date,
		})
	})

	return articles
}

The resolveURL function converts relative URLs to absolute ones:

func resolveURL(base, ref string) string {
	baseURL, err := url.Parse(base)
	if err != nil {
		return ref
	}

	refURL, err := url.Parse(ref)
	if err != nil {
		return ref
	}

	return baseURL.ResolveReference(refURL).String()
}
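To make the behavior concrete, here is the same resolveURL applied to the kinds of href values you typically find in real pages. The base URL and the references are made-up examples.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveURL converts ref to an absolute URL using base as the starting point.
// If either value cannot be parsed, ref is returned unchanged.
func resolveURL(base, ref string) string {
	baseURL, err := url.Parse(base)
	if err != nil {
		return ref
	}
	refURL, err := url.Parse(ref)
	if err != nil {
		return ref
	}
	return baseURL.ResolveReference(refURL).String()
}

func main() {
	base := "https://example-news.com/page/2"
	refs := []string{
		"/posts/a",                     // root-relative
		"posts/a",                      // relative to the current "directory"
		"../about",                     // parent directory
		"?page=3",                      // query only
		"//cdn.example-news.com/x.png", // scheme-relative
		"https://other.example/b",      // already absolute
	}
	for _, ref := range refs {
		fmt.Printf("%-30s -> %s\n", ref, resolveURL(base, ref))
	}
}
```

The second case is the one that usually surprises: a bare "posts/a" is resolved against /page/, not against the site root.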

And the function that makes the HTTP request and connects everything:

func fetchArticles(client *http.Client, pageURL string) ([]Article, error) {
	req, err := http.NewRequest("GET", pageURL, nil)
	if err != nil {
		return nil, fmt.Errorf("creating request: %w", err)
	}
	req.Header.Set("User-Agent", "GoScraper/1.0")

	resp, err := client.Do(req)
	if err != nil {
		return nil, fmt.Errorf("fetch %s: %w", pageURL, err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("status %d for %s", resp.StatusCode, pageURL)
	}

	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		return nil, fmt.Errorf("parsing HTML from %s: %w", pageURL, err)
	}

	return parseArticles(doc, pageURL), nil
}

Notice the pattern: each error is wrapped with context using %w. This lets you know exactly what failed and where when debugging. If Go’s error handling seems excessive to you, I recommend reading my article on errors in Go where I explain why this verbosity is a real advantage.


Adding concurrency: goroutines and worker pool

Up to here we have a sequential scraper. It works, but if you have 1,000 pages to scrape, it will take forever. This is where Go shines.

The naive approach (don’t do this)

// DON'T do this
for _, url := range urls {
	go func(u string) {
		articles, err := fetchArticles(client, u)
		// ...
	}(url)
}

Launching a goroutine per URL without control will cause you to fire 1,000 simultaneous requests. The server will block you, you’ll exhaust file descriptors and your scraper will blow up. It’s the equivalent of opening a thousand browser tabs at once.

Worker pool: controlled concurrency

The correct pattern is a worker pool. A fixed number of goroutines (workers) process URLs from a shared channel. This gives you real but controlled concurrency. If you want to dig deeper into this pattern, I have a dedicated article on worker pools in Go.

func scrapeWithWorkers(client *http.Client, urls []string, numWorkers int) []Article {
	var (
		mu      sync.Mutex
		results []Article
		wg      sync.WaitGroup
	)

	jobs := make(chan string, len(urls))

	// Launch workers
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func(workerID int) {
			defer wg.Done()
			for url := range jobs {
				articles, err := fetchArticles(client, url)
				if err != nil {
					log.Printf("[Worker %d] Error scraping %s: %v", workerID, url, err)
					continue
				}

				mu.Lock()
				results = append(results, articles...)
				mu.Unlock()

				log.Printf("[Worker %d] OK: %s (%d articles)", workerID, url, len(articles))
			}
		}(i)
	}

	// Send URLs to the channel
	for _, u := range urls {
		jobs <- u
	}
	close(jobs)

	// Wait for all workers to finish
	wg.Wait()

	return results
}

Let’s break down what happens:

  1. jobs channel: acts as a work queue. Workers read from this channel.
  2. sync.WaitGroup: lets us wait for all workers to finish.
  3. sync.Mutex: protects the results slice from concurrent writes. Without this, you’d have a race condition.
  4. range jobs: each worker reads URLs from the channel until it closes. This is idiomatic in Go.

With numWorkers = 10, you have ten goroutines processing URLs in parallel. If a request takes 2 seconds, instead of taking 2,000 seconds for 1,000 URLs, it takes around 200 seconds. Real concurrency without asyncio, without callbacks, without promises.
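You can verify the "controlled" part without touching the network. This sketch replaces the HTTP call with a short sleep and records how many jobs are in flight at once; runPool and the 5ms sleep are assumptions for the demonstration.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// runPool pushes numJobs fake jobs through numWorkers goroutines and returns
// how many jobs were processed and the highest number that ran at once.
func runPool(numJobs, numWorkers int) (processed, peak int) {
	var (
		mu       sync.Mutex
		inFlight int
		wg       sync.WaitGroup
	)

	jobs := make(chan int, numJobs)

	for w := 0; w < numWorkers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range jobs {
				mu.Lock()
				inFlight++
				if inFlight > peak {
					peak = inFlight
				}
				mu.Unlock()

				time.Sleep(5 * time.Millisecond) // stand-in for fetchArticles

				mu.Lock()
				inFlight--
				processed++
				mu.Unlock()
			}
		}()
	}

	for i := 0; i < numJobs; i++ {
		jobs <- i
	}
	close(jobs)
	wg.Wait()

	return processed, peak
}

func main() {
	processed, peak := runPool(50, 5)
	fmt.Printf("processed %d jobs, at most %d at the same time\n", processed, peak)
}
```

However many jobs you queue, the peak never exceeds the number of workers. That is the property the naive goroutine-per-URL version lacks.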

For finer control, you can add context in Go to cancel scraping if something goes wrong:

func scrapeWithContext(ctx context.Context, client *http.Client, urls []string, numWorkers int) ([]Article, error) {
	var (
		mu      sync.Mutex
		results []Article
		wg      sync.WaitGroup
	)

	jobs := make(chan string, len(urls))

	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func(workerID int) {
			defer wg.Done()
			for url := range jobs {
				select {
				case <-ctx.Done():
					return
				default:
				}

				articles, err := fetchArticles(client, url)
				if err != nil {
					log.Printf("[Worker %d] Error: %v", workerID, err)
					continue
				}

				mu.Lock()
				results = append(results, articles...)
				mu.Unlock()
			}
		}(i)
	}

	for _, u := range urls {
		select {
		case jobs <- u:
		case <-ctx.Done():
			close(jobs)
			wg.Wait()
			return results, ctx.Err()
		}
	}
	close(jobs)
	wg.Wait()

	return results, nil
}

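The select with ctx.Done() lets each worker check whether the context has been cancelled before processing the next URL. If you call cancel() from outside, each worker finishes its current request and then exits cleanly. Requests already in flight are not aborted here, because fetchArticles does not receive the context; for that, build the request with http.NewRequestWithContext, as the complete example at the end does.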
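Cancelling a request that is already running needs the context attached to the request itself. A minimal sketch, assuming a slow local httptest server and a hypothetical helper named cancelledInFlight:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// cancelledInFlight starts a request against a slow local server, cancels the
// context after cancelAfterMs milliseconds, and reports whether client.Do
// returned an error that wraps context.Canceled.
func cancelledInFlight(cancelAfterMs int) bool {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-r.Context().Done(): // the client gave up
		case <-time.After(2 * time.Second): // otherwise answer slowly
			fmt.Fprintln(w, "ok")
		}
	}))
	defer srv.Close()

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	time.AfterFunc(time.Duration(cancelAfterMs)*time.Millisecond, cancel)

	// The context travels with the request, so cancel() aborts it mid-flight.
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, srv.URL, nil)
	if err != nil {
		return false
	}

	resp, err := srv.Client().Do(req)
	if err != nil {
		return errors.Is(err, context.Canceled)
	}
	resp.Body.Close()
	return false
}

func main() {
	fmt.Println("request aborted by cancel():", cancelledInFlight(50))
}
```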


Rate limiting: time.Ticker and semaphore

Having controlled concurrency with a worker pool is not enough. You need rate limiting. Even with only 5 workers, if responses are fast, you can make hundreds of requests per second. That will draw attention from the server and you’ll likely get blocked.

Rate limiting with time.Ticker

time.Ticker emits a value on a channel at regular intervals. You can use it as a rate limiter:

func scrapeWithRateLimit(client *http.Client, urls []string, numWorkers int, requestsPerSecond int) []Article {
	var (
		mu      sync.Mutex
		results []Article
		wg      sync.WaitGroup
	)

	jobs := make(chan string, len(urls))
	ticker := time.NewTicker(time.Second / time.Duration(requestsPerSecond))
	defer ticker.Stop()

	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func(workerID int) {
			defer wg.Done()
			for url := range jobs {
				<-ticker.C // Wait for the next tick

				articles, err := fetchArticles(client, url)
				if err != nil {
					log.Printf("[Worker %d] Error: %v", workerID, err)
					continue
				}

				mu.Lock()
				results = append(results, articles...)
				mu.Unlock()
			}
		}(i)
	}

	for _, u := range urls {
		jobs <- u
	}
	close(jobs)
	wg.Wait()

	return results
}

With requestsPerSecond = 5, the ticker emits a value every 200ms. Each worker must wait for a tick to be available before making its request. This gives you a maximum of 5 requests per second, regardless of how many workers you have.
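The arithmetic is easy to check in isolation. In this sketch the "requests" are no-ops, so the only thing that takes time is waiting for ticks; interval and runPaced are hypothetical helper names.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// interval returns the gap between ticks for a given request rate.
// requestsPerSecond must be positive.
func interval(requestsPerSecond int) time.Duration {
	return time.Second / time.Duration(requestsPerSecond)
}

// runPaced runs n no-op "requests" across numWorkers goroutines that share a
// single ticker, and returns how long the whole batch took.
func runPaced(n, numWorkers, requestsPerSecond int) time.Duration {
	start := time.Now()

	ticker := time.NewTicker(interval(requestsPerSecond))
	defer ticker.Stop()

	jobs := make(chan int, n)
	for i := 0; i < n; i++ {
		jobs <- i
	}
	close(jobs)

	var wg sync.WaitGroup
	for w := 0; w < numWorkers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range jobs {
				<-ticker.C // every worker draws from the same stream of ticks
			}
		}()
	}
	wg.Wait()

	return time.Since(start)
}

func main() {
	fmt.Println("interval at 5 req/s:", interval(5))

	// 10 requests at 100 req/s need 10 ticks of 10ms each, whatever the worker count.
	fmt.Println("10 requests, 4 workers, 100 req/s:", runPaced(10, 4, 100))
}
```

Adding workers does not make the batch finish sooner, because each request consumes one tick from the shared ticker.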

Semaphore with buffered channel

Another option is to use a buffered channel as a semaphore to limit active concurrent requests:

type Scraper struct {
	client    *http.Client
	semaphore chan struct{}
	delay     time.Duration
}

func NewScraper(maxConcurrent int, delay time.Duration) *Scraper {
	return &Scraper{
		client: &http.Client{
			Timeout: 10 * time.Second,
		},
		semaphore: make(chan struct{}, maxConcurrent),
		delay:     delay,
	}
}

func (s *Scraper) Fetch(url string) ([]Article, error) {
	s.semaphore <- struct{}{} // Acquire slot
	defer func() {
		time.Sleep(s.delay) // Delay between requests
		<-s.semaphore       // Release slot
	}()

	return fetchArticles(s.client, url)
}

The semaphore channel has a buffer of size maxConcurrent. When it’s full, the next s.semaphore <- struct{}{} blocks until a worker releases its slot. Combined with time.Sleep(s.delay) after each request, you have control over both concurrency and speed.
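The same kind of measurement works for the semaphore. Here one goroutine is launched per task, which would be the naive approach, except that the buffered channel caps how many run the "request" at once. peakConcurrency and the sleep duration are assumptions for the sketch.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// peakConcurrency launches one goroutine per task, gates them with a buffered
// channel of size maxConcurrent, and returns the highest number of tasks that
// held a slot at the same time.
func peakConcurrency(tasks, maxConcurrent int) int {
	sem := make(chan struct{}, maxConcurrent)

	var (
		mu       sync.Mutex
		inFlight int
		peak     int
		wg       sync.WaitGroup
	)

	for i := 0; i < tasks; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()

			sem <- struct{}{}        // acquire a slot (blocks when all are taken)
			defer func() { <-sem }() // release it when done

			mu.Lock()
			inFlight++
			if inFlight > peak {
				peak = inFlight
			}
			mu.Unlock()

			time.Sleep(2 * time.Millisecond) // stand-in for the HTTP request

			mu.Lock()
			inFlight--
			mu.Unlock()
		}()
	}

	wg.Wait()
	return peak
}

func main() {
	fmt.Println("peak with 40 tasks and 3 slots:", peakConcurrency(40, 3))
}
```

Because the Scraper above sleeps for delay while still holding its slot, each slot completes at most one request per (request time + delay), which is where the speed control comes from.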


Error handling and retries

In scraping, errors are the norm, not the exception. Timeouts, 429 (Too Many Requests), 503 (Service Unavailable), reset connections, malformed HTML. Your scraper has to handle all this without crashing.

Retries with exponential backoff

func fetchWithRetry(client *http.Client, url string, maxRetries int) (*http.Response, error) {
	var lastErr error

	for attempt := 0; attempt <= maxRetries; attempt++ {
		if attempt > 0 {
			backoff := time.Duration(1<<uint(attempt-1)) * time.Second // 1s, 2s, 4s, 8s...
			jitter := time.Duration(rand.Int63n(int64(500 * time.Millisecond)))
			time.Sleep(backoff + jitter)
			log.Printf("Retry %d/%d for %s", attempt, maxRetries, url)
		}

		req, err := http.NewRequest("GET", url, nil)
		if err != nil {
			return nil, fmt.Errorf("creating request: %w", err)
		}
		req.Header.Set("User-Agent", "GoScraper/1.0")

		resp, err := client.Do(req)
		if err != nil {
			lastErr = fmt.Errorf("attempt %d: %w", attempt, err)
			continue
		}

		// Retry on 429 and on any 5xx status (502, 503 and 504 included)
		if resp.StatusCode == http.StatusTooManyRequests || resp.StatusCode >= 500 {
			resp.Body.Close()
			lastErr = fmt.Errorf("attempt %d: status %d", attempt, resp.StatusCode)

			// If there's a Retry-After header in seconds, respect it (the wait is
			// added to the backoff). Skip it on the last attempt: nothing is left to retry.
			if retryAfter := resp.Header.Get("Retry-After"); retryAfter != "" && attempt < maxRetries {
				if seconds, err := strconv.Atoi(retryAfter); err == nil {
					time.Sleep(time.Duration(seconds) * time.Second)
				}
			}
			continue
		}

		return resp, nil
	}

	return nil, fmt.Errorf("exhausted %d retries for %s: %w", maxRetries, url, lastErr)
}

Key points:

  • Exponential backoff: 1s, 2s, 4s, 8s… Each retry waits twice as long as the previous.
  • Jitter: a random component to prevent all workers from retrying at the same time (thundering herd).
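  • Retry-After: if the server tells you how long to wait, listen to it. The header can hold a number of seconds or an HTTP date; the code above only handles the seconds form.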
  • Only retry recoverable errors: a 404 makes no sense to retry. A 429 or 503, yes.
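The backoff schedule and the two forms of Retry-After can be sketched as small pure functions. The names backoff, retryAfter and exampleNow are assumptions for the illustration; http.ParseTime is the standard library function that understands HTTP dates.

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// backoff returns the base wait before retry number attempt (1-based),
// doubling each time: 1s, 2s, 4s, 8s...
func backoff(attempt int) time.Duration {
	return time.Duration(1<<uint(attempt-1)) * time.Second
}

// retryAfter interprets a Retry-After header value relative to now.
// The value may be a number of seconds or an HTTP date. The boolean is
// false when the value is neither.
func retryAfter(value string, now time.Time) (time.Duration, bool) {
	if seconds, err := strconv.Atoi(value); err == nil && seconds >= 0 {
		return time.Duration(seconds) * time.Second, true
	}
	if when, err := http.ParseTime(value); err == nil {
		if wait := when.Sub(now); wait > 0 {
			return wait, true
		}
		return 0, true // the date is already in the past
	}
	return 0, false
}

// exampleNow is a fixed reference instant so the date example is reproducible.
var exampleNow = time.Date(2015, time.October, 21, 7, 28, 0, 0, time.UTC)

func main() {
	for attempt := 1; attempt <= 4; attempt++ {
		fmt.Printf("retry %d waits %v plus jitter\n", attempt, backoff(attempt))
	}
	for _, v := range []string{"120", "Wed, 21 Oct 2015 07:30:00 GMT", "soon"} {
		wait, ok := retryAfter(v, exampleNow)
		fmt.Printf("%q -> %v (valid: %t)\n", v, wait, ok)
	}
}
```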

Classifying errors

Not all errors deserve the same treatment:

func isRetryable(statusCode int) bool {
	switch statusCode {
	case http.StatusTooManyRequests,     // 429
		http.StatusServiceUnavailable,   // 503
		http.StatusBadGateway,           // 502
		http.StatusGatewayTimeout:       // 504
		return true
	default:
		return statusCode >= 500
	}
}

func isSkippable(statusCode int) bool {
	switch statusCode {
	case http.StatusNotFound,   // 404
		http.StatusForbidden,   // 403
		http.StatusGone:        // 410
		return true
	default:
		return false
	}
}

In the worker, you use this to decide what to do:

if isSkippable(resp.StatusCode) {
	log.Printf("Skipping %s: status %d", url, resp.StatusCode)
	continue
}
if isRetryable(resp.StatusCode) {
	// Retry with backoff
}
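Put together, the two helpers amount to a small decision table. The sketch below adds a hypothetical decide function that returns the action as a string, only to make the classification easy to print and check.

```go
package main

import (
	"fmt"
	"net/http"
)

// isRetryable reports whether a status is worth retrying: 429 and any 5xx.
func isRetryable(statusCode int) bool {
	return statusCode == http.StatusTooManyRequests || statusCode >= 500
}

// isSkippable reports whether retrying cannot help: 403, 404 and 410.
func isSkippable(statusCode int) bool {
	switch statusCode {
	case http.StatusNotFound, http.StatusForbidden, http.StatusGone:
		return true
	default:
		return false
	}
}

// decide maps a status code to what the worker should do with the response.
func decide(statusCode int) string {
	switch {
	case statusCode == http.StatusOK:
		return "parse"
	case isSkippable(statusCode):
		return "skip"
	case isRetryable(statusCode):
		return "retry"
	default:
		return "fail"
	}
}

func main() {
	for _, code := range []int{200, 400, 403, 404, 410, 429, 500, 502, 503, 504} {
		fmt.Printf("%d %-22s -> %s\n", code, http.StatusText(code), decide(code))
	}
}
```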

Saving results: JSON output

For a simple scraper, JSON is the most practical format. Easy to generate, easy to consume, easy to inspect.

Writing results to a file

func saveResults(articles []Article, filename string) error {
	file, err := os.Create(filename)
	if err != nil {
		return fmt.Errorf("creating file %s: %w", filename, err)
	}
	defer file.Close()

	encoder := json.NewEncoder(file)
	encoder.SetIndent("", "  ")

	if err := encoder.Encode(articles); err != nil {
		return fmt.Errorf("writing JSON: %w", err)
	}

	return nil
}

Incremental writing with JSON Lines

If the scraper is going to run for hours, you don’t want to accumulate everything in memory and write at the end. Use JSON Lines (one JSON object per line):

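func newResultWriter(filename string) (*ResultWriter, error) {
	// O_APPEND keeps earlier results if the scraper is restarted
	file, err := os.OpenFile(filename, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0644)
	if err != nil {
		return nil, fmt.Errorf("opening %s: %w", filename, err)
	}

	return &ResultWriter{
		file:    file,
		encoder: json.NewEncoder(file),
		// mu needs no initialization: the zero value of sync.Mutex is ready to use
	}, nil
}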

type ResultWriter struct {
	file    *os.File
	encoder *json.Encoder
	mu      sync.Mutex
}

func (w *ResultWriter) Write(article Article) error {
	w.mu.Lock()
	defer w.mu.Unlock()
	return w.encoder.Encode(article)
}

func (w *ResultWriter) Close() error {
	return w.file.Close()
}

With sync.Mutex, multiple workers can write to the file safely. Each Encode writes a complete line, so if the scraper crashes midway, you don’t lose already written data.
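Reading the file back is just as simple: one json.Unmarshal per line. This sketch writes to an in-memory buffer instead of a file so it runs anywhere; lineWriter, readArticles and roundTrip are hypothetical names, and the Article type is trimmed to two fields.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"sync"
)

type Article struct {
	Title string `json:"title"`
	URL   string `json:"url"`
}

// lineWriter serializes concurrent writes to a single JSON Lines stream.
type lineWriter struct {
	mu  sync.Mutex
	enc *json.Encoder
}

func (w *lineWriter) Write(a Article) error {
	w.mu.Lock()
	defer w.mu.Unlock()
	return w.enc.Encode(a) // Encode appends the trailing newline itself
}

// readArticles decodes one Article per line. bufio.Scanner rejects lines
// longer than 64KB by default, which is plenty for records like these.
func readArticles(r io.Reader) ([]Article, error) {
	var out []Article
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		var a Article
		if err := json.Unmarshal(sc.Bytes(), &a); err != nil {
			return nil, fmt.Errorf("decoding line %d: %w", len(out)+1, err)
		}
		out = append(out, a)
	}
	return out, sc.Err()
}

// roundTrip writes n articles from n goroutines and reads them back,
// returning how many complete records were recovered.
func roundTrip(n int) (int, error) {
	var buf bytes.Buffer
	w := &lineWriter{enc: json.NewEncoder(&buf)}

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// Writes to a bytes.Buffer cannot fail, so the error is ignored here.
			_ = w.Write(Article{
				Title: fmt.Sprintf("Post %d", i),
				URL:   fmt.Sprintf("https://example-news.com/post/%d", i),
			})
		}(i)
	}
	wg.Wait()

	articles, err := readArticles(&buf)
	return len(articles), err
}

func main() {
	n, err := roundTrip(100)
	fmt.Println("records read back:", n, "error:", err)
}
```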


Respecting robots.txt and being a good citizen

Just because you can scrape a site doesn’t mean you should do it without consideration. There are basic rules every scraper should follow.

Checking robots.txt

import "github.com/temoto/robotstxt"

func checkRobotsTxt(client *http.Client, siteURL, userAgent string) (*robotstxt.Group, error) {
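	// Trim a trailing slash so we never request "//robots.txt"
	robotsURL := strings.TrimRight(siteURL, "/") + "/robots.txt"

	resp, err := client.Get(robotsURL)
	if err != nil {
		return nil, fmt.Errorf("fetching robots.txt: %w", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		// Any non-200 status is treated as "no robots.txt" here, so everything
		// is allowed. A stricter scraper would back off on 5xx responses.
		return nil, nil
	}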

	robots, err := robotstxt.FromResponse(resp)
	if err != nil {
		return nil, fmt.Errorf("parsing robots.txt: %w", err)
	}

	return robots.FindGroup(userAgent), nil
}

// Before scraping a URL
func canFetch(group *robotstxt.Group, path string) bool {
	if group == nil {
		return true
	}
	return group.Test(path)
}

Use it before each request:

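parsedURL, err := url.Parse(targetURL)
if err != nil {
	// url.Parse returns a nil *url.URL on error, so never ignore it
	log.Printf("Invalid URL %s: %v", targetURL, err)
	continue
}
if !canFetch(robotsGroup, parsedURL.Path) {
	log.Printf("Blocked by robots.txt: %s", targetURL)
	continue
}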
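If your input is a list of page URLs rather than a site root, you can derive the robots.txt location from each page with net/url instead of concatenating strings. A minimal sketch with two hypothetical helpers, robotsURL and pathForRobots; the second one returns path plus query, on the assumption that you want rules matched against the query string as well.

```go
package main

import (
	"fmt"
	"net/url"
)

// robotsURL returns the robots.txt location for the site that serves pageURL.
func robotsURL(pageURL string) (string, error) {
	u, err := url.Parse(pageURL)
	if err != nil {
		return "", fmt.Errorf("parsing %q: %w", pageURL, err)
	}
	if u.Scheme == "" || u.Host == "" {
		return "", fmt.Errorf("%q is not an absolute URL", pageURL)
	}
	robots := url.URL{Scheme: u.Scheme, Host: u.Host, Path: "/robots.txt"}
	return robots.String(), nil
}

// pathForRobots returns the path and query of pageURL, the part that
// robots.txt rules are matched against. An empty path becomes "/".
func pathForRobots(pageURL string) (string, error) {
	u, err := url.Parse(pageURL)
	if err != nil {
		return "", fmt.Errorf("parsing %q: %w", pageURL, err)
	}
	return u.RequestURI(), nil
}

func main() {
	for _, page := range []string{
		"https://example.com/blog/page/1/",
		"https://example.com:8443",
		"https://example.com/search?q=go",
	} {
		robots, err := robotsURL(page)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		path, _ := pathForRobots(page) // cannot fail: robotsURL already parsed it
		fmt.Printf("%s\n  robots: %s\n  test:   %s\n", page, robots, path)
	}
}
```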

General best practices

Beyond robots.txt, there are principles you should follow:

  1. Identify yourself: Use a descriptive User-Agent. Include a contact URL.
  2. Always rate limit: Maximum 1-2 requests per second to the same domain, unless you know the server can handle more.
  3. Respect Retry-After: If the server tells you to wait, wait.
  4. Don’t scrape protected content: If there’s login, CAPTCHA or terms of use that prohibit it, don’t do it.
  5. Cache: If you already have a page downloaded, don’t request it again.
  6. Timing: If you can choose, scrape during off-peak hours.

This isn’t just ethics. It’s pragmatism. A scraper that behaves well lasts longer without being blocked.


Complete working example

Here’s the complete scraper, putting together everything we’ve seen. This code is functional: you can copy it, adjust the CSS selectors and run it.

package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"
	"math/rand"
	"net/http"
	"net/url"
	"os"
	"strconv"
	"strings"
	"sync"
	"time"

	"github.com/PuerkitoBio/goquery"
)

// --- Types ---

type Article struct {
	Title   string `json:"title"`
	URL     string `json:"url"`
	Summary string `json:"summary"`
	Date    string `json:"date"`
}

type ScraperConfig struct {
	MaxWorkers        int
	RequestsPerSecond int
	MaxRetries        int
	Timeout           time.Duration
	UserAgent         string
}

// --- HTTP client ---

func newHTTPClient(cfg ScraperConfig) *http.Client {
	return &http.Client{
		Timeout: cfg.Timeout,
		Transport: &http.Transport{
			MaxIdleConns:        100,
			MaxIdleConnsPerHost: 10,
			IdleConnTimeout:     30 * time.Second,
		},
	}
}

// --- Request with retries ---

func fetchWithRetry(ctx context.Context, client *http.Client, url string, userAgent string, maxRetries int) (*http.Response, error) {
	var lastErr error

	for attempt := 0; attempt <= maxRetries; attempt++ {
		if attempt > 0 {
			backoff := time.Duration(1<<uint(attempt-1)) * time.Second
			jitter := time.Duration(rand.Int63n(int64(500 * time.Millisecond)))

			select {
			case <-ctx.Done():
				return nil, ctx.Err()
			case <-time.After(backoff + jitter):
			}
			log.Printf("Retry %d/%d for %s", attempt, maxRetries, url)
		}

		req, err := http.NewRequestWithContext(ctx, "GET", url, nil)
		if err != nil {
			return nil, fmt.Errorf("creating request: %w", err)
		}
		req.Header.Set("User-Agent", userAgent)
		req.Header.Set("Accept", "text/html")

		resp, err := client.Do(req)
		if err != nil {
			lastErr = fmt.Errorf("attempt %d: %w", attempt, err)
			continue
		}

		if resp.StatusCode == http.StatusTooManyRequests ||
			resp.StatusCode >= 500 {
			resp.Body.Close()
			lastErr = fmt.Errorf("attempt %d: status %d", attempt, resp.StatusCode)

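			if retryAfter := resp.Header.Get("Retry-After"); retryAfter != "" && attempt < maxRetries {
				if seconds, err := strconv.Atoi(retryAfter); err == nil {
					// Honor Retry-After, but stop waiting if the context is cancelled
					select {
					case <-ctx.Done():
						return nil, ctx.Err()
					case <-time.After(time.Duration(seconds) * time.Second):
					}
				}
			}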
			continue
		}

		return resp, nil
	}

	return nil, fmt.Errorf("exhausted %d retries for %s: %w", maxRetries, url, lastErr)
}

// --- Parsing ---

func resolveURL(base, ref string) string {
	baseURL, err := url.Parse(base)
	if err != nil {
		return ref
	}
	refURL, err := url.Parse(ref)
	if err != nil {
		return ref
	}
	return baseURL.ResolveReference(refURL).String()
}

func parseArticles(doc *goquery.Document, baseURL string) []Article {
	var articles []Article

	doc.Find("article.post").Each(func(i int, s *goquery.Selection) {
		title := strings.TrimSpace(s.Find("h2.post-title").Text())
		if title == "" {
			return
		}

		href, exists := s.Find("h2.post-title a").Attr("href")
		if !exists {
			return
		}

		articles = append(articles, Article{
			Title:   title,
			URL:     resolveURL(baseURL, href),
			Summary: strings.TrimSpace(s.Find("p.post-summary").Text()),
			Date:    strings.TrimSpace(s.Find("time").AttrOr("datetime", "")),
		})
	})

	return articles
}

func fetchArticles(ctx context.Context, client *http.Client, pageURL string, cfg ScraperConfig) ([]Article, error) {
	resp, err := fetchWithRetry(ctx, client, pageURL, cfg.UserAgent, cfg.MaxRetries)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("status %d for %s", resp.StatusCode, pageURL)
	}

	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		return nil, fmt.Errorf("parsing HTML from %s: %w", pageURL, err)
	}

	return parseArticles(doc, pageURL), nil
}

// --- Worker pool with rate limiting ---

func scrape(ctx context.Context, cfg ScraperConfig, urls []string) ([]Article, error) {
	client := newHTTPClient(cfg)

	var (
		mu      sync.Mutex
		results []Article
		wg      sync.WaitGroup
	)

	jobs := make(chan string, len(urls))
	ticker := time.NewTicker(time.Second / time.Duration(cfg.RequestsPerSecond))
	defer ticker.Stop()

	// Launch workers
	for i := 0; i < cfg.MaxWorkers; i++ {
		wg.Add(1)
		go func(workerID int) {
			defer wg.Done()
			for pageURL := range jobs {
				// Check cancellation
				select {
				case <-ctx.Done():
					return
				default:
				}

				// Rate limiting
				<-ticker.C

				articles, err := fetchArticles(ctx, client, pageURL, cfg)
				if err != nil {
					log.Printf("[Worker %d] Error scraping %s: %v", workerID, pageURL, err)
					continue
				}

				mu.Lock()
				results = append(results, articles...)
				mu.Unlock()

				log.Printf("[Worker %d] OK: %s (%d articles)", workerID, pageURL, len(articles))
			}
		}(i)
	}

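	// Send URLs
sendLoop:
	for _, u := range urls {
		select {
		case jobs <- u:
		case <-ctx.Done():
			// A bare break would only leave the select, not the for loop
			break sendLoop
		}
	}
	close(jobs)
	wg.Wait()

	// ctx.Err() is nil unless the context was cancelled or timed out
	return results, ctx.Err()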
}

// --- Save results ---

func saveResults(articles []Article, filename string) error {
	file, err := os.Create(filename)
	if err != nil {
		return fmt.Errorf("creating file: %w", err)
	}
	defer file.Close()

	encoder := json.NewEncoder(file)
	encoder.SetIndent("", "  ")
	return encoder.Encode(articles)
}

// --- Main ---

func main() {
	cfg := ScraperConfig{
		MaxWorkers:        5,
		RequestsPerSecond: 2,
		MaxRetries:        3,
		Timeout:           10 * time.Second,
		UserAgent:         "GoScraper/1.0 (+https://example.com/bot)",
	}

	// URLs to scrape (adjust to your case)
	urls := []string{
		"https://example-news.com/page/1",
		"https://example-news.com/page/2",
		"https://example-news.com/page/3",
		"https://example-news.com/page/4",
		"https://example-news.com/page/5",
	}

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()

	log.Printf("Starting scraping of %d pages with %d workers", len(urls), cfg.MaxWorkers)

	articles, err := scrape(ctx, cfg, urls)
	if err != nil {
		log.Fatalf("Error in scraping: %v", err)
	}

	log.Printf("Total articles extracted: %d", len(articles))

	if err := saveResults(articles, "results.json"); err != nil {
		log.Fatalf("Error saving results: %v", err)
	}

	log.Println("Results saved to results.json")
}

To run it:

go mod init scraper
go mod tidy
go run main.go

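go mod tidy will download goquery and its dependencies automatically. The binary produced by go build can be copied to any server without installing a Go runtime. Because the program imports net, the default build on Linux may link against the system C library; build with CGO_ENABLED=0 go build if you want a fully static executable.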


When Python is still better for scraping

It would be dishonest to finish without this. Go has clear advantages for scraping at scale, but Python is still the best option in many scenarios:

Python wins when:

  • You prototype quickly: You want to see if a scraper is viable. BeautifulSoup + requests + a Jupyter notebook. In ten minutes you have data. In Go you spend half an hour setting up the project, defining structs and handling errors.
  • You need Scrapy: Scrapy is a complete scraping framework with middlewares, pipelines, cookie handling, automatic throttling, export to multiple formats and a huge community. Go has nothing comparable.
  • JavaScript rendering: If the site loads content with JavaScript, you need a headless browser. Python has Playwright and Selenium with mature bindings. Go has chromedp, which works but is less ergonomic.
  • One-shot scripts: A scraper you’re going to run once to extract data doesn’t need to be compiled. Python with a virtualenv is fine.
  • Data/ML teams: If the team that will maintain the scraper works in Python and the data goes to a pandas/sklearn pipeline, adding Go to the equation adds little.

Go wins when:

  • High volume: Thousands or tens of thousands of pages. Go’s native concurrency and low memory usage make a difference.
  • Scraper as a service: If the scraper is going to run continuously in a container, a 10MB static binary is better than a Python container with dependencies.
  • Backend teams: If the team already works in Go, it doesn’t make sense to introduce Python just for a scraper.
  • Performance matters: HTML parsing in Go (goquery uses the golang.org/x/net/html parser) is generally faster than BeautifulSoup, which builds its tree in Python objects.
  • Clean deployment: One binary. No runtime, no virtualenv, no pip version conflicts.

The question isn’t “which language is better for scraping”. It’s “what do I need in this specific case”. For a broader comparison, check the Go vs Python article.


From quick script to production tool

We’ve built a Go scraper from scratch that covers the fundamental aspects: configured HTTP client, HTML parsing with goquery, structured data extraction, concurrency with worker pool, rate limiting with time.Ticker, retries with exponential backoff, JSON output and robots.txt compliance.

The patterns we’ve used are the same ones you’ll find in production tools. The worker pool with channels is the standard Go concurrency pattern. Error handling with wrapping is idiomatic. Rate limiting with Ticker is a simple, common way to control speed.

Go is not the fastest option for putting together a quick scraper. But when you need a scraper that runs in production, that handles concurrency without pain, that deploys as a static binary and that scales without dragging dependencies, it makes sense. Especially if you’re already working in Go for the rest of your backend.

The complete code in this article is a starting point. Adapt it to your case: change the CSS selectors, adjust the number of workers, add database persistence instead of JSON, integrate metrics with Prometheus. The base structure is the same.

OshyTech

Copyright 2026 OshyTech. All Rights Reserved