In a galaxy not so far away, web developers, data scientists, and Mandalorian fans are on a quest to gather valuable data from the vast expanse of the internet. In this action-packed guide, we will explore the best practices for web scraping, using Mandalorian-inspired wisdom to help you navigate the challenges and emerge victorious.
Just as a Mandalorian follows a strict code of honor, a skilled web scraper must respect the website's terms of service and adhere to the rules specified in the robots.txt file. This sacred text, located in the root directory of a website, guides your scraping endeavors, specifying which parts of the website are off-limits to automated access. Disregarding these rules is akin to removing your Mandalorian helmet in public – a serious breach of etiquette. Among all web scraping best practices, this one is the most important.
To view a website's robots.txt file, simply open your browser and navigate to the site's root URL with /robots.txt appended (for example, https://www.example.com/robots.txt).
If the website has a robots.txt file, it will now be displayed in your browser. Its contents tell web crawlers (like search engine bots) which sections of the website they are allowed or disallowed to access.
Keep in mind that not all websites have a robots.txt file, and if the file does not exist, you will likely see a 404 error page or a similar error message.
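If you'd rather check these rules programmatically before each crawl, Python's standard library ships with urllib.robotparser. Here is a minimal sketch, using the hypothetical site https://www.example.com and user agent MyScraperBot as placeholders:

from urllib import robotparser

# Point the parser at the site's robots.txt file and download it
parser = robotparser.RobotFileParser()
parser.set_url('https://www.example.com/robots.txt')
parser.read()

# Check whether a given URL may be fetched by your crawler's user agent
if parser.can_fetch('MyScraperBot', 'https://www.example.com/some/page'):
    print('Scraping this page is allowed.')
else:
    print('This page is off-limits to automated access.')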
As a stealthy Mandalorian warrior, you must minimize your footprint when infiltrating enemy territory. Similarly, web scrapers should limit the rate of their requests to avoid overwhelming the server and potentially causing a disturbance in the Force (i.e., the website crashing). Implementing techniques such as rate limiting, using caching, and avoiding repeated requests will keep you in the shadows and prevent detection.
Below is an example using Python and the requests library that demonstrates how to implement rate limiting and responsibly manage requests when web scraping. This example assumes you have a basic understanding of Python and have the requests library installed.
Step 1: Install the requests library if you haven't already:
pip install requests
Step 2: Create a Python script and import the necessary libraries.
import requests
import time
from random import randint
Step 3: Define a function to make requests with rate limiting
def get_web_page(url, delay_min=1, delay_max=3):
    # Add a random delay between requests to avoid overwhelming the server
    time.sleep(randint(delay_min, delay_max))

    # Make the request and handle exceptions
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        else:
            print(f"Error {response.status_code}: Unable to fetch {url}")
            return None
    except requests.exceptions.RequestException as e:
        print(f"RequestException: {e}")
        return None
Step 4: Use the function to scrape a list of URLs
urls_to_scrape = [
    "https://www.forloop.ai/blog/web-scraping",
    "https://www.forloop.ai/blog/external-data-privacy",
    "https://www.forloop.ai/blog/external-data",
    # ... add more URLs as needed
]

for url in urls_to_scrape:
    html_content = get_web_page(url)
    if html_content:
        # Process the HTML content as needed
        pass
This example demonstrates how to make web scraping requests with rate limiting by adding a random delay between requests using the time.sleep() function. The get_web_page() function takes optional delay_min and delay_max parameters to control the delay range. It then fetches the web page content using the requests library and returns the HTML content if the request is successful. If an error occurs, it prints the error and returns None. You can adjust the delay range or the list of URLs to scrape as needed.
With this approach, you can ensure that your web scraper is making requests at a responsible rate, reducing the likelihood of overwhelming the server or being detected. Responsible request pacing is one of the most important web scraping best practices.
Just as a Mandalorian relies on their arsenal of weapons and gadgets, a successful web scraper must utilize the proper libraries and tools. Python offers a variety of powerful web scraping tools, such as BeautifulSoup, Requests, and Selenium, each with its unique advantages. Equip yourself with the right weapons for the job, and you'll be ready to face any challenge.
The ever-present threat of JavaScript lurks in the shadows, making web scraping a more complex endeavor. When a website uses JavaScript to load dynamic content, traditional web scraping methods may fall short. Fear not, Mandalorian! With tools like Selenium or Puppeteer at your side, you can interact with JavaScript and capture the elusive data you seek.
As an example, we will compare BeautifulSoup and Selenium by scraping the price of a Mandalorian figure.
To scrape the price of the figure using Python and Beautiful Soup, you'll first need to install the requests and beautifulsoup4 libraries if you haven't already:
pip install requests beautifulsoup4
Next, you can create a Python script to fetch the page content and extract the price using Beautiful Soup:
import requests
from bs4 import BeautifulSoup
url = "https://www.sideshow.com/collectibles/star-wars-the-mandalorian-and-grogu-deluxe-version-hot-toys-908289"
response = requests.get(url)
html_content = response.text
soup = BeautifulSoup(html_content, 'html.parser')
price_container = soup.find('span', {'class': 'price-box--regular-price'})
if price_container:
    price = price_container.get_text()
    print(f"The price of the figure is: {price}")
else:
    print("Price not found.")
This script sends a request to the specified URL, parses the HTML content with Beautiful Soup, and searches for the span tag with the price-box--regular-price class, which contains the price information. If the price container is found, it prints the figure's price.
To extract the price using Selenium, you'll need to install the selenium library and a compatible web driver (e.g., ChromeDriver for Google Chrome or GeckoDriver for Mozilla Firefox). This example will use ChromeDriver. To set up the driver, follow the installation tutorial that suits your operating system and web browser; for Ubuntu and Chrome, you can follow this tutorial.
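The library itself can usually be installed with pip:

pip install selenium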
Then, create a Python script:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

url = "https://www.sideshow.com/collectibles/star-wars-the-mandalorian-and-grogu-deluxe-version-hot-toys-908289"

chrome_options = Options()
# Uncomment the following line to run Chrome in headless mode
# chrome_options.add_argument("--headless")

# Selenium 4 takes the driver path via a Service object
# instead of the old executable_path argument
driver = webdriver.Chrome(service=Service('/path/to/chromedriver'), options=chrome_options)
driver.get(url)

try:
    # find_element raises an exception if no match is found,
    # so the failure case is handled by the except block
    price_element = driver.find_element(By.CLASS_NAME, 'price-box--regular-price')
    price = price_element.text
    print(f"The price of the figure is: {price}")
except Exception as e:
    print(f"Error: {e}")
finally:
    driver.quit()
Replace /path/to/chromedriver with the actual path to your downloaded ChromeDriver executable. This script will open a Chrome browser window, navigate to the specified URL, and locate the span element with the price-box--regular-price class. If the price element is found, it prints the figure's price.
If you want to run the script in headless mode (without opening the browser window), uncomment the line chrome_options.add_argument("--headless").
So far we have covered four scraping best practices; let's jump to the next one.
Once you've successfully extracted the coveted data, it's crucial to store and organize it in a structured format, such as JSON, CSV, or a database. This will make it easier to analyze, visualize, and use the data in your future missions. Remember, a well-organized Mandalorian is a successful Mandalorian.
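As an illustration, here is a minimal sketch that stores scraped results with Python's built-in json and csv modules; the scraped_items list and its values are hypothetical stand-ins for whatever your scraper collected:

import csv
import json

# Hypothetical results collected by your scraper
scraped_items = [
    {'name': 'The Mandalorian and Grogu', 'price': 575.0},
    {'name': 'Boba Fett', 'price': 285.0},
]

# Store the data as JSON
with open('bounties.json', 'w') as f:
    json.dump(scraped_items, f, indent=2)

# Store the same data as CSV
with open('bounties.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'price'])
    writer.writeheader()
    writer.writerows(scraped_items)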
As a Mandalorian, you take pride in delivering your bounties in pristine condition. In the realm of web scraping, this means ensuring the accuracy and integrity of your data. By implementing data validation, cleaning, and preprocessing techniques, you can maintain high-quality data that will serve you well in your future endeavors.
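For example, a scraped price usually arrives as raw text cluttered with whitespace, currency symbols, and thousands separators. Here is a minimal sketch of a hypothetical clean_price() helper that validates and normalizes such values:

def clean_price(raw_price):
    # Return the price as a float, or None if the value cannot be parsed
    if not raw_price:
        return None
    # Strip whitespace, the currency symbol, and thousands separators
    cleaned = raw_price.strip().lstrip('$').replace(',', '')
    try:
        return float(cleaned)
    except ValueError:
        print(f"Invalid price value: {raw_price!r}")
        return None

print(clean_price('  $575.00 '))  # 575.0
print(clean_price('N/A'))         # prints a warning and returns None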
Websites evolve and change, much like the galaxy itself. A skilled Mandalorian web scraper must be prepared to adapt, modifying their scraping scripts when faced with updated website structures, altered elements, or new security measures. Stay vigilant, and always be ready to adjust your strategy to stay ahead of the game.
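One simple way to build in this resilience is to try a list of known selectors in order, so a renamed class fails loudly instead of silently returning nothing. Below is a minimal sketch using Beautiful Soup; the older class name is a hypothetical example:

from bs4 import BeautifulSoup

def find_price(soup: BeautifulSoup):
    # Try the current selector first, then fall back to older variants
    candidate_classes = [
        'price-box--regular-price',  # current layout
        'product-price',             # hypothetical older layout
    ]
    for css_class in candidate_classes:
        element = soup.find('span', {'class': css_class})
        if element:
            return element.get_text()
    # No selector matched: the page layout has probably changed
    raise RuntimeError("Price element not found; the page structure may have changed.")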
There are a lot of web scraping best practices, but by following these Mandalorian-inspired ones, you'll be well on your way to becoming a web scraping legend. Remember: "This is the way." Join our Slack channel or check out other tutorials on the Forloop website. Embrace the way of the Mandalorian, and may the Force be with you on your data extraction adventures.