Are you a budding web developer, a savvy data scientist, or a curious technology enthusiast interested in diving into the world of web scraping? If so, this guide is tailored just for you. In this comprehensive tutorial, we'll introduce you to Scrapy, an open-source web crawling framework that will help you navigate web scraping tasks like a pro.
Web scraping, the automated method of extracting large amounts of data from websites, is a crucial skill in today's data-driven world. Whether you're extracting customer reviews for sentiment analysis or mining e-commerce sites for competitive analysis, web scraping has countless applications. One tool that makes this task much more manageable is Scrapy.
Let's begin our journey toward mastering this fast and powerful web scraping tool.
Scrapy is a free and open-source web-crawling framework written in Python. Originally designed for web scraping, it can also extract data using APIs or as a general-purpose web crawler.
A standout feature of Scrapy is its speed. Unlike tools that send a new request only after the previous one has been handled, Scrapy is built on Twisted, an asynchronous networking library, which allows it to handle multiple requests concurrently. This makes it faster and more efficient, especially when dealing with large-scale scraping tasks.
To start using Scrapy, we need to install it. But before that, make sure you have Python and pip installed. Once you've confirmed that, open your terminal or command prompt and type the following command:
pip install scrapy
If the installation is successful, you can confirm by typing:
scrapy version
You should see the installed version displayed. For us it’s 2.9.0.
One of Scrapy's strengths lies in its well-thought-out architecture, which comprises several components working together to scrape web pages, making the tool highly customizable and flexible.
The data flow in Scrapy happens as follows: the engine gets the initial requests from spiders, sends them to the scheduler, and asks for the next request to send to the downloader. Once a page is downloaded, the response is sent to the spider that issued the request to parse it. The parsed items are sent to the item pipeline, and any follow-up requests to the scheduler.
Now that we have Scrapy installed and understand its architecture, it's time to get our hands dirty. The first step is to create a new Scrapy project. Open your terminal or command prompt and navigate to the directory where you want to store your project. Then type:
scrapy startproject forloop
"forloop" is the name of your project. You can choose any name that suits your preference.
Now let's create our first spider. But first, what is a spider? In Scrapy, a spider is a class that defines how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e., follow links) and how to extract structured data from their pages. Essentially, it's where you define your scraping rules.
To create a spider, navigate to the spiders directory in your project folder:
cd forloop/forloop/spiders
Then, you can create a spider using the genspider command followed by the name of the spider and the domain (without www or https) you wish to scrape:
scrapy genspider myspider forloop.ai/blog
This command generates a spider named myspider that will be used to scrape https://www.forloop.ai/blog.
Your spider will look something like this:
import scrapy


class MyspiderSpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['forloop.ai/blog']
    start_urls = ['https://www.forloop.ai/blog']

    def parse(self, response):
        pass
You can define how the spider should download and extract data in the parse method.
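For example, a first version of parse could simply grab the page title and yield it as a dictionary (we'll build a fuller spider shortly):

def parse(self, response):
    # Extract the <title> text from the downloaded page
    yield {'title': response.xpath('//title/text()').get()}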
The Scrapy shell is an interactive shell where you can try and debug your scraping code quickly without running the spider. It's a helpful tool for testing your XPath or CSS expressions to extract data.
To start the Scrapy shell, use the shell command followed by a URL you are interested in:
scrapy shell 'https://www.forloop.ai/blog'
In the shell, you can try extracting data using the response object:
response.xpath('//title/text()').get()
This command will extract the title of the page. In our case, that will be Blog : : Forloop.
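The same data can be extracted with a CSS selector if you prefer that syntax:

response.css('title::text').get()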
To exit the shell, simply type the following command:
quit()
When it comes to working with Scrapy, the most crucial tasks involve extracting data from web pages and storing it in a usable format. Here's how you can do it, along with clear steps on where and how you can run your code.
First, you need to create a spider. A Scrapy spider is a Python class that you define, and it tells Scrapy how to traverse a site and extract the data.
Let's create a simple spider for the blog site forloop.ai/blog by creating a Python file named blog_spider.py inside the spiders directory and putting the following code into it:
import scrapy


class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://www.forloop.ai/blog']

    def parse(self, response):
        items = response.css('div.article-item')  # Select the divs with class "article-item"
        for item in items:
            title = item.css('h4::text').get()  # Extract the title text
            link = item.css('a::attr(href)').get()  # Extract the link
            image = item.css('img::attr(src)').get()  # Extract the image source
            date = item.css('div.blog-post-date::text').get()  # Extract the date
            yield {
                'title': title,
                'link': link,
                'image': image,
                'date': date
            }
This simple BlogSpider will scrape the title, link, image, and date of each post on forloop.ai/blog.
In the parse method above, we're using CSS selectors to extract data. Scrapy uses selectors to pull the data you need out of web pages. In this case, div.article-item is a CSS selector that matches each blog post container on the page.
When the parse method is called, it returns an iterable of Requests and items. This is where the data extraction happens. The line yield {'title': title, 'link': link, 'image': image, 'date': date} returns the extracted fields of each post as an item.
Once the data is extracted, you can store it in a file. By default, Scrapy provides support to export the scraped data in various formats such as JSON, XML, and CSV. To store the scraped data, you need to run the crawl command followed by the spider name and the desired output format. This command needs to be run from the root of your project directory, where the scrapy.cfg file is located.
For example, to store the data in JSON format, open your command line, navigate to your project's root directory, and run:
scrapy crawl blogspider -o result.json
This command will run the spider named 'blogspider', and the -o option specifies the name and format of the file where the scraped data will be stored, in this case, result.json.
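The output format is inferred from the file extension, so you can just as easily export to CSV or JSON Lines; the -O (capital O) variant overwrites the file instead of appending to it:

scrapy crawl blogspider -o result.csv
scrapy crawl blogspider -O result.jl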
And there you have it! You've just scraped your first website using Scrapy and stored the data in a JSON file. Now you can open up result.json in your project directory. The results should look something like this.
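As a rough illustration only (the actual titles, links, and dates will reflect whatever is published on the blog when you run the spider), each entry in result.json follows the structure yielded by the spider:

[
    {"title": "Example post title", "link": "/blog/example-post", "image": "https://example.com/cover.jpg", "date": "June 1, 2023"},
    {"title": "Another post title", "link": "/blog/another-post", "image": "https://example.com/other.jpg", "date": "May 20, 2023"}
]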
Scrapy handles parallel requests inherently. To control the number of concurrent requests, you can modify the CONCURRENT_REQUESTS setting in the Scrapy configuration. For example, CONCURRENT_REQUESTS = 20 allows for 20 simultaneous requests.
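In practice you would set this in your project's settings.py, often alongside a per-domain cap and a small delay so you don't hammer a single site (the values below are only an illustration):

# settings.py
CONCURRENT_REQUESTS = 20            # total concurrent requests across all domains
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # limit the load on any single site
DOWNLOAD_DELAY = 0.5                # seconds to wait between requests to the same domain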
Create your own middleware to handle custom scenarios. For instance, you can design a middleware to rotate user agents. The middleware might look like this:
import random


class RandomUserAgentMiddleware(object):
    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.getlist('USER_AGENTS'))

    def process_request(self, request, spider):
        request.headers.setdefault('User-Agent', random.choice(self.user_agents))
Don't forget to update the USER_AGENTS and DOWNLOADER_MIDDLEWARES settings in your project.
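Assuming the middleware above lives in forloop/middlewares.py, the corresponding settings might look like this (the user-agent strings are just examples):

# settings.py
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15',
]

DOWNLOADER_MIDDLEWARES = {
    'forloop.middlewares.RandomUserAgentMiddleware': 400,
}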
Implement checks in your parsing function to handle potential errors. For instance, to handle missing fields in the item:
def parse(self, response):
    item = MyItem()
    item['field'] = response.css('div.field::text').get(default='Missing')
    return item
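Beyond missing fields, you may also want to handle requests that fail outright; Scrapy lets you attach an errback to a Request for this. A minimal sketch, assuming these methods live inside your spider class (handle_error is just an illustrative name):

import scrapy
from scrapy.spidermiddlewares.httperror import HttpError

def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, callback=self.parse, errback=self.handle_error)

def handle_error(self, failure):
    # Log failures such as DNS errors, timeouts, or non-200 responses
    if failure.check(HttpError):
        self.logger.error('HTTP error on %s', failure.value.response.url)
    else:
        self.logger.error('Request failed: %s', repr(failure))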
Use the CrawlSpider class when you need to follow links and scrape data across an entire site. For example:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'mysite.com'
    allowed_domains = ['mysite.com']
    start_urls = ['http://www.mysite.com']

    rules = (
        Rule(LinkExtractor(), callback='parse_item'),
    )

    def parse_item(self, response):
        # parsing code here
        pass
Utilize the LinkExtractor to follow links on a page. For instance, to follow all links to product pages you could use:
Rule(LinkExtractor(allow='/product/'), callback='parse_item')
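Rules can also be combined. As a sketch (assuming hypothetical /category/ and /product/ URL patterns), you could follow category listings without parsing them while handing every product page to your callback:

rules = (
    # Follow category listing pages, but don't parse them directly
    Rule(LinkExtractor(allow='/category/'), follow=True),
    # Parse every product page discovered along the way
    Rule(LinkExtractor(allow='/product/'), callback='parse_item'),
)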
Use item loaders to simplify the extraction process. They provide input and output processors to clean up your data. For example:
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, MapCompose, Join


def parse(self, response):
    loader = ItemLoader(item=Product(), response=response)
    loader.default_output_processor = TakeFirst()
    loader.add_css('name', 'div.product_name::text')
    loader.add_css('price', 'p.price::text', MapCompose(str.strip, lambda p: p.replace(',', '')))
    loader.add_css('description', 'div.description::text', Join())
    return loader.load_item()
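The Product item used above isn't defined in this snippet; a minimal, hypothetical definition in items.py might be:

import scrapy


class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    description = scrapy.Field()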
To instruct Scrapy to respect the rules in robots.txt, set the ROBOTSTXT_OBEY setting to True in your project's settings.
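In practice this is a single line in settings.py (projects generated with startproject include it by default):

# settings.py
ROBOTSTXT_OBEY = True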
With these practical tips and examples, you'll be more equipped to handle the challenges you might face in your scraping journey with Scrapy.
Scrapy is a powerful and versatile tool for web scraping, providing robust and efficient methods to extract data from the web. Whether you're a beginner starting your journey into web scraping or an experienced developer seeking to enhance your skills, this guide provided a comprehensive introduction to Scrapy's fundamental concepts and practical tips to enhance your scraping techniques.
We walked through the installation process, discussed the architecture of Scrapy, and took a step-by-step journey through creating a Scrapy project and building a spider. We dove into the power of Scrapy shell for testing, explored the extraction and storage of data, and offered valuable tips for best practices.
In addition, we delved into some unofficial tips and tricks for using Scrapy, demonstrating how it can be leveraged for growth hacking and practical development scenarios, with short examples to illustrate the concepts. These tips are intended to inspire and guide you in extracting the most out of this dynamic framework.
Remember, web scraping should always be done responsibly, respecting the website's policies and user privacy. As you continue to experiment with Scrapy, you'll discover that it's a highly customizable tool that can handle complex scraping tasks, and its potential is limited only by your creativity.
We hope this guide has been a useful introduction to Scrapy. Now it's over to you to put these concepts into practice and start your web scraping journey. If you have any questions or want to share your experiences, join our Slack. Keep exploring, keep learning, and most importantly, happy scraping!