⚙️Web Scraping

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

Selenium Python

Selenium Python bindings provide a simple API to write functional/acceptance tests using Selenium WebDriver. Through the Selenium Python API you can access all functionalities of Selenium WebDriver in an intuitive way.

Selenium Python bindings provide a convenient API to access Selenium WebDrivers such as Firefox, IE, Chrome, and Remote. The currently supported Python versions are 3.5 and above.

This documentation explains Selenium 2 WebDriver API. Selenium 1 / Selenium RC API is not covered here.

Installing Python bindings for Selenium

Use pip to install the selenium package. Python 3 has pip available in the standard library. Using pip, you can install selenium like this:

pip install selenium

You may consider using virtualenv to create isolated Python environments. Python 3 has venv which is almost the same as virtualenv.

You can also download the Python bindings for Selenium from the PyPI page for the selenium package and install them manually.

Drivers

Selenium requires a driver to interface with the chosen browser. Firefox, for example, requires geckodriver, which needs to be installed before the examples below can be run. Make sure it is in your PATH, e.g., place it in /usr/bin or /usr/local/bin.

If you skip this step, you will get the error selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH.

Other supported browsers will have their own drivers available. Links to some of the more popular browser drivers follow.

Chrome:

https://sites.google.com/chromium.org/driver/

Edge:

https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/

Firefox:

https://github.com/mozilla/geckodriver/releases

Safari:

https://webkit.org/blog/6900/webdriver-support-in-safari-10/

For more information about driver installation, please refer to the official documentation.

Simple Usage

If you have installed Selenium Python bindings, you can start using it from Python like this.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()
driver.get("http://www.python.org")
assert "Python" in driver.title
# find_element(By.NAME, ...) replaces find_element_by_name, which was removed in Selenium 4.3
elem = driver.find_element(By.NAME, "q")
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)
assert "No results found." not in driver.page_source
driver.close()

Example Explained

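The selenium.webdriver module provides all the WebDriver implementations. Currently supported WebDriver implementations are Firefox, Chrome, IE and Remote. The Keys class provides keys on the keyboard such as RETURN, F1, and ALT. The By class provides the locator strategies (name, id, CSS selector, and so on) that are passed to find_element.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys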

Next, the instance of Firefox WebDriver is created.

driver = webdriver.Firefox()

The driver.get method will navigate to a page given by the URL. WebDriver will wait until the page has fully loaded (that is, the “onload” event has fired) before returning control to your test or script. Be aware that if your page uses a lot of AJAX on load then WebDriver may not know when it has completely loaded:

driver.get("http://www.python.org")

The next line is an assertion that confirms the page title contains the word “Python”:

assert "Python" in driver.title

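WebDriver offers a number of ways to find elements through the find_element method, combined with a locator strategy from the By class (imported from selenium.webdriver.common.by). For example, the input text element can be located by its name attribute using By.NAME. The older find_element_by_* helper methods, such as find_element_by_name, were removed in Selenium 4.3. A detailed explanation of finding elements is available in the Locating Elements chapter:

elem = driver.find_element(By.NAME, "q")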

Next, we send keys, which is similar to typing on your keyboard. Special keys can be sent using the Keys class imported from selenium.webdriver.common.keys. To be safe, we first clear any pre-populated text in the input field (e.g. “Search”) so it doesn’t affect our search results:

elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)

After submission of the page, you should get the result if there is any. To ensure that some results are found, make an assertion:

assert "No results found." not in driver.page_source

Finally, the browser window is closed. You can also call the quit method instead of close. quit exits the entire browser, whereas close closes one tab; if only one tab was open, most browsers exit entirely by default:

driver.close()

https://selenium-python.readthedocs.io/index.html

Beautiful Soup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.

Installing Beautiful Soup

If you’re using a recent version of Debian or Ubuntu Linux, you can install Beautiful Soup with the system package manager:

apt-get install python3-bs4 (for Python 3)

Beautiful Soup 4 is published through PyPI, so if you can’t install it with the system packager, you can install it with easy_install or pip. The package name is beautifulsoup4, and the same package works on Python 2 and Python 3. Make sure you use the right version of pip or easy_install for your Python version (these may be named pip3 and easy_install3 respectively if you’re using Python 3).

easy_install beautifulsoup4

pip install beautifulsoup4

The BeautifulSoup package is not what you want. That’s the previous major release, Beautiful Soup 3. Lots of software uses BS3, so it’s still available, but if you’re writing new code you should install beautifulsoup4.

If you don’t have easy_install or pip installed, you can download the Beautiful Soup 4 source tarball and install it with setup.py.

python setup.py install

If all else fails, the license for Beautiful Soup allows you to package the entire library with your application. You can download the tarball, copy its bs4 directory into your application’s codebase, and use Beautiful Soup without installing it at all.

Example: https://www.mediamarkt.nl/

import requests
from bs4 import BeautifulSoup

url = "https://www.mediamarkt.nl/"
response = requests.get(url)  # fetch the page once and reuse the response

print(response)  # <Response [200]>

html = response.content  # raw page body as bytes
print(html)

<Response [200]>

b'<!DOCTYPE html>
<!--[if lt IE 7]><html class="no-js lt-ie9 lt-ie8 lt-ie7"><![endif]-->
<!--[if IE 7]><html class="no-js lt-ie9 lt-ie8 ie7"><![endif]-->
<!--[if IE 8]><html class="no-js lt-ie9 ie8"><![endif]-->
<!--[if gt IE 8]><!--><html class="no-js"><!--<![endif]-->
<head>
<title>MediaMarkt</title>
<meta charset="utf-8" />
<link rel="canonical" href="https://www.mediamarkt.nl/" />
<meta property="pageTypeId" content="" data-id="home" />
<link rel="alternate" hreflang="nl-NL" href="https://www.mediamarkt.nl/" />
<meta name="description" content="MediaMarkt is de nummer
&eacute;&eacute;n winkelketen voor consumentenelektronica in Europa. Niet alleen
het grootste assortiment onder &eacute;&eacute;n dak, maar ook altijd de nieuwste
en innovatiefste producten." />
<meta name="robots" content="index,follow" />
<meta name="google-site-verification" content="iummHD8QhQJsCT42KhsL1s36YwVG81ZpSLyAVQ06rM"
/>
<script type="text/javascript"
src="/dt/ruxitagentjs_ICA2dfgjmqru_10207210127152629.js" datadtconfig="
app=0fe7eb5fcee005af|cuc=ikduodw0|mel=100000|featureHash=ICA2dfgjmqr
u|lastModification=1612284241790|dtVersion=10207210127152629|tp=500,50,0,1|rdn
t=1|uxrgce=1|uxdcw=1500|vs=2|agentUri=/dt/ruxitagentjs_ICA2dfgjmqru_1020721012
7152629.js|reportUrl=/dt/rb_0ca29162-4b29-4a47-946a-081c13f47ee3|rid=RID_-
1677218119|rpid=-963955369|domain=mediamarkt.nl"></script><link rel="appletouch-
icon" sizes="57x57" href="/apple-touch-icon-57x57.png">
<link rel="icon" type="image/png" href="/android-chrome-192x192.png"
sizes="192x192">
<link rel="icon" type="image/png" href="/favicon-16x16.png" sizes="16x16">
<link rel="icon" type="image/png" href="/favicon-32x32.png" sizes="32x32">
<link rel="manifest" href="/manifest.json">
<link rel="mask-icon" href="/safari-pinned-tab.svg" color="#000000">
<meta name="msapplication-TileColor" content="#000000">
<meta name="msapplication-TileImage" content="/mstile-144x144.png">
<meta property="product-container" content="/nl/productcontainer/products.json"
data-param="catEntryId" datacallback="
mcs.productContainer.initProducts" />
<meta property="agerating" content="" data-cb-showlinks="
mcs.displayMediaPlayerLinks"
data-cb-display-layer="mcs.displayAgeRatingLayer" data-cbopen-
player="mcs.openMediaPlayer" /> ...continue

Installing a Parser

Beautiful Soup supports the HTML parser included in Python’s standard library, but it also supports a number of third-party Python parsers. One is the lxml parser. Depending on your setup, you might install lxml with one of these commands:

apt-get install python-lxml  
easy_install lxml 
pip install lxml

Another alternative is the pure-Python html5lib parser, which parses HTML the way a web browser does. Depending on your setup, you might install html5lib with one of these commands:

apt-get install python-html5lib
easy_install html5lib
pip install html5lib

Making The Soup

To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:

from bs4 import BeautifulSoup

# Parse from an open file handle
with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')

# Or parse directly from a string
soup = BeautifulSoup("<html>a web page</html>", 'html.parser')

First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:

print(BeautifulSoup("<html><head></head><body>Sacr&eacute; bleu!</body></html>",
"html.parser"))
# <html><head></head><body>Sacré bleu!</body></html>

Beautiful Soup then parses the document using the best available parser. It will use an HTML parser unless you specifically tell it to use an XML parser. (See Parsing XML.)

Let's continue our example with https://www.mediamarkt.nl/:

soup=BeautifulSoup(html,"html.parser")
print(soup.prettify())

The prettify() method will turn a Beautiful Soup parse tree into a nicely formatted Unicode string, with a separate line for each tag and each string.

You can call prettify() on the top-level BeautifulSoup object, or on any of its Tag objects. Since it adds whitespace (in the form of newlines), prettify() changes the meaning of an HTML document and should not be used to reformat one.

The goal of prettify() is to help you visually understand the structure of the documents you work with.

soup.prettify()

<!DOCTYPE html>
<!--[if lt IE 7]><html class="no-js lt-ie9 lt-ie8 lt-ie7"><![endif]-->
<!--[if IE 7]><html class="no-js lt-ie9 lt-ie8 ie7"><![endif]-->
<!--[if IE 8]><html class="no-js lt-ie9 ie8"><![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js">
<!--<![endif]-->
<head>
<title>
MediaMarkt
</title>
<meta charset="utf-8"/>
<link href="https://www.mediamarkt.nl/" rel="canonical"/>
<meta content="" data-id="home" property="pageTypeId"/>
<link href="https://www.mediamarkt.nl/" hreflang="nl-NL" rel="alternate"/>
<meta content="MediaMarkt is de nummer één winkelketen voor consumentenelekt
ronica in Europa. Niet alleen het grootste assortiment onder
één dak, maar ook altijd de nieuwste en innovatiefste producten." name="descri
ption"/>
<meta content="index,follow" name="robots"/>
<meta content="ium-mHD8QhQJsCT42KhsL1s36YwVG81ZpSLyAVQ06rM" name="googlesite-
verification"/>
<script datadtconfig="
app=0fe7eb5fcee005af|cuc=ikduodw0|mel=100000|featureHash=ICA2dfgjmqr
u|lastModification=1612284241790|dtVersion=10207210127152629|tp=500,50,0,1|rdn
t=1|uxrgce=1|uxdcw=1500|vs=2|agentUri=/dt/ruxitagentjs_ICA2dfgjmqru_1020721012
7152629.js|reportUrl=/dt/rb_0ca29162-4b29-4a47-946a-081c13f47ee3|rid=RID_-
1677218119|rpid=209441690|domain=mediamarkt.nl" src="/dt/ruxitagentjs_ICA2dfgj
mqru_10207210127152629.js" type="text/javascript">
</script>
<link href="/apple-touch-icon-57x57.png" rel="apple-touchicon"
sizes="57x57"/>
<link href="/apple-touch-icon-60x60.png" rel="apple-touchicon"
sizes="60x60"/>
<link href="/apple-touch-icon-72x72.png" rel="apple-touchicon"
sizes="72x72"/>
<link href="/apple-touch-icon-76x76.png" rel="apple-touchicon"
sizes="76x76"/>
<link href="/apple-touch-icon-114x114.png" rel="apple-touchicon"
sizes="114x114"/>
<link href="/apple-touch-icon-120x120.png" rel="apple-touchicon"
sizes="120x120"/>
<link href="/apple-touch-icon-144x144.png" rel="apple-touchicon"
sizes="144x144"/>
<link href="/apple-touch-icon-152x152.png" rel="apple-touchicon"
sizes="152x152"/>….
….
….
…
</span>
Klantenservice
</a>
</li>
<li class="ms-list__item" data-identifier="meta-navigation-prio1link">
<a class="ms-link ms-header2__meta-nav-link" datatracking="
MediaMarkt Club" href="//www.mediamarkt.nl/nl/shop/mediamarkt-clubinformatie.
html" ontouchstart="">
<span class="ms-icon ms-icon--type_star ms-text--icon--default">
</span>
MediaMarkt Club
</a>
</li>
<li class="ms-list__item ms-header2__meta-nav-list-item" dataidentifier="
marketSelectorPrio1">
<div class="ms-market-selector ms-market-selector--meta-nav">
<span class="ms-dropdown ms-market-selector__dropdown">
<!-- market selector dropdown trigger -->
<span class="ms-dropdown__trigger ms-market-selector__dropdowntrigger
ms-market-selector__dropdown-trigger--meta-nav">
<a class="ms-market-selector__button ms-market-selector__button--
meta-nav" data-identifier="ms-market-selector__button--meta-nav">
<span class="ms-market-selector__button-label ms-marketselector__
button-label--meta-nav" data-identifier="ms-market-selector__buttonlabel--
meta-nav">
<span>
Zoek winkel
</span>
</span>
</a>
<button class="ms-market-selector__panel-toggle ms-marketselector__
panel-toggle--meta-nav">
<span class="ms-market-selector__panel-toggle-icon ms-marketselector__
panel-toggle-icon--met

print(soup.title)
# <title>MediaMarkt</title>

print(soup.title.string)
# MediaMarkt

print(soup.title.text)
# MediaMarkt

For example, let's try to find the "Categorieën" heading.

print(soup.find_all("h4"))

[<h4 class="ms-link-list__title ms-text--medium" data-toggle="link-list">Categorieën</h4>,
<h4 class="ms-link-list__title ms-text--medium" data-toggle="link-list">Bekijk al onze merken</h4>,
<h4 class="ms-link-list__title ms-text--medium" data-toggle="link-list">Over MediaMarkt</h4>,
<h4 class="ms-link-list__title ms-text--medium" data-toggle="link-list">Nieuws</h4>,
<h4 class="ms-link-list__title ms-text--medium" data-toggle="link-list">Klantenservice</h4>,
<h4 class="ms-link-list__title ms-text--medium" data-toggle="link-list">Services</h4>,
<h4 class="ms-link-list__title ms-text--medium" data-toggle="link-list">Nieuwsbrief</h4>,
<h4 class="ms-link-list__title ms-text--medium" data-toggle="link-list">MediaMarkt-app</h4>,
<h4 class="ms-link-list__title ms-text--medium" data-toggle="link-list">Volg ons</h4>,
<h4 class="ms-link-list__title ms-text--medium" data-toggle="link-list">Veilig winkelen</h4>]

<h4 class="ms-link-list__title ms-text--medium ms-link-list__title--active" data-toggle="link-list">
Categorieën
</h4>

Let's write a more specific query that also filters on the class and the data-toggle attribute:

print(soup.find_all("h4", class_="ms-link-list__title ms-text--medium", attrs={"data-toggle": "link-list"}))

[<h4 class="ms-link-list__title ms-text--medium" data-toggle="link-list">Categorieën</h4>,
<h4 class="ms-link-list__title ms-text--medium" data-toggle="link-list">Bekijk al onze merken</h4>,
<h4 class="ms-link-list__title ms-text--medium" data-toggle="link-list">Over MediaMarkt</h4>,
<h4 class="ms-link-list__title ms-text--medium" data-toggle="link-list">Nieuws</h4>,
<h4 class="ms-link-list__title ms-text--medium" data-toggle="link-list">Klantenservice</h4>,
<h4 class="ms-link-list__title ms-text--medium" data-toggle="link-list">Services</h4>,
<h4 class="ms-link-list__title ms-text--medium" data-toggle="link-list">Nieuwsbrief</h4>,
<h4 class="ms-link-list__title ms-text--medium" data-toggle="link-list">MediaMarkt-app</h4>,
<h4 class="ms-link-list__title ms-text--medium" data-toggle="link-list">Volg ons</h4>,
<h4 class="ms-link-list__title ms-text--medium" data-toggle="link-list">Veilig winkelen</h4>]

To print only the text of each heading, store the result and loop over it:

ctg = soup.find_all("h4", class_="ms-link-list__title ms-text--medium", attrs={"data-toggle": "link-list"})

for i in ctg:
    print(i.text)

Categorieën
Bekijk al onze merken
Over MediaMarkt
Nieuws
Klantenservice
Services
Nieuwsbrief
MediaMarkt-app
Volg ons
Veilig winkelen

From here on, you can work with the results using ordinary Python. For example:

print(ctg[0].text)
# Categorieën

For "Winkels":

print(soup.find(attrs={"data-tracking": "Winkels"}).text.strip())
# Winkels

For "Cadeaukaart":

print(soup.find("a", class_="ms-link", attrs={"data-tracking": "Cadeaukaart"}).text.strip())
# Cadeaukaart
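
Because the live MediaMarkt page changes over time, the same calls can be tried on a small inline document. The sketch below is only an illustration: the HTML fragment is made up to imitate the structure shown above, and the variable names are assumptions rather than part of the original example.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the headings and links shown above.
html = """
<html><head><title>MediaMarkt</title></head>
<body>
  <h4 class="ms-link-list__title ms-text--medium" data-toggle="link-list">Categorieën</h4>
  <h4 class="ms-link-list__title ms-text--medium" data-toggle="link-list">Nieuws</h4>
  <h4 class="other">Ignored</h4>
  <a class="ms-link" data-tracking="Winkels"> Winkels </a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# The text inside the <title> tag.
title = soup.title.string

# find_all() returns every match; class_ and attrs narrow the search.
headings = [
    h4.text
    for h4 in soup.find_all(
        "h4",
        class_="ms-link-list__title ms-text--medium",
        attrs={"data-toggle": "link-list"},
    )
]

# find() returns only the first match; strip() removes surrounding whitespace.
link_text = soup.find("a", attrs={"data-tracking": "Winkels"}).text.strip()

print(title)
print(headings)
print(link_text)
```

Working against a fixed string like this makes it easier to check a query before pointing it at a real page.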

Scrapy

Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.

Even though Scrapy was originally designed for web scraping, it can also be used to extract data using APIs (such as Amazon Associates Web Services) or as a general purpose web crawler.

Analyzing robots.txt

Most publishers allow programmers to crawl their websites to some extent. Put differently, publishers want only specific portions of their websites to be crawled. To express this, a website states rules for which portions may be crawled and which may not. These rules are defined in a file called robots.txt.

robots.txt is a human-readable file that identifies which portions of the website crawlers are allowed to scrape and which they are not. Its core syntax is standardized as the Robots Exclusion Protocol (RFC 9309), although publishers can add rules and nonstandard directives to suit their needs. We can check the robots.txt file of a website by appending /robots.txt to its root URL. For example, to check it for Google.com, we open https://www.google.com/robots.txt and get something like the following:

User-agent: *
Disallow: /search
Allow: /search/about
Allow: /search/static
Allow: /search/howsearchworks
Disallow: /sdch
Disallow: /groups
Disallow: /index.html?
Disallow: /?
Allow: /?hl=
Disallow: /?hl=*&
Allow: /?hl=*&gws_rd=ssl$
...

Some of the most common rules that are defined in a website’s robots.txt file are as follows:

User-agent: BadCrawler 
Disallow: /

This rule asks any crawler that identifies itself with the BadCrawler user agent not to crawl the website at all.

User-agent: *
Crawl-delay: 5
Disallow: /trap

This rule asks all user agents to wait 5 seconds between download requests, to avoid overloading the server. The /trap path is disallowed and acts as a trap: a crawler that requests it anyway reveals that it ignores robots.txt, and the site can then block it. Publishers can define many more rules as their requirements demand.
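
The rules above can also be checked programmatically. The following is a minimal sketch using Python's standard urllib.robotparser module; the rule text is copied from the two examples above, while the example.com URLs and the GoodBot user agent are hypothetical names chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# Rules taken from the two robots.txt examples above.
rules = """\
User-agent: BadCrawler
Disallow: /

User-agent: *
Crawl-delay: 5
Disallow: /trap
"""

parser = RobotFileParser()
# parse() accepts the file content as a list of lines, so no network access is needed.
parser.parse(rules.splitlines())

# BadCrawler is banned from the whole site.
bad_allowed = parser.can_fetch("BadCrawler", "https://example.com/")

# Any other agent (here the hypothetical "GoodBot") falls under the "*" rules.
trap_allowed = parser.can_fetch("GoodBot", "https://example.com/trap")
page_allowed = parser.can_fetch("GoodBot", "https://example.com/index.html")
delay = parser.crawl_delay("GoodBot")

print(bad_allowed, trap_allowed, page_allowed, delay)
```

For a live site, set_url() followed by read() downloads the file instead of parse(). Scrapy performs a comparable check itself when the ROBOTSTXT_OBEY setting is True.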

Installing Scrapy

Scrapy requires Python 3.6+, either the CPython implementation (default) or the PyPy 7.2.0+ implementation (see Alternate Implementations).

If you’re using Anaconda or Miniconda, you can install the package from the conda-forge channel, which has up-to-date packages for Linux, Windows and macOS. To install Scrapy using conda, run:

conda install -c conda-forge scrapy

Alternatively, if you’re already familiar with installation of Python packages, you can install Scrapy and its dependencies from PyPI with:

pip install Scrapy

Note that sometimes this may require solving compilation issues for some Scrapy dependencies depending on your operating system, so be sure to check the platform specific installation notes.

The minimal versions which Scrapy is tested against are: Twisted 14.0, lxml 3.4, pyOpenSSL 0.14.

Using a virtual environment (recommended)

We recommend installing Scrapy inside a virtual environment on all platforms.

Python packages can be installed either globally (a.k.a system wide), or in user-space. We do not recommend installing Scrapy system wide.

Instead, we recommend that you install Scrapy within a so-called “virtual environment” (venv). Virtual environments allow you to not conflict with already-installed Python system packages (which could break some of your system tools and scripts), and still install packages normally with pip (without sudo and the likes).

Once you have created a virtual environment, you can install Scrapy inside it with pip, just like any other Python package. (See platform-specific guides below for non-Python dependencies that you may need to install beforehand).

Platform specific installation notes

Windows

Though it’s possible to install Scrapy on Windows using pip, we recommend installing Anaconda or Miniconda and using the package from the conda-forge channel, which will avoid most installation issues.

Once you’ve installed Anaconda or Miniconda, install Scrapy with:

conda install -c conda-forge scrapy

Ubuntu 14.04 or above

Scrapy is currently tested with recent-enough versions of lxml, twisted and pyOpenSSL, and is compatible with recent Ubuntu distributions. But it should support older versions of Ubuntu too, like Ubuntu 14.04, albeit with potential issues with TLS connections.

Don’t use the python-scrapy package provided by Ubuntu. It is typically too old and slow to catch up with the latest Scrapy.

To install Scrapy on Ubuntu (or Ubuntu-based) systems, you need to install these dependencies:

sudo apt-get install python3 python3-dev python3-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev

Inside a virtualenv, you can install Scrapy with pip after that:

pip install scrapy

The same non-Python dependencies can be used to install Scrapy in Debian Jessie (8.0) and above.

macOS

Building Scrapy’s dependencies requires the presence of a C compiler and development headers. On macOS this is typically provided by Apple’s Xcode development tools. To install the Xcode command line tools open a terminal window and run:

xcode-select --install

(Optional) Install Scrapy inside a Python virtual environment.

This is good practice for managing dependencies on any platform. Once the Xcode command line tools are in place, you should be able to install Scrapy:

pip install Scrapy

Creating a Project

Before you start scraping, you will have to set up a new Scrapy project. Enter a directory where you’d like to store your code and run:

scrapy startproject projectname

This will create a projectname directory with the following contents:

projectname/
    scrapy.cfg            # deploy configuration file
    projectname/          # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # folder where you'll later put the spiders
            __init__.py

Our First Spider

Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass Spider and define the initial requests to make, optionally how to follow links in the pages, and how to parse the downloaded page content to extract data.

This is the code for our first Spider. Save it in a file named example_spider.py under the example/spiders directory in your project:

import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"  # unique spider name, used by "scrapy crawl example"

    def start_requests(self):
        urls = [
            'https://www.bol.com/nl/l/boeken/N/8299/',
            'https://www.kitapyurdu.com/kategori/kitap-edebiyat/128.html',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Use the second-to-last URL segment as part of the file name
        page = response.url.split("/")[-2]
        filename = f'example-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')

As you can see, our Spider subclasses scrapy.Spider and defines some attributes and methods:

name: Identifies the Spider. It must be unique within a project, that is, you can’t set the same name for different Spiders.

start_requests(): Must return an iterable of Requests (you can return a list of requests or write a generator function) which the Spider will begin to crawl from. Subsequent requests will be generated successively from these initial requests.

parse(): A method that will be called to handle the response downloaded for each of the requests made. The response parameter is an instance of TextResponse that holds the page content and has further helpful methods to handle it.

The parse() method usually parses the response, extracting the scraped data as dicts and also finding new URLs to follow and creating new requests (Request) from them.
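
To see which file names parse() produces, the URL-splitting step can be tried outside Scrapy. This is a small sketch of that one step only; page_filename is a hypothetical helper name introduced here, not part of the spider.

```python
def page_filename(url: str) -> str:
    """Mirror the naming step in parse(): use the second-to-last URL segment."""
    page = url.split("/")[-2]
    return f"example-{page}.html"


urls = [
    "https://www.bol.com/nl/l/boeken/N/8299/",
    "https://www.kitapyurdu.com/kategori/kitap-edebiyat/128.html",
]

filenames = [page_filename(url) for url in urls]
print(filenames)
# ['example-8299.html', 'example-kitap-edebiyat.html']
```

The first URL ends with a slash, so its last segment is empty and the second-to-last is 8299. The second URL ends with 128.html, so the second-to-last segment is kitap-edebiyat.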

How to Run Our Spider

To put our spider to work, go to the project’s top level directory and run:

..\example> scrapy crawl example

This command runs the spider named example that we’ve just added, which sends requests to the bol.com and kitapyurdu.com domains. You will get output similar to this:

2021-01-12 12:01:21 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: example)
2021-01-12 12:01:21 [scrapy.utils.log] INFO: Versions: lxml 4.6.2.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0,
Python 3.7.4 (tags/v3.7.4:e09359112e, Jul 8 2019, 20:34:20) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1i 8 Dec
2020), cryptography 3.3.1, Platform Windows-10-10.0.18362-SP0
2021-01-12 12:01:21 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-01-12 12:01:21 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'example',
'NEWSPIDER_MODULE': 'example.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['example.spiders']}
…..
2021-01-12 12:01:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.bol.com/robots.txt> (referer: None)
2021-01-12 12:01:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.kitapyurdu.com/robots.txt> (referer: None)
2021-01-12 12:01:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.bol.com/nl/l/boeken/N/8299/> (referer: None)
2021-01-12 12:01:23 [example] DEBUG: Saved file example-8299.html
2021-01-12 12:01:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.kitapyurdu.com/kategori/kitap-edebiyat/128.html>
(referer: None)
2021-01-12 12:01:23 [example] DEBUG: Saved file example-kitap-edebiyat.html
2021-01-12 12:01:23 [scrapy.core.engine] INFO: Closing spider (finished)
2021-01-12 12:01:23 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 929,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 4,
'downloader/response_bytes': 113064,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 1, 12, 11, 1, 23, 939458),
'log_count/DEBUG': 6,
'log_count/INFO': 10,
'response_received_count': 4,
'robotstxt/request_count': 2,
'robotstxt/response_count': 2,
'robotstxt/response_status_count/200': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2021, 1, 12, 11, 1, 22, 740037)}
2021-01-12 12:01:24 [scrapy.core.engine] INFO: Spider closed (finished)
/example>

Now, check the files in the current directory. You should notice that two new files have been created: example-8299.html and example-kitap-edebiyat.html, with the content for the respective URLs, as our parse method instructs.

If you are wondering why we haven’t parsed the HTML yet, hold on, we will cover that soon.

example-8299.html

<!DOCTYPE html>
<html class="no-js is-desktop"
lang="nl-NL">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1 user-scalable=no">
<meta name="format-detection" content="telephone=no">
<link rel="apple-touch-icon" sizes="180x180" href="/apple-touch-icon.png?v=476aOAdO8j">
<link rel="icon" type="image/png" sizes="32x32" href="/favicon-32x32.png?v=476aOAdO8j">
<link rel="icon" type="image/png" sizes="16x16" href="/favicon-16x16.png?v=476aOAdO8j">
<link rel="manifest" href="/site.webmanifest?v=476aOAdO8j">
<link rel="mask-icon" href="/safari-pinned-tab.svg?v=476aOAdO8j" color="#0000a4">
<link rel="shortcut icon" href="/favicon.ico?v=476aOAdO8j">
<meta name="apple-mobile-web-app-title" content="Bol.com">
<meta name="application-name" content="Bol.com">
<meta name="msapplication-TileColor" content="#0000a4">
<meta name="theme-color" content="#0000a4">
<link rel="canonical" href="https://www.bol.com/nl/l/boeken/N/8299/">
<link rel="alternate" hreflang="nl-BE" href="https://www.bol.com/be/l/boeken/N/8299/">
<link rel="alternate" hreflang="nl-NL" href="https://www.bol.com/nl/l/boeken/N/8299/">

example-kitap-edebiyat.html

<!DOCTYPE html>
<html dir="ltr" lang="tr">
<head>
<!-- Google Tag Manager -->
<script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
'https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);
})(window,document,'script','dataLayer','GTM-NHPLZ5F');</script>
<!-- End Google Tag Manager -->
<meta charset="UTF-8" />
<meta name="robots" content="noarchive" />
<meta name="robots" content="index,follow" />
<title> Edebiyat Kitapları - En Yeni ve En Çok Satan Edebiyat Kitapları - kitapyurdu.com</title>
<base href="https://www.kitapyurdu.com/" />
<!--[if IE]><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><![endif]-->
<meta name="google-play-app" content="app-id=com.mobisoft.kitapyurdu">
<meta name="apple-itunes-app" content="app-id=489855982">
<meta name="msvalidate.01" content="DB7793DB335E4894B3A42C1CC528CBD4" />
<meta name="description" content="Edebiyat kategorisinde çok satan yeni çıkan ve tüm kitaplara hızlıca
ulaşıp satın alabilirsiniz." />
<meta name="keywords" content="Edebiyat,Edebiyat kategorisine ait kitaplar, Edebiyat kitapları, son çık
an Edebiyat kitapları, en ucuz Edebiyat, en yeni Edebiyat ürünleri, en cok satan Edebiyat kitapları,
Edebiyat kitapları yorumları" />
<link href="https://img.kitapyurdu.com/v1/getImage/fn:11194713/wh:cde76d960" rel="shortcut icon" />
<link rel="manifest" href="manifest.json" />
<link rel="stylesheet" type="text/css" href="catalog/view/theme/default/stylesheet/stylesheet.min.css?4
.2020.12.01-d10" />
<link rel="stylesheet" type="text/css" href="catalog/view/theme/default/stylesheet/jBox.css?4.2020.12.0
1-d10" />
<link rel="stylesheet" type="text/css" href="catalog/view/javascript/jquery/rating/jquery.rating.css?4.
2020.12.01-d10" media="screen" />
<link rel="stylesheet" type="text/css" href="catalog/view/javascript/jquery/ui-1.12.1/jqueryui.
min.css" />
<link href="https://maxcdn.bootstrapcdn.com/font-awesome/4.7.0/css/fontawesome.
min.css" rel="stylesheet"/>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js"></script>

What just happened under the hood?

Scrapy schedules the scrapy.Request objects returned by the start_requests method of the Spider. Upon receiving a response for each one, it instantiates Response objects and calls the callback method associated with the request (in this case, the parse method) passing the response as argument.

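A shortcut to the start_requests method: instead of implementing start_requests(), you can define a start_urls class attribute with a list of URLs. The default implementation of start_requests() creates the initial requests from this list, and parse() is used as the default callback: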

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://www.bol.com/nl/l/boeken/N/8299/',
        'https://www.kitapyurdu.com/kategori/kitap-edebiyat/128.html',
        ]

Extracting Data

With Scrapy shell:

Using the shell, you can try selecting elements using CSS with the response object:

Find Bol.com’s title;

CSS code:

XPath:

Besides CSS, Scrapy selectors also support using XPath expressions:

XPath expressions are very powerful, and are the foundation of Scrapy Selectors. In fact, CSS selectors are converted to XPath under-the-hood. You can see that if you read closely the text representation of the selector objects in the shell.

While perhaps not as popular as CSS selectors, XPath expressions offer more power because, besides navigating the structure, they can also look at the content. Using XPath, you can select things like the link that contains the text “Next Page”. This makes XPath very well suited to scraping, and we encourage you to learn XPath even if you already know how to construct CSS selectors; it will make scraping much easier.

Example-1: Find Book Names on bol.com

Look for titles in example-8299.html. Where are they?

<div class="product-title--inline"> <a class="product-title">De spin die veel scheetjes laat</a> </div>

All titles are in the same place.

Get the text information:

response.css("div.classname a.classname::text").get() or getall()

response.css("div.product-title--inline a.product-title::text").get()

Example-2: Find Book Names, Author Names and Price on https://www.kitapyurdu.com/kategori/kitap-edebiyat/128.html

Look at the website. You can see book names, author names and price details on screen. If you view the page source (or inspect the page) in your browser, you can see the HTML behind each item.

Type this on your terminal:

scrapy shell "https://www.kitapyurdu.com/kategori/kitap-edebiyat/128.html"

You can view the result and interact with the console:

You can find book names, author names and price details within the <li class="mg-b-10">...</li> block, so type the following in the console:

>>> response.css("li.mg-b-10")

>>> products = response.css("li.mg-b-10")
>>> for i in products:
...     print("Name= ", i.css("span ::text").get())
...     print("Author= ", i.css("div.author.compact.ellipsis a ::text").get())
...     print("Price= ", i.css("div.price-new span.value ::text").get())

Extracting Data in our Spider

Let’s get back to our spider. Until now, it doesn’t extract any data in particular, just saves the whole HTML page to a local file. Let’s integrate the extraction logic above into our spider.

A Scrapy spider typically generates many dictionaries containing the data extracted from the page. To do that, we use the yield Python keyword in the callback, as you can see below:

import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://www.kitapyurdu.com/kategori/kitap-edebiyat/128.html"]

    def parse(self, response):
        # Each <li class="mg-b-10"> block holds one book
        products = response.css("li.mg-b-10")
        for product in products:
            price = product.css("div.price-new span.value ::text").extract()
            author = product.css("div.author.compact.ellipsis a ::text").extract()
            book_name = product.css("div.name.ellipsis span ::text").extract()

            # Yield one dict per book; Scrapy collects these as scraped items
            yield {"price": price, "author": author, "book_name": book_name}

In your terminal, type the following command and press Enter:

scrapy crawl example -o example.json

After the command finishes, all scraped data is written to the file example.json.
