⚙️Web Scraping
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
Selenium Python
Selenium Python bindings provide a simple API to write functional/acceptance tests using Selenium WebDriver. Through the Selenium Python API you can access all functionalities of Selenium WebDriver in an intuitive way.
Selenium Python bindings provide a convenient API to access Selenium WebDrivers such as Firefox, IE, Chrome, Remote, etc. The currently supported Python versions are 3.5 and above.
This documentation explains Selenium 2 WebDriver API. Selenium 1 / Selenium RC API is not covered here.
1.2. Installing Python bindings for Selenium
Use pip to install the selenium package. Python 3 has pip available in the standard library. Using pip, you can install selenium like this:
pip install selenium
You may consider using virtualenv to create isolated Python environments. Python 3 has venv which is almost the same as virtualenv.
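If you go the isolated-environment route, a minimal sketch on Linux/macOS might look like this (the directory name `.venv` is just a common convention, not a requirement):

```shell
# Create and activate an isolated environment, then install selenium into it
python3 -m venv .venv
source .venv/bin/activate    # on Windows: .venv\Scripts\activate
pip install selenium
```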
You can also download the Python bindings for Selenium from the PyPI page for the selenium package and install them manually.
Drivers
Selenium requires a driver to interface with the chosen browser. Firefox, for example, requires geckodriver, which needs to be installed before the examples below can be run. Make sure it's in your PATH, e.g., place it in /usr/bin or /usr/local/bin.
Failure to observe this step will give you the error selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH.
Other supported browsers will have their own drivers available. Links to some of the more popular browser drivers follow.
For more information about driver installation, please refer to the official documentation.
Simple Usage
If you have installed Selenium Python bindings, you can start using it from Python like this.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("http://www.python.org")
assert "Python" in driver.title
elem = driver.find_element_by_name("q")
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)
assert "No results found." not in driver.page_source
driver.close()
Example Explained
The selenium.webdriver module provides all the WebDriver implementations. Currently supported WebDriver implementations are Firefox, Chrome, IE and Remote. The Keys class provides keys on the keyboard such as RETURN, F1, ALT, etc.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
Next, an instance of Firefox WebDriver is created.
driver = webdriver.Firefox()
The driver.get method will navigate to a page given by the URL. WebDriver will wait until the page has fully loaded (that is, the “onload” event has fired) before returning control to your test or script. Be aware that if your page uses a lot of AJAX on load then WebDriver may not know when it has completely loaded:
driver.get("http://www.python.org")
The next line is an assertion to confirm that the title contains the word "Python":
assert "Python" in driver.title
WebDriver offers a number of ways to find elements using one of the find_element_by_* methods. For example, the input text element can be located by its name attribute using find_element_by_name method. A detailed explanation of finding elements is available in the Locating Elements chapter:
elem = driver.find_element_by_name("q")
Next, we send keys; this is similar to entering keys using your keyboard. Special keys can be sent using the Keys class imported from selenium.webdriver.common.keys. To be safe, we first clear any pre-populated text in the input field (e.g. "Search") so it doesn't affect our search results:
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)
After the page is submitted, you should get the results, if there are any. To ensure that some results are found, make an assertion:
assert "No results found." not in driver.page_source
Finally, the browser window is closed. You can also call the quit method instead of close. quit will exit the entire browser, whereas close closes one tab; however, if just one tab is open, most browsers will exit entirely by default:
driver.close()
https://selenium-python.readthedocs.io/index.html
Beautiful Soup
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.
Installing Beautiful Soup
If you’re using a recent version of Debian or Ubuntu Linux, you can install Beautiful Soup with the system package manager:
apt-get install python3-bs4 (for Python 3)
Beautiful Soup 4 is published through PyPI, so if you can’t install it with the system packager, you can install it with easy_install or pip. The package name is beautifulsoup4, and the same package works on Python 2 and Python 3. Make sure you use the right version of pip or easy_install for your Python version (these may be named pip3 and easy_install3 respectively if you’re using Python 3).
easy_install beautifulsoup4
pip install beautifulsoup4
If you don’t have easy_install or pip installed, you can download the Beautiful Soup 4 source tarball and install it with setup.py.
python setup.py install
If all else fails, the license for Beautiful Soup allows you to package the entire library with your application. You can download the tarball, copy its bs4 directory into your application’s codebase, and use Beautiful Soup without installing it at all.
Example: https://www.mediamarkt.nl/
import requests
from bs4 import BeautifulSoup

url = "https://www.mediamarkt.nl/"
html = requests.get(url)
print(html)  # <Response [200]>
html = requests.get(url).content
print(html)
Installing a Parser
Beautiful Soup supports the HTML parser included in Python’s standard library, but it also supports a number of third-party Python parsers. One is the lxml parser. Depending on your setup, you might install lxml with one of these commands:
apt-get install python-lxml
easy_install lxml
pip install lxml
Another alternative is the pure-Python html5lib parser, which parses HTML the way a web browser does. Depending on your setup, you might install html5lib with one of these commands:
apt-get install python-html5lib
easy_install html5lib
pip install html5lib

Making The Soup
To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:
from bs4 import BeautifulSoup
with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')

soup = BeautifulSoup("<html>a web page</html>", 'html.parser')
First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:
print(BeautifulSoup("<html><head></head><body>Sacré bleu!</body></html>",
"html.parser"))
# <html><head></head><body>Sacré bleu!</body></html>
Beautiful Soup then parses the document using the best available parser. It will use an HTML parser unless you specifically tell it to use an XML parser. (See Parsing XML.)
Let's continue our example: https://www.mediamarkt.nl/
soup = BeautifulSoup(html, "html.parser")
print(soup.prettify())
The prettify() method turns a Beautiful Soup parse tree into a nicely formatted Unicode string, with a separate line for each tag and each string.
You can call prettify() on the top-level BeautifulSoup object, or on any of its Tag objects. Since it adds whitespace (in the form of newlines), prettify() changes the meaning of an HTML document and should not be used to reformat one.
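As a quick, self-contained illustration, here is prettify() on a made-up snippet (the markup and URL below are placeholders, not MediaMarkt's):

```python
from bs4 import BeautifulSoup

# A tiny made-up document, just to show the line-per-tag output of prettify()
snippet = "<html><body><a href='https://example.com'>link</a></body></html>"
soup = BeautifulSoup(snippet, "html.parser")
pretty = soup.prettify()
print(pretty)  # each tag and each string on its own indented line
```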
<!DOCTYPE html>
<!--[if lt IE 7]><html class="no-js lt-ie9 lt-ie8 lt-ie7"><![endif]-->
<!--[if IE 7]><html class="no-js lt-ie9 lt-ie8 ie7"><![endif]-->
<!--[if IE 8]><html class="no-js lt-ie9 ie8"><![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js">
<!--<![endif]-->
<head>
<title>
MediaMarkt
</title>
<meta charset="utf-8"/>
<link href="https://www.mediamarkt.nl/" rel="canonical"/>
<meta content="" data-id="home" property="pageTypeId"/>
<link href="https://www.mediamarkt.nl/" hreflang="nl-NL" rel="alternate"/>
<meta content="MediaMarkt is de nummer één winkelketen voor consumentenelektronica in Europa. Niet alleen het grootste assortiment onder één dak, maar ook altijd de nieuwste en innovatiefste producten." name="description"/>
<meta content="index,follow" name="robots"/>
<meta content="ium-mHD8QhQJsCT42KhsL1s36YwVG81ZpSLyAVQ06rM" name="google-site-verification"/>
<script data-dtconfig="app=0fe7eb5fcee005af|cuc=ikduodw0|mel=100000|featureHash=ICA2dfgjmqru|lastModification=1612284241790|dtVersion=10207210127152629|tp=500,50,0,1|rdnt=1|uxrgce=1|uxdcw=1500|vs=2|agentUri=/dt/ruxitagentjs_ICA2dfgjmqru_10207210127152629.js|reportUrl=/dt/rb_0ca29162-4b29-4a47-946a-081c13f47ee3|rid=RID_-1677218119|rpid=209441690|domain=mediamarkt.nl" src="/dt/ruxitagentjs_ICA2dfgjmqru_10207210127152629.js" type="text/javascript">
</script>
<link href="/apple-touch-icon-57x57.png" rel="apple-touch-icon" sizes="57x57"/>
<link href="/apple-touch-icon-60x60.png" rel="apple-touch-icon" sizes="60x60"/>
<link href="/apple-touch-icon-72x72.png" rel="apple-touch-icon" sizes="72x72"/>
<link href="/apple-touch-icon-76x76.png" rel="apple-touch-icon" sizes="76x76"/>
<link href="/apple-touch-icon-114x114.png" rel="apple-touch-icon" sizes="114x114"/>
<link href="/apple-touch-icon-120x120.png" rel="apple-touch-icon" sizes="120x120"/>
<link href="/apple-touch-icon-144x144.png" rel="apple-touch-icon" sizes="144x144"/>
<link href="/apple-touch-icon-152x152.png" rel="apple-touch-icon" sizes="152x152"/>
….
</span>
Klantenservice
</a>
</li>
<li class="ms-list__item" data-identifier="meta-navigation-prio1link">
<a class="ms-link ms-header2__meta-nav-link" data-tracking="MediaMarkt Club" href="//www.mediamarkt.nl/nl/shop/mediamarkt-clubinformatie.html" ontouchstart="">
<span class="ms-icon ms-icon--type_star ms-text--icon--default">
</span>
MediaMarkt Club
</a>
</li>
<li class="ms-list__item ms-header2__meta-nav-list-item" data-identifier="marketSelectorPrio1">
<div class="ms-market-selector ms-market-selector--meta-nav">
<span class="ms-dropdown ms-market-selector__dropdown">
<!-- market selector dropdown trigger -->
<span class="ms-dropdown__trigger ms-market-selector__dropdown-trigger ms-market-selector__dropdown-trigger--meta-nav">
<a class="ms-market-selector__button ms-market-selector__button--meta-nav" data-identifier="ms-market-selector__button--meta-nav">
<span class="ms-market-selector__button-label ms-market-selector__button-label--meta-nav" data-identifier="ms-market-selector__button-label--meta-nav">
<span>
Zoek winkel
</span>
</span>
</a>
<button class="ms-market-selector__panel-toggle ms-market-selector__panel-toggle--meta-nav">
<span class="ms-market-selector__panel-toggle-icon ms-market-selector__panel-toggle-icon--met…
print(soup.title)
# <title>MediaMarkt</title>
print(soup.title.string)
# MediaMarkt
print(soup.title.text)
# MediaMarkt
For example, let's try to reach the "Categorieën" heading.
print(soup.find_all("h4"))

<h4 class="ms-link-list__title ms-text--medium ms-link-list__title--active" data-toggle="link-list">
Categorieën
</h4>
Let's use a slightly more specific query:
ctg = soup.find_all("h4", class_="ms-link-list__title ms-text--medium", attrs={"data-toggle": "link-list"})
for i in ctg:
    print(i.text)
From now on you can continue with the Python functions. For example:
print(ctg[0].text)
# Categorieën
For "Winkels":
print(soup.find(attrs={"data-tracking": "Winkels"}).text.strip())
# Winkels
For "Cadeaukaart":
print(soup.find("a", class_="ms-link", attrs={"data-tracking": "Cadeaukaart"}).text.strip())
# Cadeaukaart
Scrapy
Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.
Even though Scrapy was originally designed for web scraping, it can also be used to extract data using APIs (such as Amazon Associates Web Services) or as a general purpose web crawler.
Analyzing robots.txt
Actually, most publishers allow programmers to crawl their websites to some extent. In other words, publishers want only specific portions of their websites to be crawled. To define this, websites must put rules stating which portions can be crawled and which cannot. Such rules are defined in a file called robots.txt.
robots.txt is a human-readable file used to identify the portions of the website that crawlers are allowed, as well as not allowed, to scrape. There is no standard format for the robots.txt file, and the publishers of a website can modify it as per their needs. We can check the robots.txt file of a particular website by appending a slash and robots.txt to the URL of that website. For example, if we want to check it for Google.com, we type https://www.google.com/robots.txt and get something like the following:
User-agent: *
Disallow: /search
Allow: /search/about
Allow: /search/static
Allow: /search/howsearchworks
Disallow: /sdch
Disallow: /groups
Disallow: /index.html?
Disallow: /?
Allow: /?hl=
Disallow: /?hl=*&
Allow: /?hl=*&gws_rd=ssl$
...
Some of the most common rules that are defined in a website’s robots.txt file are as follows:
User-agent: BadCrawler
Disallow: /
The above rule means the robots.txt file asks a crawler with the BadCrawler user agent not to crawl the website.
User-agent: *
Crawl-delay: 5
Disallow: /trap
The above rule means the robots.txt file tells all crawlers to wait 5 seconds between download requests to avoid overloading the server. The /trap link tries to block malicious crawlers that follow disallowed links. Many more rules can be defined by the publisher of the website as per their requirements.
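You can check such rules programmatically with Python's standard-library urllib.robotparser. A sketch that parses the exact rules above (example.com is a placeholder domain; a hypothetical "MyCrawler" user agent stands in for your bot):

```python
import urllib.robotparser

# The robots.txt rules discussed above, fed in directly instead of fetched
rules = """\
User-agent: BadCrawler
Disallow: /

User-agent: *
Crawl-delay: 5
Disallow: /trap
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("BadCrawler", "https://example.com/index.html"))  # False
print(rp.can_fetch("MyCrawler", "https://example.com/index.html"))   # True
print(rp.can_fetch("MyCrawler", "https://example.com/trap"))         # False
print(rp.crawl_delay("MyCrawler"))                                   # 5
```

For a live site you would call rp.set_url("https://example.com/robots.txt") followed by rp.read() instead of parse().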
Installing Scrapy
Scrapy requires Python 3.6+, either the CPython implementation (default) or the PyPy 7.2.0+ implementation (see Alternate Implementations).
If you’re using Anaconda or Miniconda, you can install the package from the conda-forge channel, which has up-to-date packages for Linux, Windows and macOS. To install Scrapy using conda, run:
conda install -c conda-forge scrapy
Alternatively, if you’re already familiar with installation of Python packages, you can install Scrapy and its dependencies from PyPI with:
pip install Scrapy
Note that sometimes this may require solving compilation issues for some Scrapy dependencies depending on your operating system, so be sure to check the platform specific installation notes.
The minimal versions which Scrapy is tested against are: Twisted 14.0, lxml 3.4, pyOpenSSL 0.14.
Using a virtual environment (recommended)
We recommend installing Scrapy inside a virtual environment on all platforms.
Python packages can be installed either globally (a.k.a system wide), or in user-space. We do not recommend installing Scrapy system wide.
Instead, we recommend that you install Scrapy within a so-called “virtual environment” (venv). Virtual environments allow you to not conflict with already-installed Python system packages (which could break some of your system tools and scripts), and still install packages normally with pip (without sudo and the likes).
Once you have created a virtual environment, you can install Scrapy inside it with pip, just like any other Python package. (See platform-specific guides below for non-Python dependencies that you may need to install beforehand).
Platform specific installation notes
Windows
Though it’s possible to install Scrapy on Windows using pip, we recommend you to install Anaconda or Miniconda and use the package from the conda-forge channel, which will avoid most installation issues.
Once you’ve installed Anaconda or Miniconda, install Scrapy with:
conda install -c conda-forge scrapy
Ubuntu 14.04 or above
Scrapy is currently tested with recent-enough versions of lxml, twisted and pyOpenSSL, and is compatible with recent Ubuntu distributions. But it should support older versions of Ubuntu too, like Ubuntu 14.04, albeit with potential issues with TLS connections.
Don’t use the python-scrapy package provided by Ubuntu; it is typically too old and slow to catch up with the latest Scrapy.
To install Scrapy on Ubuntu (or Ubuntu-based) systems, you need to install these dependencies:
sudo apt-get install python3 python3-dev python3-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev
Inside a virtualenv, you can install Scrapy with pip after that:
pip install scrapy
macOS
Building Scrapy’s dependencies requires the presence of a C compiler and development headers. On macOS this is typically provided by Apple’s Xcode development tools. To install the Xcode command line tools open a terminal window and run:
xcode-select --install
(Optional) Install Scrapy inside a Python virtual environment.
This method is a workaround for the above macOS issue, but it’s an overall good practice for managing dependencies and can complement the first method. After any of these workarounds you should be able to install Scrapy:
pip install Scrapy
Creating a Project
Before you start scraping, you will have to set up a new Scrapy project. Enter a directory where you’d like to store your code and run:
scrapy startproject projectname
This will create a projectname directory with the following contents:
projectname/
    scrapy.cfg            # deploy configuration file
    projectname/          # you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # folder where you'll later put your spiders
            __init__.py
Our First Spider
Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass Spider and define the initial requests to make, optionally how to follow links in the pages, and how to parse the downloaded page content to extract data.
This is the code for our first Spider. Save it in a file named example_spider.py under the example/spiders directory in your project:
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        urls = [
            'https://www.bol.com/nl/l/boeken/N/8299/',
            'https://www.kitapyurdu.com/kategori/kitap-edebiyat/128.html',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'example-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')
As you can see, our Spider subclasses scrapy.Spider and defines some attributes and methods:
name:
Identifies the Spider. It must be unique within a project, that is, you can’t set the same name for different Spiders.
start_requests():
Must return an iterable of Requests (you can return a list of requests or write a generator function) which the Spider will begin to crawl from. Subsequent requests will be generated successively from these initial requests.
parse():
A method that will be called to handle the response downloaded for each of the requests made. The response parameter is an instance of TextResponse that holds the page content and has further helpful methods to handle it.
The parse() method usually parses the response, extracting the scraped data as dicts and also finding new URLs to follow, creating new requests (Request) from them.
How to Run Our Spider
To put our spider to work, go to the project’s top level directory and run:
..\example> scrapy crawl example
This command runs the spider with name example that we’ve just added; it will send some requests to the bol.com and kitapyurdu.com domains. You will get output similar to this:
2021-01-12 12:01:21 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: example)
2021-01-12 12:01:21 [scrapy.utils.log] INFO: Versions: lxml 4.6.2.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0,
Python 3.7.4 (tags/v3.7.4:e09359112e, Jul 8 2019, 20:34:20) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1i 8 Dec
2020), cryptography 3.3.1, Platform Windows-10-10.0.18362-SP0
2021-01-12 12:01:21 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-01-12 12:01:21 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'example',
'NEWSPIDER_MODULE': 'example.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['example.spiders']}
…..
2021-01-12 12:01:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.bol.com/robots.txt> (referer: None)
2021-01-12 12:01:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.kitapyurdu.com/robots.txt> (referer: None)
2021-01-12 12:01:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.bol.com/nl/l/boeken/N/8299/> (referer: None)
2021-01-12 12:01:23 [example] DEBUG: Saved file example-8299.html
2021-01-12 12:01:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.kitapyurdu.com/kategori/kitap-edebiyat/128.html>
(referer: None)
2021-01-12 12:01:23 [example] DEBUG: Saved file example-kitap-edebiyat.html
2021-01-12 12:01:23 [scrapy.core.engine] INFO: Closing spider (finished)
2021-01-12 12:01:23 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 929,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 4,
'downloader/response_bytes': 113064,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 1, 12, 11, 1, 23, 939458),
'log_count/DEBUG': 6,
'log_count/INFO': 10,
'response_received_count': 4,
'robotstxt/request_count': 2,
'robotstxt/response_count': 2,
'robotstxt/response_status_count/200': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2021, 1, 12, 11, 1, 22, 740037)}
2021-01-12 12:01:24 [scrapy.core.engine] INFO: Spider closed (finished)
/example>
Now, check the files in the current directory. You should notice that two new files have been created: example-8299.html and example-kitap-edebiyat.html, with the content for the respective URLs, as our parse method instructs.
If you are wondering why we haven’t parsed the HTML yet, hold on, we will cover that soon.
example-8299.html (truncated):
<!DOCTYPE html>
<html class="no-js is-desktop" lang="nl-NL">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1 user-scalable=no">
<meta name="format-detection" content="telephone=no">
<link rel="apple-touch-icon" sizes="180x180" href="/apple-touch-icon.png?v=476aOAdO8j">
<link rel="icon" type="image/png" sizes="32x32" href="/favicon-32x32.png?v=476aOAdO8j">
<link rel="icon" type="image/png" sizes="16x16" href="/favicon-16x16.png?v=476aOAdO8j">
<link rel="manifest" href="/site.webmanifest?v=476aOAdO8j">
<link rel="mask-icon" href="/safari-pinned-tab.svg?v=476aOAdO8j" color="#0000a4">
<link rel="shortcut icon" href="/favicon.ico?v=476aOAdO8j">
<meta name="apple-mobile-web-app-title" content="Bol.com">
<meta name="application-name" content="Bol.com">
<meta name="msapplication-TileColor" content="#0000a4">
<meta name="theme-color" content="#0000a4">
<link rel="canonical" href="https://www.bol.com/nl/l/boeken/N/8299/">
<link rel="alternate" hreflang="nl-BE" href="https://www.bol.com/be/l/boeken/N/8299/">
<link rel="alternate" hreflang="nl-NL" href="https://www.bol.com/nl/l/boeken/N/8299/">

example-kitap-edebiyat.html (truncated):
<!DOCTYPE html>
<html dir="ltr" lang="tr">
<head>
<!-- Google Tag Manager -->
<script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
'https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);
})(window,document,'script','dataLayer','GTM-NHPLZ5F');</script>
<!-- End Google Tag Manager -->
<meta charset="UTF-8" />
<meta name="robots" content="noarchive" />
<meta name="robots" content="index,follow" />
<title> Edebiyat Kitapları - En Yeni ve En Çok Satan Edebiyat Kitapları - kitapyurdu.com</title>
<base href="https://www.kitapyurdu.com/" />
<!--[if IE]><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><![endif]-->
<meta name="google-play-app" content="app-id=com.mobisoft.kitapyurdu">
<meta name="apple-itunes-app" content="app-id=489855982">
<meta name="msvalidate.01" content="DB7793DB335E4894B3A42C1CC528CBD4" />
<meta name="description" content="Edebiyat kategorisinde çok satan yeni çıkan ve tüm kitaplara hızlıca ulaşıp satın alabilirsiniz." />
<meta name="keywords" content="Edebiyat,Edebiyat kategorisine ait kitaplar, Edebiyat kitapları, son çıkan Edebiyat kitapları, en ucuz Edebiyat, en yeni Edebiyat ürünleri, en cok satan Edebiyat kitapları, Edebiyat kitapları yorumları" />
<link href="https://img.kitapyurdu.com/v1/getImage/fn:11194713/wh:cde76d960" rel="shortcut icon" />
<link rel="manifest" href="manifest.json" />
<link rel="stylesheet" type="text/css" href="catalog/view/theme/default/stylesheet/stylesheet.min.css?4.2020.12.01-d10" />
<link rel="stylesheet" type="text/css" href="catalog/view/theme/default/stylesheet/jBox.css?4.2020.12.01-d10" />
<link rel="stylesheet" type="text/css" href="catalog/view/javascript/jquery/rating/jquery.rating.css?4.2020.12.01-d10" media="screen" />
<link rel="stylesheet" type="text/css" href="catalog/view/javascript/jquery/ui-1.12.1/jquery-ui.min.css" />
<link href="https://maxcdn.bootstrapcdn.com/font-awesome/4.7.0/css/font-awesome.min.css" rel="stylesheet"/>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js"></script>
What just happened under the hood?
Scrapy schedules the scrapy.Request objects returned by the start_requests method of the Spider. Upon receiving a response for each one, it instantiates Response objects and calls the callback method associated with the request (in this case, the parse method) passing the response as argument.
A shortcut to the start_requests method:
import scrapy
class QuotesSpider(scrapy.Spider):
name = "quotes"
start_urls = [
'https://www.bol.com/nl/l/boeken/N/8299/',
'https://www.kitapyurdu.com/kategori/kitap-edebiyat/128.html',
]
Extracting Data
With Scrapy shell:
[screenshot: launching the Scrapy shell for the bol.com URL]
Using the shell, you can try selecting elements using CSS with the response object. For example, find Bol.com’s title.
CSS code:
[screenshot: CSS selector returning the title]
XPath:
Besides CSS, Scrapy selectors also support XPath expressions:
[screenshot: XPath selector returning the title]
XPath expressions are very powerful, and are the foundation of Scrapy Selectors. In fact, CSS selectors are converted to XPath under the hood. You can see that if you read closely the text representation of the selector objects in the shell.
While perhaps not as popular as CSS selectors, XPath expressions offer more power: besides navigating the structure, they can also look at the content. Using XPath, you can select things like “the link that contains the text ‘Next Page’”. This makes XPath very fitting to the task of scraping, and we encourage you to learn XPath even if you already know how to construct CSS selectors; it will make scraping much easier.


Example-1: Find Book Names on bol.com
Look for titles in example-8299.html.
Where are they?
<div class="product-title--inline">
  <a class="product-title">De spin die veel scheetjes laat</a>
</div>
All titles are in the same place.
To get the text information:
response.css("div.classname a.classname ::text").get()
or use getall():
response.css("div.product-title--inline a.product-title ::text").get()
Example-2: Find Book Names, Author Names and Prices on https://www.kitapyurdu.com/kategori/kitap-edebiyat/128.html

Look at the website! You can see book names, author names and price details on screen. If you click the "view" button, you can view the HTML code.
Type this in your terminal:
[screenshot: launching the Scrapy shell for the kitapyurdu URL]
You can view the result and interact with the console:
You can find book names, author names and price details within the <li class="mg-b-10">...</li> block, so type the following in the console:
>>> response.css("li.mg-b-10")
>>> products = response.css("li.mg-b-10")
>>> for i in products:
...     print("Name= ", i.css("span ::text").get())
...     print("Author= ", i.css("div.author.compact.ellipsis a ::text").get())
...     print("Price= ", i.css("div.price-new span.value ::text").get())


Extracting Data in our Spider
Let’s get back to our spider. Until now, it doesn’t extract any data in particular, just saves the whole HTML page to a local file. Let’s integrate the extraction logic above into our spider.
A Scrapy spider typically generates many dictionaries containing the data extracted from the page. To do that, we use the yield Python keyword in the callback, as you can see below:
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://www.kitapyurdu.com/kategori/kitap-edebiyat/128.html"]

    def parse(self, response):
        products = response.css("li.mg-b-10")
        for product in products:
            price = product.css("div.price-new span.value ::text").extract()
            author = product.css("div.author.compact.ellipsis a ::text").extract()
            book_name = product.css("div.name.ellipsis span ::text").extract()
            yield {"price": price, "author": author, "book_name": book_name}
In your terminal, type the command below and press Enter:
scrapy crawl example -o example.json
After executing the command, Scrapy writes all the scraped data to the example.json file:
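With the .json feed format, -o example.json serializes every yielded dict into a single JSON array, which you can load back with the stdlib json module. A sketch (the record below is made up, standing in for real scraped items):

```python
import json

# A made-up record with the same shape the spider yields (placeholder values)
sample = [{"price": ["10,00"], "author": ["Author Name"], "book_name": ["Book Title"]}]
with open("example.json", "w", encoding="utf-8") as f:
    json.dump(sample, f, ensure_ascii=False)

# Load the feed back as a list of dicts
with open("example.json", encoding="utf-8") as f:
    data = json.load(f)
print(len(data), data[0]["book_name"])
```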
