Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
Selenium Python
Selenium Python bindings provide a simple API to write functional/acceptance tests using Selenium WebDriver. Through the Selenium Python API you can access all functionalities of Selenium WebDriver in an intuitive way.
Selenium Python bindings provide a convenient API to access Selenium WebDrivers like Firefox, IE, Chrome, Remote, etc. The currently supported Python versions are 3.5 and above.
This documentation explains Selenium 2 WebDriver API. Selenium 1 / Selenium RC API is not covered here.
Installing Python bindings for Selenium
Use pip to install the selenium package. Python 3 has pip available in the standard library. Using pip, you can install selenium like this:
pip install selenium
You may consider using virtualenv to create isolated Python environments. Python 3 has venv, which is almost the same as virtualenv.
You can also download the Python bindings for Selenium from the PyPI page for the selenium package and install them manually.
Drivers
Selenium requires a driver to interface with the chosen browser. Firefox, for example, requires geckodriver, which needs to be installed before the examples below can be run. Make sure it is in your PATH, e.g., place it in /usr/bin or /usr/local/bin.
Failure to observe this step will give you the error selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH.
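One way to catch this problem before a browser is launched is to look the driver up on the PATH from Python. The sketch below uses only the standard library; the helper name find_driver is an illustrative assumption and is not part of Selenium.

```python
import shutil


def find_driver(executable="geckodriver"):
    """Return the full path of a driver executable found on PATH, or None."""
    return shutil.which(executable)


path = find_driver()
if path is None:
    print("geckodriver was not found on PATH; install it or add its folder to PATH")
else:
    print(f"geckodriver found at {path}")
```

The same check works for other drivers by passing a different executable name to find_driver.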
Other supported browsers will have their own drivers available. Links to some of the more popular browser drivers follow.
Chrome:
Edge:
Firefox:
Safari:
Simple Usage
If you have installed Selenium Python bindings, you can start using it from Python like this.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("http://www.python.org")
assert "Python" in driver.title
elem = driver.find_element_by_name("q")
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)
assert "No results found." not in driver.page_source
driver.close()
Example Explained
The selenium.webdriver module provides all the WebDriver implementations. Currently supported WebDriver implementations are Firefox, Chrome, IE and Remote. The Keys class provides keys on the keyboard like RETURN, F1, ALT, etc.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
Next, an instance of the Firefox WebDriver is created.
driver = webdriver.Firefox()
The driver.get method will navigate to a page given by the URL. WebDriver will wait until the page has fully loaded (that is, the “onload” event has fired) before returning control to your test or script. Be aware that if your page uses a lot of AJAX on load then WebDriver may not know when it has completely loaded:
driver.get("http://www.python.org")
The next line is an assertion to confirm that the title contains the word “Python”:
assert "Python" in driver.title
Next, the search box is located by its name attribute:
elem = driver.find_element_by_name("q")
Next, we send keys, which is similar to typing on your keyboard. Special keys can be sent using the Keys class imported from selenium.webdriver.common.keys. To be safe, we first clear any pre-populated text in the input field (e.g. “Search”) so it doesn’t affect our search results:
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)
After submission of the page, you should get the result if there is any. To ensure that some results are found, make an assertion:
assert "No results found." not in driver.page_source
Finally, the browser window is closed. You can also call the quit method instead of close. quit exits the entire browser, whereas close closes a single tab; if only one tab was open, most browsers will by default exit entirely:
driver.close()
Beautiful Soup
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.
Installing Beautiful Soup
If you’re using a recent version of Debian or Ubuntu Linux, you can install Beautiful Soup with the system package manager:
apt-get install python3-bs4 (for Python 3)
Beautiful Soup 4 is published through PyPI, so if you can’t install it with the system packager, you can install it with easy_install or pip. The package name is beautifulsoup4, and the same package works on Python 2 and Python 3. Make sure you use the right version of pip or easy_install for your Python version (these may be named pip3 and easy_install3 respectively if you’re using Python 3).
The BeautifulSoup package is not what you want. That’s the previous major release, Beautiful Soup 3. Lots of software uses BS3, so it’s still available, but if you’re writing new code you should install beautifulsoup4.
If you don’t have easy_install or pip installed, you can download the Beautiful Soup 4 source tarball and install it with setup.py.
python setup.py install
If all else fails, the license for Beautiful Soup allows you to package the entire library with your application. You can download the tarball, copy its bs4 directory into your application’s codebase, and use Beautiful Soup without installing it at all.
<Response [200]>
b'<!DOCTYPE html>
<!--[if lt IE 7]><html class="no-js lt-ie9 lt-ie8 lt-ie7"><![endif]-->
<!--[if IE 7]><html class="no-js lt-ie9 lt-ie8 ie7"><![endif]-->
<!--[if IE 8]><html class="no-js lt-ie9 ie8"><![endif]-->
<!--[if gt IE 8]><!--><html class="no-js"><!--<![endif]-->
<head>
<title>MediaMarkt</title>
<meta charset="utf-8" />
<link rel="canonical" href="https://www.mediamarkt.nl/" />
<meta property="pageTypeId" content="" data-id="home" />
<link rel="alternate" hreflang="nl-NL" href="https://www.mediamarkt.nl/" />
<meta name="description" content="MediaMarkt is de nummer
één winkelketen voor consumentenelektronica in Europa. Niet alleen
het grootste assortiment onder één dak, maar ook altijd de nieuwste
en innovatiefste producten." />
<meta name="robots" content="index,follow" />
<meta name="google-site-verification" content="iummHD8QhQJsCT42KhsL1s36YwVG81ZpSLyAVQ06rM"
/>
<script type="text/javascript"
src="/dt/ruxitagentjs_ICA2dfgjmqru_10207210127152629.js" datadtconfig="
app=0fe7eb5fcee005af|cuc=ikduodw0|mel=100000|featureHash=ICA2dfgjmqr
u|lastModification=1612284241790|dtVersion=10207210127152629|tp=500,50,0,1|rdn
t=1|uxrgce=1|uxdcw=1500|vs=2|agentUri=/dt/ruxitagentjs_ICA2dfgjmqru_1020721012
7152629.js|reportUrl=/dt/rb_0ca29162-4b29-4a47-946a-081c13f47ee3|rid=RID_-
1677218119|rpid=-963955369|domain=mediamarkt.nl"></script><link rel="appletouch-
icon" sizes="57x57" href="/apple-touch-icon-57x57.png">
<link rel="icon" type="image/png" href="/android-chrome-192x192.png"
sizes="192x192">
<link rel="icon" type="image/png" href="/favicon-16x16.png" sizes="16x16">
<link rel="icon" type="image/png" href="/favicon-32x32.png" sizes="32x32">
<link rel="manifest" href="/manifest.json">
<link rel="mask-icon" href="/safari-pinned-tab.svg" color="#000000">
<meta name="msapplication-TileColor" content="#000000">
<meta name="msapplication-TileImage" content="/mstile-144x144.png">
<meta property="product-container" content="/nl/productcontainer/products.json"
data-param="catEntryId" datacallback="
mcs.productContainer.initProducts" />
<meta property="agerating" content="" data-cb-showlinks="
mcs.displayMediaPlayerLinks"
data-cb-display-layer="mcs.displayAgeRatingLayer" data-cbopen-
player="mcs.openMediaPlayer" /> ...continue
Installing a Parser
Beautiful Soup supports the HTML parser included in Python’s standard library, but it also supports a number of third-party Python parsers. One is the lxml parser. Depending on your setup, you might install lxml with one of these commands:
apt-get install python3-lxml
pip install lxml
Another alternative is the pure-Python html5lib parser, which parses HTML the way a web browser does. Depending on your setup, you might install html5lib with one of these commands:
apt-get install python3-html5lib
pip install html5lib
To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:
from bs4 import BeautifulSoup
with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')
soup = BeautifulSoup("<html>a web page</html>", 'html.parser')
First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:
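A minimal sketch of that conversion, assuming beautifulsoup4 is installed and using the built-in html.parser; the markup string is an illustrative input.

```python
from bs4 import BeautifulSoup

# Markup containing the HTML entity &eacute; (an illustrative input string)
markup = "<html><head></head><body>Sacr&eacute; bleu!</body></html>"
soup = BeautifulSoup(markup, "html.parser")

# The entity is now a real Unicode character in the parsed tree
body_text = soup.body.string
print(body_text)  # prints: Sacré bleu!
```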
The prettify() method will turn a Beautiful Soup parse tree into a nicely formatted Unicode string, with a separate line for each tag and each string.
You can call prettify() on the top-level BeautifulSoup object, or on any of its Tag objects. Since it adds whitespace (in the form of newlines), prettify() changes the meaning of an HTML document and should not be used to reformat one.
The goal of prettify() is to help you visually understand the structure of the documents you work with.
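As a short illustration, the sketch below calls prettify() once on the whole document and once on a single tag. It assumes beautifulsoup4 is installed; the markup and the example.com link are made up for the example.

```python
from bs4 import BeautifulSoup

markup = '<html><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></html>'
soup = BeautifulSoup(markup, "html.parser")

# Whole document: one tag or string per line
pretty_document = soup.prettify()
print(pretty_document)

# Only the <a> tag and its children
pretty_link = soup.a.prettify()
print(pretty_link)
```

Printing both strings side by side shows that the second output contains only the link subtree, which is handy when inspecting one part of a large page.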
Scrapy
Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.
Even though Scrapy was originally designed for web scraping, it can also be used to extract data using APIs (such as Amazon Associates Web Services) or as a general purpose web crawler.
Analyzing robots.txt
Most publishers allow programmers to crawl their websites to some extent; in other words, publishers want only specific portions of their websites to be crawled. To define this, a website publishes rules stating which portions can be crawled and which cannot. These rules are defined in a file called robots.txt.
Some of the most common rules that are defined in a website’s robots.txt file are as follows:
User-agent: BadCrawler
Disallow: /
The above rule means the robots.txt file asks a crawler with the BadCrawler user agent not to crawl the website.
User-agent: *
Crawl-delay: 5
Disallow: /trap
The above rule means the robots.txt file asks crawlers of all user agents to wait 5 seconds between download requests, to avoid overloading the server. The Disallow: /trap rule marks a path that compliant crawlers never visit, so the site can use it to detect and block malicious crawlers that follow disallowed links. Publishers can define many more rules according to their requirements.
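These rules can also be evaluated programmatically. The sketch below feeds the two rule groups shown above to the standard library's urllib.robotparser and asks what each crawler may fetch; the example.com URLs and the GoodCrawler agent name are placeholders chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# The two rule groups discussed above, joined into one robots.txt body
ROBOTS_TXT = """\
User-agent: BadCrawler
Disallow: /

User-agent: *
Crawl-delay: 5
Disallow: /trap
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# BadCrawler is barred from the whole site
bad_allowed = parser.can_fetch("BadCrawler", "https://example.com/index.html")
# Any other agent may fetch normal pages but not /trap
good_allowed = parser.can_fetch("GoodCrawler", "https://example.com/index.html")
trap_allowed = parser.can_fetch("GoodCrawler", "https://example.com/trap")
delay = parser.crawl_delay("GoodCrawler")

print(bad_allowed, good_allowed, trap_allowed, delay)  # prints: False True False 5
```

For a live site, RobotFileParser also offers set_url() and read() to download the file instead of parsing a string.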
Installing Scrapy
Scrapy requires Python 3.6+, either the CPython implementation (default) or the PyPy 7.2.0+ implementation (see Alternate Implementations).
If you’re using Anaconda or Miniconda, you can install the package from the conda-forge channel, which has up-to-date packages for Linux, Windows and macOS. To install Scrapy using conda, run:
conda install -c conda-forge scrapy
Alternatively, if you’re already familiar with installation of Python packages, you can install Scrapy and its dependencies from PyPI with:
pip install Scrapy
Note that sometimes this may require solving compilation issues for some Scrapy dependencies depending on your operating system, so be sure to check the platform specific installation notes.
The minimal versions which Scrapy is tested against are: Twisted 14.0, lxml 3.4, pyOpenSSL 0.14.
Using a virtual environment (recommended)
We recommend installing Scrapy inside a virtual environment on all platforms.
Python packages can be installed either globally (a.k.a. system-wide) or in user space. We do not recommend installing Scrapy system-wide.
Instead, we recommend that you install Scrapy within a so-called “virtual environment” (venv). Virtual environments let you avoid conflicts with already-installed Python system packages (which could break some of your system tools and scripts), and still install packages normally with pip (without sudo and the like).
Once you have created a virtual environment, you can install Scrapy inside it with pip, just like any other Python package. (See platform-specific guides below for non-Python dependencies that you may need to install beforehand).
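Virtual environments are usually created from the command line, but the standard library's venv module can do the same from Python. The sketch below is one possible way to do it; the helper name create_environment and the scrapy-env directory name mentioned afterwards are illustrative assumptions.

```python
import venv
from pathlib import Path


def create_environment(target_dir, with_pip=True):
    """Create a virtual environment at target_dir and return its path."""
    # with_pip=True bootstraps pip inside the new environment
    builder = venv.EnvBuilder(with_pip=with_pip)
    builder.create(target_dir)
    return Path(target_dir)
```

Calling create_environment("scrapy-env") has roughly the same effect as running python -m venv scrapy-env. After activating the environment, pip install scrapy installs Scrapy into it rather than into the system Python.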
Platform specific installation notes
Windows
Though it’s possible to install Scrapy on Windows using pip, we recommend that you install Anaconda or Miniconda and use the package from the conda-forge channel, which will avoid most installation issues.
Once you’ve installed Anaconda or Miniconda, install Scrapy with:
conda install -c conda-forge scrapy
Ubuntu 14.04 or above
Scrapy is currently tested with recent-enough versions of lxml, twisted and pyOpenSSL, and is compatible with recent Ubuntu distributions. But it should support older versions of Ubuntu too, like Ubuntu 14.04, albeit with potential issues with TLS connections.
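Don’t use the python-scrapy package provided by Ubuntu; it is typically too old and slow to catch up with the latest Scrapy.
To install Scrapy on Ubuntu (or Ubuntu-based) systems, you need to install these dependencies:
sudo apt-get install python3 python3-dev python3-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev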
Inside a virtualenv, you can install Scrapy with pip after that:
pip install scrapy
The same non-Python dependencies can be used to install Scrapy in Debian Jessie (8.0) and above.
macOS
Building Scrapy’s dependencies requires the presence of a C compiler and development headers. On macOS this is typically provided by Apple’s Xcode development tools. To install the Xcode command line tools open a terminal window and run:
xcode-select --install
(Optional) Install Scrapy inside a Python virtual environment.
This is good practice for managing dependencies in general. Once the Xcode command line tools are installed, you should be able to install Scrapy:
pip install Scrapy
Creating a Project
Before you start scraping, you will have to set up a new Scrapy project. Enter a directory where you’d like to store your code and run:
scrapy startproject projectname
This will create a projectname directory with the following contents:
projectname/
    scrapy.cfg            # deploy configuration file
    projectname/          # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # folder where you'll later put the spiders
            __init__.py
Our First Spider
Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass Spider and define the initial requests to make, optionally how to follow links in the pages, and how to parse the downloaded page content to extract data.
This is the code for our first Spider. Save it in a file named example_spider.py under the spiders directory of your project (example/spiders if the project is named example):
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        urls = [
            'https://www.bol.com/nl/l/boeken/N/8299/',
            'https://www.kitapyurdu.com/kategori/kitap-edebiyat/128.html',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Use the second-to-last URL segment to build the file name
        page = response.url.split("/")[-2]
        filename = f'example-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')
As you can see, our Spider subclasses scrapy.Spider and defines some attributes and methods:
name: Identifies the Spider. It must be unique within a project, that is, you can’t set the same name for different Spiders.
start_requests(): Must return an iterable of Requests (you can return a list of requests or write a generator function) which the Spider will begin to crawl from. Subsequent requests will be generated successively from these initial requests.
parse(): A method that will be called to handle the response downloaded for each of the requests made. The response parameter is an instance of TextResponse that holds the page content and has further helpful methods to handle it.
The parse() method usually parses the response, extracting the scraped data as dicts and also finding new URLs to follow and creating new requests (Request) from them.
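To see which files the parse() method of the spider above would write, the file-naming step can be run on its own. This is a small standalone sketch of that one step; the helper name filename_for is an illustrative assumption.

```python
def filename_for(url):
    """Mirror the naming logic of parse(): use the second-to-last URL segment."""
    page = url.split("/")[-2]
    return f"example-{page}.html"


urls = [
    "https://www.bol.com/nl/l/boeken/N/8299/",
    "https://www.kitapyurdu.com/kategori/kitap-edebiyat/128.html",
]
for url in urls:
    print(filename_for(url))
# prints:
# example-8299.html
# example-kitap-edebiyat.html
```

The first URL ends with a slash, so its last segment is empty and the second-to-last one is the category number; the second URL has no trailing slash, so the same index picks the category name instead.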
How to Run Our Spider
To put our spider to work, go to the project’s top level directory and run:
..\example> scrapy crawl example
This command runs the spider named example that we’ve just added, which sends requests to the bol.com and kitapyurdu.com domains. Each response body is saved to a local file whose contents look similar to this:
<!DOCTYPE html>
<html dir="ltr" lang="tr">
<head>
<!-- Google Tag Manager -->
<script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
'https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);
})(window,document,'script','dataLayer','GTM-NHPLZ5F');</script>
<!-- End Google Tag Manager -->
<meta charset="UTF-8" />
<meta name="robots" content="noarchive" />
<meta name="robots" content="index,follow" />
<title> Edebiyat Kitapları - En Yeni ve En Çok Satan Edebiyat Kitapları - kitapyurdu.com</title>
<base href="https://www.kitapyurdu.com/" />
<!--[if IE]><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><![endif]-->
<meta name="google-play-app" content="app-id=com.mobisoft.kitapyurdu">
<meta name="apple-itunes-app" content="app-id=489855982">
<meta name="msvalidate.01" content="DB7793DB335E4894B3A42C1CC528CBD4" />
<meta name="description" content="Edebiyat kategorisinde çok satan yeni çıkan ve tüm kitaplara hızlıca
ulaşıp satın alabilirsiniz." />
<meta name="keywords" content="Edebiyat,Edebiyat kategorisine ait kitaplar, Edebiyat kitapları, son çık
an Edebiyat kitapları, en ucuz Edebiyat, en yeni Edebiyat ürünleri, en cok satan Edebiyat kitapları,
Edebiyat kitapları yorumları" />
<link href="https://img.kitapyurdu.com/v1/getImage/fn:11194713/wh:cde76d960" rel="shortcut icon" />
<link rel="manifest" href="manifest.json" />
<link rel="stylesheet" type="text/css" href="catalog/view/theme/default/stylesheet/stylesheet.min.css?4
.2020.12.01-d10" />
<link rel="stylesheet" type="text/css" href="catalog/view/theme/default/stylesheet/jBox.css?4.2020.12.0
1-d10" />
<link rel="stylesheet" type="text/css" href="catalog/view/javascript/jquery/rating/jquery.rating.css?4.
2020.12.01-d10" media="screen" />
<link rel="stylesheet" type="text/css" href="catalog/view/javascript/jquery/ui-1.12.1/jqueryui.
min.css" />
<link href="https://maxcdn.bootstrapcdn.com/font-awesome/4.7.0/css/fontawesome.
min.css" rel="stylesheet"/>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js"></script>
What just happened under the hood?
Scrapy schedules the scrapy.Request objects returned by the start_requests method of the Spider. Upon receiving a response for each one, it instantiates Response objects and calls the callback method associated with the request (in this case, the parse method) passing the response as argument.
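A shortcut to the start_requests method: instead of implementing start_requests(), you can define a start_urls class attribute with a list of URLs. The default implementation of start_requests() uses this list to create the initial requests, and parse() is called for each response as the default callback:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://www.bol.com/nl/l/boeken/N/8299/',
        'https://www.kitapyurdu.com/kategori/kitap-edebiyat/128.html',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'example-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)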
Extracting Data
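With Scrapy shell:
scrapy shell "https://www.bol.com/nl/l/boeken/N/8299/"
Using the shell, you can try selecting elements using CSS with the response object. For example, to find Bol.com’s title:
CSS:
>>> response.css("title::text").get()
Besides CSS, Scrapy selectors also support using XPath expressions:
XPath:
>>> response.xpath("//title/text()").get()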
XPath expressions are very powerful, and are the foundation of Scrapy Selectors. In fact, CSS selectors are converted to XPath under-the-hood. You can see that if you read closely the text representation of the selector objects in the shell.
While perhaps not as popular as CSS selectors, XPath expressions offer more power because, besides navigating the structure, they can also look at the content. Using XPath, you’re able to select things like the link that contains the text “Next Page”. This makes XPath very well suited to scraping, and we encourage you to learn XPath even if you already know how to construct CSS selectors; it will make scraping much easier.
Example-1: Find Book Names on bol.com
Look for titles in example-bol-1.html. Where are they?
<div class="product-title--inline">
    <a class="product-title">De spin die veel scheetjes laat</a>
</div>
All titles are in the same place.
Get the text information:
response.css("div.product-title--inline a.product-title::text").get()
Use getall() instead of get() to return every match rather than only the first.
Example-2: Find Book Names, Author Names and Price on https://www.kitapyurdu.com/kategori/kitap-edebiyat/128.html
Look at the website: book names, author names and price details are shown on screen. If you click the "view" button, you can view the HTML code of the page.
Type this in your terminal:
scrapy shell "https://www.kitapyurdu.com/kategori/kitap-edebiyat/128.html"
You can view the result and interact with the console.
You can find book names, author names and price details within each <li class="mg-b-10">...</li> block, so type the following in the console:
>>> response.css("li.mg-b-10")
>>> products = response.css("li.mg-b-10")
>>> for i in products:
...     print("Name= ", i.css("span ::text").get())
...     print("Author= ", i.css("div.author.compact.ellipsis a ::text").get())
...     print("Price= ", i.css("div.price-new span.value ::text").get())
Extracting Data in our Spider
Let’s get back to our spider. Until now, it doesn’t extract any data in particular, just saves the whole HTML page to a local file. Let’s integrate the extraction logic above into our spider.
A Scrapy spider typically generates many dictionaries containing the data extracted from the page. To do that, we use the yield Python keyword in the callback, as you can see below:
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://www.kitapyurdu.com/kategori/kitap-edebiyat/128.html"]

    def parse(self, response):
        # Each <li class="mg-b-10"> block holds one book
        products = response.css("li.mg-b-10")
        for product in products:
            price = product.css("div.price-new span.value ::text").extract()
            author = product.css("div.author.compact.ellipsis a ::text").extract()
            book_name = product.css("div.name.ellipsis span ::text").extract()
            yield {"price": price, "author": author, "book_name": book_name}
In your terminal, type the command below and press Enter:
scrapy crawl example -o example.json
After the command runs, all scraped data is written to the example.json file.
For more information about driver installation, please refer to the official Selenium documentation.
WebDriver offers a number of ways to find elements using one of the find_element_by_* methods. For example, the input text element can be located by its name attribute using the find_element_by_name method. A detailed explanation of finding elements is available in the Locating Elements chapter of the Selenium Python documentation.
Beautiful Soup then parses the document using the best available parser. It will use an HTML parser unless you specifically tell it to use an XML parser. (See the Parsing XML section of the Beautiful Soup documentation.)
robots.txt is a human-readable file that identifies which portions of a website crawlers are allowed, and not allowed, to scrape. The format follows the Robots Exclusion Protocol, and website publishers can adapt the rules to their needs. We can check the robots.txt file of a particular website by appending a slash and robots.txt to its URL. For example, to check it for Google.com, we would visit https://www.google.com/robots.txt.