Web Scraping
Web Scraping
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites.While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
Selenium Python
Selenium Python bindings provides a simple API to write functional/acceptance tests using Selenium WebDriver. Through Selenium Python API you can access all functionalities of Selenium WebDriver in an intuitive way.
Selenium Python bindings provide a convenient API to access Selenium WebDrivers like Firefox, Ie, Chrome, Remote etc. The current supported Python versions are 3.5 and above.
This documentation explains Selenium 2 WebDriver API. Selenium 1 / Selenium RC API is not covered here.
1.2. Installing Python bindings for Selenium
Use pip to install the selenium package. Python 3 has pip available in the standard library. Using pip, you can install selenium like this:
You may consider using virtualenv to create isolated Python environments. Python 3 has venv which is almost the same as virtualenv.
You can also download Python bindings for Selenium from the PyPI page for selenium package. and install manually.
Drivers
Selenium requires a driver to interface with the chosen browser. Firefox, for example, requires geckodriver, which needs to be installed before the below examples can be run. Make sure it’s in your PATH, e. g., place it in /usr/bin or /usr/local/bin.
Failure to observe this step will give you an error selenium.common.exceptions.WebDriverException: Message: ‘geckodriver’ executable needs to be in PATH.
Other supported browsers will have their own drivers available. Links to some of the more popular browser drivers follow.
Chrome:
Edge:
Firefox:
Safari:
For more information about driver installation, please refer the official documentation
Simple Usage
If you have installed Selenium Python bindings, you can start using it from Python like this.
Example Explained
The selenium.webdriver module provides all the WebDriver implementations. Currently supported WebDriver implementations are Firefox, Chrome, IE and Remote. The Keys class provide keys in the keyboard like RETURN, F1, ALT etc.
Next, the instance of Firefox WebDriver is created.
The driver.get method will navigate to a page given by the URL. WebDriver will wait until the page has fully loaded (that is, the “onload” event has fired) before returning control to your test or script. Be aware that if your page uses a lot of AJAX on load then WebDriver may not know when it has completely loaded:
The next line is an assertion to confirm that title has “Python” word in it:
WebDriver offers a number of ways to find elements using one of the find_element_by_* methods. For example, the input text element can be located by its name attribute using find_element_by_name method. A detailed explanation of finding elements is available in the Locating Elements chapter:
Next, we are sending keys, this is similar to entering keys using your keyboard. Special keys can be sent using Keys class imported from selenium.webdriver.common.keys. To be safe, we’ll first clear any pre-populated text in the input field (e.g. “Search”) so it doesn’t affect our search results:
After submission of the page, you should get the result if there is any. To ensure that some results are found, make an assertion:
Finally, the browser window is closed. You can also call quit method instead of close. The quit will exit entire browser whereas close will close one tab, but if just one tab was open, by default most browser will exit entirely.:
https://selenium-python.readthedocs.io/index.html
Beautiful Soup
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.
Installing Beautiful Soup
If you’re using a recent version of Debian or Ubuntu Linux, you can install Beautiful Soup with the system package manager:
Beautiful Soup 4 is published through PyPI, so if you can’t install it with the system packager, you can install it with easy_install or pip. The package name is beautifulsoup4, and the same package works on Python 2 and Python 3. Make sure you use the right version of pip or easy_install for your Python version (these may be named pip3 and easy_install3 respectively if you’re using Python 3).
The BeautifulSoup package is not what you want. That’s the previous major release, Beautiful Soup 3. Lots of software uses BS3, so it’s still available, but if you’re writing new code you should install beautifulsoup4.
If you don’t have easy_install or pip installed, you can download the Beautiful Soup 4 source tarball and install it with setup.py.
If all else fails, the license for Beautiful Soup allows you to package the entire library with your application. You can download the tarball, copy its bs4 directory into your application’s codebase, and use Beautiful Soup without installing it at all.
Example: https://www.mediamarkt.nl/
Installing a Parser
Beautiful Soup supports the HTML parser included in Python’s standard library, but it also supports a number of third-party Python parsers. One is the lxml parser. Depending on your setup, you might install lxml with one of these commands:
Another alternative is the pure-Python html5lib parser, which parses HTML the way a web browser does. Depending on your setup, you might install html5lib with one of these commands:
Making The Soup
To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:
First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:
Beautiful Soup then parses the document using the best available parser. It will use an HTML parser unless you specifically tell it to use an XML parser. (See Parsing XML.)
Let's continue our example; https://www.mediamarkt.nl/
The prettify() method will turn a Beautiful Soup parse tree into a nicely formatted Unicode string, with a separate line for each tag and each string
You can call prettify() on the top-level BeautifulSoup object, or on any of its Tag objects: -Since it adds whitespace (in the form of newlines), prettify() changes the meaning of an HTML document and should not be used to reformat one.
The goal of prettify() is to help you visually understand the structure of the documents you work with.
For example, let's try to reach "Categorieen" article.
Let's do a little more special function:
From now on you can continue with the Python functions. For example:
For "Winkels":
For "Cadeaukaart":
Scrapy
Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.
Even though Scrapy was originally designed for web scraping, it can also be used to extract data using APIs (such as Amazon Associates Web Services) or as a general purpose web crawler.
Analyzing robots.txt
Actually most of the publishers allow programmers to crawl their websites at some extent. In other sense, publishers want specific portions of the websites to be crawled. To define this, websites must put some rules for stating which portions can be crawled and which cannot be. Such rules are defined in a file called robots.txt.
robots.txt is human readable file used to identify the portions of the website that crawlers are allowed as well as not allowed to scrape. There is no standard format of robots.txt file and the publishers of website can do modifications as per their needs. We can check the robots.txt file for a particular website by providing a slash and robots.txt after url of that website. For example, if we want to check it for Google.com, then we need to type https://www.google.com/robots.txt and we will get something as follows:
Some of the most common rules that are defined in a website’s robots.txt file are as follows:
The above rule means the robots.txt file asks a crawler with BadCrawler user agent not to crawl their website.
The above rule means the robots.txt file delays a crawler for 5 seconds between download requests for all user-agents for avoiding overloading server. The /trap link will try to block malicious crawlers who follow disallowed links. There are many more rules that can be defined by the publisher of the website as per their requirements.
Installing Scrapy
Scrapy requires Python 3.6+, either the CPython implementation (default) or the PyPy 7.2.0+ implementation (see Alternate Implementations).
If you’re using Anaconda or Miniconda, you can install the package from the condaforge channel, which has up-to-date packages for Linux, Windows and macOS.
If you’re using Anaconda or Miniconda, you can install the package from the condaforge channel, which has up-to-date packages for Linux, Windows and macOS. To install Scrapy using conda, run:
Alternatively, if you’re already familiar with installation of Python packages, you can install Scrapy and its dependencies from PyPI with:
Note that sometimes this may require solving compilation issues for some Scrapy dependencies depending on your operating system, so be sure to check the platform specific installation notes.
The minimal versions which Scrapy is tested against are: Twisted 14.0, lxml 3.4, pyOpenSSL 0.14.
Using a virtual environment (recommended)
We recommend installing Scrapy inside a virtual environment on all platforms.
Python packages can be installed either globally (a.k.a system wide), or in user-space. We do not recommend installing Scrapy system wide.
Instead, we recommend that you install Scrapy within a so-called “virtual environment” (venv). Virtual environments allow you to not conflict with already-installed Python system packages (which could break some of your system tools and scripts), and still install packages normally with pip (without sudo and the likes).
Once you have created a virtual environment, you can install Scrapy inside it with pip, just like any other Python package. (See platform-specific guides below for non-Python dependencies that you may need to install beforehand).
Platform specific installation notes
Windows
Though it’s possible to install Scrapy on Windows using pip, we recommend you to install Anaconda or Miniconda and use the package from the conda-forge channel, which will avoid most installation issues.
Once you’ve installed Anaconda or Miniconda, install Scrapy with:
Ubuntu 14.04 or above
Scrapy is currently tested with recent-enough versions of lxml, twisted and pyOpenSSL, and is compatible with recent Ubuntu distributions. But it should support older versions of Ubuntu too, like Ubuntu 14.04, albeit with potential issues with TLS connections.
Don’t use the python-scrapy package provided by Ubuntu, they are typically too old and slow to catch up with latest Scrapy.
To install Scrapy on Ubuntu (or Ubuntu-based) systems, you need to install these dependencies:
Inside a virtualenv, you can install Scrapy with pip after that:
The same non-Python dependencies can be used to install Scrapy in Debian Jessie (8.0) and above.
macOS
Building Scrapy’s dependencies requires the presence of a C compiler and development headers. On macOS this is typically provided by Apple’s Xcode development tools. To install the Xcode command line tools open a terminal window and run:
(Optional) Install Scrapy inside a Python virtual environment.
This method is a workaround for the above macOS issue, but it’s an overall good practice for managing dependencies and can complement the first method. After any of these workarounds you should be able to install Scrapy:
Creating a Project
Before you start scraping, you will have to set up a new Scrapy project. Enter a directory where you’d like to store your code and run:
This will create a tutorial directory with the following contents:
projectname/
scrapy.cfg #deploy configuration file
tutorial/ #you'll import your code from here
__init__.py
items.py # project items definition file
items.py # project items definition file
middlewares.py # project middlewares file
pipelines.py # project pipelines file
settings.py # project settings file
spiders/ # folder you'll later put the spiders
__init__.py
Our First Spider
Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass Spider and define the initial requests to make, optionally how to follow links in the pages, and how to parse the downloaded page content to extract data.
This is the code for our first Spider. Save it in a file named : example_spider.py
Under the example/spiders
directory in your project:
As you can see, our Spider subclasses scrapy.Spider
and defines some attributes and methods:
name:
Identifies the Spider. It must be unique within a project, that is, you can’t set the same name for different Spiders.
start_requests():
Must return an iterable of Requests (you can return a list of requests or write a generator function) which the Spider will begin to crawl from. Subsequent requests will be generated successively from these initial requests.
parse():
A method that will be called to handle the response downloaded for each of the requests made. The response parameter is an instance of TextResponse that holds the page content and has further helpful methods to handle it.
The parse()
method usually parses the response, extracting the scraped data as dicts and also finding new URLs to follow and creating new requests (Request) from them.
How to Run Our Spider
To put our spider to work, go to the project’s top level directory and run:
This command runs the spider with name example that we’ve just added, that will send some requests for the bol.com and kitapyurdu.com domain. You will get an output similar to this:
Now, check the files in the current directory. You should notice that two new files have been created:
example-bol-1.html
and example-kitapyurdu-2.html
, with the content for the respective URLs, as our parse method instructs.
If you are wondering why we haven’t parsed the HTML yet, hold on, we will cover that soon.
What just happened under the hood?
Scrapy schedules the scrapy.Request objects returned by the start_requests method of the Spider. Upon receiving a response for each one, it instantiates Response objects and calls the callback method associated with the request (in this case, the parse method) passing the response as argument.
A shortcut to the start_requests method:
Extracting Data
With Scrapy shell:
Using the shell, you can try selecting elements using CSS with the response object:
Find Bol.com’s title;
CSS code:
XPath:
Besides CSS, Scrapy selectors also support using XPath expressions:
XPath expressions are very powerful, and are the foundation of Scrapy Selectors. In fact, CSS selectors are converted to XPath under-the-hood. You can see that if you read closely the text representation of the selector objects in the shell.
While perhaps not as popular as CSS selectors, XPath expressions offer more power because besides navigating the structure, it can also look at the content. Using XPath, you’re able to select things like: select the link that contains the text “Next Page”. This makes XPath very fitting to the task of scraping, and we encourage you to learn XPath even if you already know how to construct CSS selectors, it will make scraping much easier.
Example-1: Find Book Names on bol.com
Look for titles in example-bol-1.html
Where are they?
<div class="produc-title--inline"> <a class="product-title">
De spin die veel scheetjes laat
</div>
All titles are in the same place.
Get text information;
response.css(“div.classname a.classname :: text”).get()
or getall()
response.css(“div.product-title—inline a.product-title :: text”).get()
Example-2: Find Book Names, Author Names and Price on https://www.kitapyurdu.com/kategori/kitapedebiyat/128.html
Look at website! You can see book names, author names and price details on screen. If you click the "view" button, you can view the html codes like this:
Type this on your terminal:
You can view the result and interact with the console:
You can find book names, author names and price details within <li class="mg-b-10">...</li>
block, so type the following in the console:
Extracting Data in our Spider
Let’s get back to our spider. Until now, it doesn’t extract any data in particular, just saves the whole HTML page to a local file. Let’s integrate the extraction logic above into our spider.
A Scrapy spider typically generates many dictionaries containing the data extracted from the page. To do that, we use the yield
Python keyword in the callback, as you can see below:
In your terminal type the below code and press enter:
After executing the code, it writes all scraped data to 'example.json' file:
Last updated