Web scraping is a technique for automatically extracting information from the web using software that simulates human web surfing.
Over the past few years, web scraping has become popular because it is an efficient way to extract the information you need from various websites so that it can be used for further research.
Usage Of Data Scraping
It is common to use data found on the internet when performing data science tasks. Usually, you can access this data in CSV format or through an Application Programming Interface (API). There are times, however, when the data you want can only be accessed as part of a web page. In such cases, you will want to use a technique called web scraping to get the data from the web page into a format you can work with in your analysis.
Components Of A Web Page
Our web browser makes a request to a web server when we visit a web page. This request is called a GET request, since we are requesting files from the server. The server then sends back files that tell our browser how to render the page.
The files fall into some key types:
- HTML — contains the main content of the page.
- CSS — adds styling to make the page look nicer.
- JS — JavaScript files add interactivity to web pages.
- Images — image formats, such as JPG and PNG, allow images to be displayed on web pages.

After our browser receives all of these files, it renders the page and displays it to us. There's a lot going on behind the scenes to render a page nicely, but we don't have to worry about most of it when we're web scraping. When we do web scraping, we're interested in the web page's main content, so we look at the HTML.
Requirements For Web Scraping
You will be using Python 3 and Python virtual environments.
Depending on your needs, you will need to install a few of the following packages:
- Scrapy to make HTTP requests and to process HTML
- BeautifulSoup to handle all your HTML parsing
- Requests to make your HTTP requests
- Selenium for browser automation
Scrapy
Scrapy is an application framework for writing web spiders that crawl websites and extract data from them. Scrapy provides an integrated data extraction mechanism (called selectors), but you can easily use BeautifulSoup (or lxml) instead if you feel more comfortable working with them.
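To make this concrete, here is a minimal sketch of a Scrapy spider. It targets quotes.toscrape.com, a public practice site for scrapers, so the CSS selectors below are specific to that page:

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    # A tiny spider that crawls a practice site and yields structured items.
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Scrapy's built-in selectors extract data with CSS expressions.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

        # Follow the pagination link until the site runs out of pages.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Saved as quotes_spider.py, this runs with `scrapy runspider quotes_spider.py -o quotes.json`, with no full project setup required.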
BeautifulSoup
On the other hand, BeautifulSoup is a useful utility that allows a programmer to pick out specific elements from a webpage (for example, a list of images). As such, BeautifulSoup alone isn't enough, because you actually have to get the webpage first, which leads people to use something like Requests or urllib to do that part. These tools act like a web browser and fetch pages from the internet so that BeautifulSoup can then pick out the pieces you're interested in.
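As a small sketch of that division of labor, BeautifulSoup happily parses HTML you already have in hand; here a hard-coded string stands in for a downloaded page:

```python
from bs4 import BeautifulSoup

# Stand-in for a page that Requests or urllib would fetch for us.
html = """
<html><body>
  <img src="/logo.png"> <img src="/banner.jpg">
  <p>Some text we don't care about.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Pick out specific elements -- here, the src of every image on the page.
print([img["src"] for img in soup.find_all("img")])
# ['/logo.png', '/banner.jpg']
```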
The Requests Library
The first thing we need to do to scrape a web page is to download it. We can download pages using the Python Requests library. The Requests library will make a GET request to a web server, which will download the HTML content of a given web page for us. There are several different kinds of requests we can make using Requests, of which GET is just one.
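A minimal sketch of that download step, using example.com as a stand-in URL:

```python
import requests

# GET request: ask the server for the page's content.
response = requests.get("https://example.com")

print(response.status_code)  # 200 means the request succeeded
html = response.text         # the page's HTML, ready to hand to BeautifulSoup
print(html[:100])
```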
Selenium
Selenium is an automation tool for web browsers. It is mainly used to automate web applications for testing purposes, but it is certainly not limited to that.
Boring web-based administration tasks can (and should) also be automated, and Selenium can be used to scrape websites that use JavaScript calls to display their data.
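A minimal Selenium sketch, assuming Selenium 4 and a local Chrome installation; the URL is the JavaScript-rendered variant of the practice site used earlier:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Selenium 4 can locate a matching driver for the installed browser itself.
driver = webdriver.Chrome()
try:
    # This page fills in its content with JavaScript, so a plain GET request
    # would only see an empty shell; the real browser executes the JS for us.
    driver.get("https://quotes.toscrape.com/js/")
    for quote in driver.find_elements(By.CSS_SELECTOR, "span.text"):
        print(quote.text)
finally:
    driver.quit()
```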
Pros Of Scrapy
- Scrapy allows easy post-processing of any data you find. Web data is something of a mess! It's very unlikely that the data you find will be in the exact format you'd like: it might have extra line breaks, funky styling, extra commas in random places, or just be in all uppercase. Scrapy lets you handle these cases in a straightforward manner.
- You will often find that web pages simply blow up in your face while scraping: pages won't be found, servers will return errors, or you might have connectivity problems halfway through a large scrape. Scrapy allows you to handle errors gracefully and even has a built-in ability to resume a scrape from the last page it came across (see the sketch after this list).
- All of this you get for free.
- Some websites are going to be behind a login curtain. Scrapy has built-in form handling that you can set up before starting your scrape to log in to the websites.
- Scrapy can make multiple requests at the same time, which allows for much faster scraping runs. If you write a Python script from scratch that tries to do that, you'll probably find that things can go wrong in a million horrible ways. Scrapy has years of use in large organizations behind it, which helps prevent this.
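For instance, the resume-a-scrape behaviour mentioned above is exposed through Scrapy's job directory setting; a minimal sketch, where myspider and the directory path are placeholders:

```python
# settings.py (sketch): persist scheduler state and seen-request fingerprints
# on disk so an interrupted crawl can pick up where it stopped.
JOBDIR = "crawls/myspider-run1"

# The same setting can also be passed per run on the command line:
#   scrapy crawl myspider -s JOBDIR=crawls/myspider-run1
```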
Why Requests And BeautifulSoup?
If you know you're not going to need any of the above, or any scraping guidance in general, then Requests and BeautifulSoup are fantastic tools that offer lots of freedom. For one-off scripts that you don't plan on maintaining in the long run, they're probably the best solution.
When To Use Selenium?
Selenium is a web application/website automation testing framework that can also control the browser to navigate a website just like a human being. Selenium uses a web-driver package that can take over browser control and imitate user actions to trigger desired events. Web scraping is one of the tasks Selenium can handle, extracting useful data and information that might otherwise be unavailable.
The only exception I might see for using Selenium as a web scraping tool is if the website you're scraping uses JavaScript to fetch or display the data you need to scrape.
Combining Our Data Into A Pandas Dataframe
Now we can combine the data into a pandas DataFrame and analyze it. A DataFrame is an object for working with tabular data, which makes data analysis simple. Pandas is a Python library.
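A short sketch, assuming we scraped a list of dictionaries like the ones the spider above yields:

```python
import pandas as pd

# Hypothetical rows produced by one of the scrapers above.
rows = [
    {"text": "Quote one.", "author": "Author A"},
    {"text": "Quote two.", "author": "Author B"},
    {"text": "Quote three.", "author": "Author A"},
]

df = pd.DataFrame(rows)
print(df.head())                     # inspect the tabular data
print(df["author"].value_counts())  # simple analysis: quotes per author
```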
Conclusion
Now that we have seen several different ways of scraping data from websites and are ready to start working on potentially bigger projects, we may wonder if there are any legal implications of writing a piece of computer code that downloads information from the Internet.
The first and most important thing to be careful about when writing a web scraper is that it typically involves repeated requests to a website and access to a potentially large number of pages.
If over a short span of time we send too many such requests, we can prevent other "normal" users from accessing the site during that time, or even cause the server to run out of resources and crash.
It's important to recognize that web scraping can be illegal in certain circumstances. If the terms and conditions of the website we are scraping specifically prohibit downloading and copying its content, then we may get in trouble for scraping it.
In any case, if you are planning to start harvesting large amounts of data for research or commercial purposes, you should probably seek legal counsel first.
Soft Tech has expertise in data solutions. Please contact us if you have any data needs.