A Detailed Guide on Web Scraping using Python framework!


tag.

To configure the web driver to use the Chrome browser, we must set the path to chromedriver.

Next, we have to write code to open the URL and store the extracted details in lists.
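A minimal sketch of that step, assuming Selenium 4 and a locally downloaded chromedriver. The helper name open_page, the driver path, and the URL are placeholders rather than values from this tutorial; the three list names match the ones used in the extraction loop.

```python
# Lists that will hold the scraped values (names match the extraction loop).
products = []
prices = []
ratings = []


def open_page(url, driver_path):
    """Start Chrome through chromedriver and load `url` (hypothetical helper)."""
    # Imported lazily so this module can be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    driver = webdriver.Chrome(service=Service(executable_path=driver_path))
    driver.get(url)
    return driver


# Example call (both arguments are placeholders, replace them with your own):
# driver = open_page("https://example.com/products", "/path/to/chromedriver")
```

The returned driver object is what the later code reads page_source from.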

Now that we’ve written the code to open the URL, it’s time to extract the data from the website. As previously stated, the data we wish to extract is nested in <div> tags. I’ll therefore look for the div tags with those class names, extract the data, and store it in variables. Please see the code below:

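content = driver.page_source
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(content, 'html.parser')
for a in soup.find_all('a', href=True, attrs={'class': '_31qSD5'}):
    # Each matching <a> block holds one product's name, price, and rating
    name = a.find('div', attrs={'class': '_3wU53n'})
    price = a.find('div', attrs={'class': '_1vC4OE _2rQ-NK'})
    rating = a.find('div', attrs={'class': 'hGSR34 _2beYZw'})
    products.append(name.text)
    prices.append(price.text)
    ratings.append(rating.text)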
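One thing to watch for in a loop like this: find() returns None when a tag is missing (a product without a rating, for example), and calling .text on None raises an AttributeError. A small sketch of one way to guard against that follows. The helper name text_or_default and the sample values are made up for illustration, and a SimpleNamespace stands in for a parsed tag.

```python
from types import SimpleNamespace


def text_or_default(node, default="N/A"):
    """Return the stripped text of a parsed tag, or `default` if the tag is missing."""
    if node is None:
        return default
    return node.text.strip()


# Stand-ins for what a.find(...) can return: a tag with text, or None.
found = SimpleNamespace(text="  Laptop X  ")
missing = None

print(text_or_default(found))
print(text_or_default(missing))
```

Appending a default instead of skipping the value also keeps the three lists the same length, which matters when they are combined later.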

Step 5: Run the code and extract the data.

Run the code by using the below command:

	python web-scrap.py 

Step 6: Save the information in an appropriate format.

After you’ve extracted the data, save it in a suitable format. The right format depends on your needs. In this example, we’ll save the extracted data in CSV (Comma Separated Values) format. To accomplish this, I’ll add the following lines to my code:

df = pd.DataFrame({'Product Name':products,'Price':prices,'Rating':ratings})
df.to_csv('products.csv', index=False, encoding='utf-8')
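For comparison, here is a sketch of the same save step using only Python's standard csv module instead of pandas. The function name rows_to_csv and the sample values are invented for illustration. The length check is there because the three lists are expected to line up row by row; pandas likewise raises a ValueError when the lists differ in length.

```python
import csv
import io


def rows_to_csv(products, prices, ratings):
    """Write the three parallel lists as CSV text with a header row."""
    if not (len(products) == len(prices) == len(ratings)):
        raise ValueError("all three lists must have the same length")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Product Name", "Price", "Rating"])
    writer.writerows(zip(products, prices, ratings))
    return buffer.getvalue()


# Made-up sample values, only to show the shape of the output.
csv_text = rows_to_csv(["Laptop A", "Laptop B"], ["500", "650"], ["4.3", "4.1"])
print(csv_text)
```

To write a file instead of a string, the same writer can be pointed at open('products.csv', 'w', newline='', encoding='utf-8').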

Now run the entire program again. A file called “products.csv” is created, which contains the extracted data.

Selenium is frequently used to extract data from websites that rely heavily on JavaScript. However, running a large number of Selenium/headless Chrome instances at scale is difficult.

Web scraping/crawling using Scrapy

We can install Scrapy with pip. However, the Scrapy documentation strongly advises installing it in a dedicated virtual environment to avoid conflicts with your system packages.

I’m using virtualenv and virtualenvwrapper:

mkvirtualenv scrapy_env
pip install Scrapy

You can now create a new Scrapy project with this command:

scrapy startproject web_scrap

This creates all the project’s boilerplate files:

├── web_scrap
│   ├── __init__.py
│   ├── __pycache__
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── __pycache__
└── scrapy.cfg

The folders and files are explained below.

items.py is a model for the extracted data. You can define your own model (for example, a Product) that inherits from the Scrapy Item class.

pipelines.py: we use pipelines in Scrapy to process the extracted data, clean the HTML, validate the data, and save it to a database or export it to a custom format.

middlewares.py: middleware is used to change the request/response life cycle. For example, you could write a middleware that rotates user-agents, or one that calls an API instead of executing the requests itself.

scrapy.cfg is a configuration file that allows you to adjust some settings.

The spider classes live in the /spiders folder. Spiders are Scrapy classes that define how to scrape a website, including which links to follow and how to collect data from those links.
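The user-agent rotation mentioned for middlewares.py can be sketched as follows. This assumes the long-standing downloader middleware hook process_request(request, spider); the class name and the user-agent strings are hypothetical, and a SimpleNamespace stands in for a Scrapy Request so the sketch runs without Scrapy installed.

```python
import random
from types import SimpleNamespace

# Hypothetical pool of user-agent strings; use real, current ones in practice.
USER_AGENTS = [
    "ExampleBrowser/1.0",
    "ExampleBrowser/2.0",
    "ExampleBrowser/3.0",
]


class RotateUserAgentMiddleware:
    """Downloader-middleware sketch: set a random User-Agent on each request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Returning None tells Scrapy to keep processing the request normally.
        return None


# Stand-in for a Scrapy Request: only the `headers` mapping is needed here.
fake_request = SimpleNamespace(headers={})
RotateUserAgentMiddleware().process_request(fake_request, spider=None)
print(fake_request.headers["User-Agent"])
```

In a real project, a class like this would be enabled through the DOWNLOADER_MIDDLEWARES setting in settings.py, which maps the class path to a priority number.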

In this example, we will extract the product name, image, price, and description from a product page on the mamaearth website.

Scrapy Shell

Scrapy includes a built-in shell that lets you debug scraping code in real time. You can use it to quickly test your XPath expressions and CSS selectors. It’s a fantastic tool for writing web scrapers, and I use it all the time!

We can configure scrapy Shell to use a different console than the usual Python console, such as IPython. You’ll receive auto-completion and other useful features like colorized output.

To use IPython in the Scrapy shell, add the following line under the [settings] section of your scrapy.cfg file:

shell = ipython

Once IPython is configured, we can start using the Scrapy shell:

$ scrapy shell --nolog

Start fetching the URL simply by using the command below:

fetch('https://mamaearth.in/product/mamaearth-onion-shampoo-for-hair-growth-hair-fall-control-with-onion-oil-plant-keratin-250-ml')

The robots.txt file will be fetched first.

[scrapy.core.engine] DEBUG: Crawled (404) <GET https://mamaearth.in/product/robots.txt> (referer: None)

Because there is no robots.txt in this scenario, we get a 404 HTTP code. If a robots.txt file exists, Scrapy follows its rules by default.

This behavior can be disabled by changing the following setting in settings.py:

ROBOTSTXT_OBEY = False
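To illustrate what obeying robots.txt means in practice, here is a small sketch using Python's standard urllib.robotparser module. It is not Scrapy's own implementation, and the robots.txt content, the bot name, and the example.com URLs are all made up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not taken from any real site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A product page is not covered by any Disallow rule, the checkout path is.
allowed = parser.can_fetch("my-bot", "https://example.com/product/shampoo")
blocked = parser.can_fetch("my-bot", "https://example.com/checkout/cart")
print(allowed, blocked)
```

A crawler that obeys robots.txt simply skips every URL for which this kind of check comes back False.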

Data extraction:

Scrapy doesn’t run JavaScript by default, so if the website you’re scraping is built with a frontend framework like Angular or React.js, you might have difficulty getting the data you need.

Let’s use an XPath expression to get the product price:

We’ll choose the first span that is a child of the div with the class Flex-sc-1lsr9yp-0 fiXUrs PriceRevamp-sc-13vrskg-1 jjWoWj.

response.xpath("//div[@class='Flex-sc-1lsr9yp-0 fiXUrs PriceRevamp-sc-13vrskg-1 jjWoWj']/span/text()").get()
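To see how an expression of this shape selects nodes, the sketch below runs a similar path against a tiny, made-up page using Python's standard xml.etree.ElementTree. This is only an illustration: ElementTree supports a subset of XPath and needs well-formed markup, while Scrapy's selectors are more forgiving. The markup, class name, and values are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup; the class name is invented for the example.
PAGE = """
<html>
  <body>
    <section><h1>Onion Shampoo</h1></section>
    <div class="price-box"><span>349</span><span>399</span></div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# First <span> directly under the <div> whose class attribute is exactly "price-box".
price = root.find(".//div[@class='price-box']/span").text
title = root.find(".//section/h1").text
print(title, price)
```

Note that the class comparison is an exact string match, which is why the long, generated class name in the Scrapy expression has to be copied in full.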

Creating a Scrapy Spider class

Spiders are Scrapy classes that define your crawling behavior (which links/URLs should be scraped) and your scraping behavior.

A spider class scrapes a website in the following steps:

It starts from the URLs listed in start_urls and calls the start_requests() method to request them. If you need to change the HTTP verb or add parameters to the request, you can override this method.

For each URL, it creates a Request object and sends the response to the callback function parse(). The parse() method then extracts the data (in our example, the product price, image, description, and title) and returns a dictionary, an Item object, a Request, or an iterable of these.

You can return scraped data as a basic Python dictionary, but it’s better to use the Scrapy Item class.

Item class: a simple container for our scraped data. Scrapy uses the fields of this item for a variety of purposes, such as exporting the data to multiple formats (JSON / CSV…), the item pipeline, and so on.

Let’s write the Python code for the Product class:

import scrapy


class Product(scrapy.Item):
    price = scrapy.Field()
    product_url = scrapy.Field()
    title = scrapy.Field()
    img_url = scrapy.Field()
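Items like this are what the item pipeline receives. As a hedged sketch of the cleaning step a pipeline can perform, the class below converts a raw price string into a number. It assumes the usual process_item(item, spider) hook; the class name and the sample values are hypothetical, and a plain dict stands in for the Product item so the sketch runs without Scrapy.

```python
import re


class CleanPricePipeline:
    """Item-pipeline sketch: turn a raw price string into a float."""

    def process_item(self, item, spider):
        raw = item.get("price")
        if raw is not None:
            # Keep digits and the decimal point, dropping currency symbols and commas.
            digits = re.sub(r"[^0-9.]", "", raw)
            item["price"] = float(digits) if digits else None
        return item


# A plain dict stands in for the Product item; the values are made up.
sample = {"title": "Onion Shampoo", "price": "₹ 1,349.00"}
cleaned = CleanPricePipeline().process_item(sample, spider=None)
print(cleaned["price"])
```

In a real project, a pipeline is switched on through the ITEM_PIPELINES setting in settings.py. The regular expression here is deliberately simple and would need adjusting for price formats that contain other dots, such as "Rs. 349".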

We can now create a spider using the command-line helper:

scrapy genspider myspider mydomain.com

Scrapy provides several kinds of spiders that handle the most frequent web scraping problems:

Spider takes a list of start URLs and uses a parse function to scrape each one. This is the class we’re going to use.

CrawlSpider follows links that are determined by rules.

SitemapSpider extracts the URLs defined in a sitemap.

There are two required attributes in the EcomSpider class:

name, which is the name of our spider (you can run it from inside the project with scrapy crawl spider_name)

start_urls, which is the list of starting URLs.

The allowed_domains attribute is important when you use a CrawlSpider, which could otherwise follow links to other domains.

import scrapy
from product_scraper.items import Product


class EcomSpider(scrapy.Spider):
    name = 'ecom_spider'
    allowed_domains = ['mamaearth.in']
    start_urls = ['https://mamaearth.in/product/mamaearth-onion-shampoo-for-hair-growth-hair-fall-control-with-onion-oil-plant-keratin-250-ml/']

    def parse(self, response):
        # Fill one Product item from the product page
        item = Product()
        item['product_url'] = response.url
        item['price'] = response.xpath("//div[@class='Flex-sc-1lsr9yp-0 fiXUrs PriceRevamp-sc-13vrskg-1 jjWoWj']/span/text()").get()
        item['title'] = response.xpath('//section[1]//h1/text()').get()
        # get() returns the first match, or None if nothing matches
        item['img_url'] = response.xpath("//div[@class='product-slider']//img/@src").get()
        return item
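The effect of allowed_domains can be pictured with a small standard-library sketch: a URL passes if its host is one of the listed domains or a subdomain of one. This is an approximation for illustration, not Scrapy's actual filtering code; the function name is_allowed and the shop subdomain are hypothetical.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


allowed = ["mamaearth.in"]
print(is_allowed("https://mamaearth.in/product/", allowed))
print(is_allowed("https://shop.mamaearth.in/cart", allowed))
print(is_allowed("https://example.com/mamaearth.in", allowed))
```

The last URL is rejected because only the host is compared, not the path.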

To export the output to JSON, run the spider as follows (you could also export to CSV):

scrapy runspider ecom_spider.py -o product.json

The extracted data is written to product.json.

Scraping many pages

Now that we know how to scrape a single page, it’s time to learn how to scrape many pages, such as the full product catalog. As we saw before, Scrapy provides several kinds of spiders.

If you want to scrape a full product catalog, a sitemap should be the first thing you look at. Sitemaps are created specifically for this purpose: to show web crawlers how the website is organized.

A sitemap.xml file can usually be found at base_url/sitemap.xml. Parsing a sitemap can be challenging, and Scrapy helps you with it.

<url>
  <loc>https://mamaearth.in/product/</loc>
  <lastmod>2019-10-17T11:22:16+06:00</lastmod>
</url>
<url>
  <loc>https://mamaearth.in/product/mamaearth-onion-shampoo-for-hair-growth-hair-fall-control-with-onion-oil-plant-keratin-250-ml/</loc>
  <lastmod>2019-10-17T11:22:16+06:00</lastmod>
</url>

Fortunately, we can restrict the spider to URLs that match a pattern. Here, we only want URLs with /product/ in them, which is the pattern the product pages on this site use:

from scrapy.spiders import SitemapSpider


class ProductSitemapSpider(SitemapSpider):
    name = "sitemap_spider"
    sitemap_urls = ['https://mamaearth.in/product/sitemap.xml']
    # Only URLs matching this pattern are sent to parse_product
    sitemap_rules = [
        ('/product/', 'parse_product'),
    ]

    def parse_product(self, response):
        # ... scrape product ...
        pass

To scrape all the products and export the results to a CSV file, run this spider:

scrapy runspider sitemap_spider.py -o output.csv
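To make the idea behind sitemap filtering concrete, here is a standard-library sketch that reads the loc entries from a sitemap and keeps only those matching a pattern. It is a simplified illustration, not Scrapy's implementation; the sitemap content, the example.com URLs, and the function name sitemap_urls are invented.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sitemap; the example.com URLs are placeholders.
SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/products/shampoo</loc></url>
  <url><loc>https://example.com/products/face-wash</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>
"""

# Sitemap elements live in this XML namespace, so lookups must include it.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text, pattern):
    """Return every <loc> URL in the sitemap that matches the regex `pattern`."""
    root = ET.fromstring(xml_text)
    locs = [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]
    return [url for url in locs if re.search(pattern, url)]


product_urls = sitemap_urls(SITEMAP, r"/products/")
print(product_urls)
```

The XML namespace is one of the details that makes hand-written sitemap parsing easy to get wrong, and it is part of what the sitemap spider handles for you.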

What if there was no sitemap on the website? Scrapy has a solution!

Starting from the start_urls list, the CrawlSpider will crawl the target website. For each URL, it then extracts all the links that match a set of rules. It’s simple in our case because all products share the URL pattern /product/product-title, so all we have to do is filter for these URLs.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from product_scraper.productloader import ProductLoader
from product_scraper.items import Product


class MySpider(CrawlSpider):
    name = 'crawl_spider'
    allowed_domains = ['mamaearth.in']
    start_urls = ['https://mamaearth.in/product/']
    # Follow only links whose URL matches the product pattern
    rules = (
        Rule(LinkExtractor(allow=('/product/', )), callback='parse_product'),
    )

    def parse_product(self, response):
        # .. parse product
        pass
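What a link-extraction rule does can be approximated with the standard library: collect every href on a page, resolve it against the page URL, and keep the links that match a pattern. This is a rough sketch rather than Scrapy's LinkExtractor; the class and function names, the HTML snippet, and the example.com URLs are made up.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attribute of every <a> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page they were found on.
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url, allow):
    """Return the links in `html` whose absolute URL matches the regex `allow`."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return [link for link in collector.links if re.search(allow, link)]


# Invented page fragment with one product link and one unrelated link.
HTML = '<a href="/products/shampoo">Shampoo</a> <a href="/blog/tips">Tips</a>'
links = extract_links(HTML, "https://example.com/", r"/products/")
print(links)
```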

As you can see, these built-in spiders are simple to use. Building the same thing from the ground up would have been far more difficult.

Scrapy takes care of the crawling logic for you, such as adding new URLs to a queue, keeping track of already parsed URLs, handling concurrent requests, and so on.
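The bookkeeping described above (a queue of URLs to visit plus a record of URLs already seen) can be shown with a toy breadth-first crawl over an in-memory link graph. This is only a sketch of the idea, not how Scrapy's scheduler is implemented; the graph and the function name crawl are hypothetical, and no network requests are made.

```python
from collections import deque

# Hypothetical site: each URL maps to the links found on that page.
LINK_GRAPH = {
    "/": ["/products/a", "/products/b"],
    "/products/a": ["/", "/products/b"],
    "/products/b": ["/products/a", "/products/c"],
    "/products/c": [],
}


def crawl(start_url, get_links):
    """Visit every reachable URL once, breadth-first, and return the visit order."""
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            # Skip URLs that are already queued or visited to avoid loops.
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


order = crawl("/", lambda url: LINK_GRAPH.get(url, []))
print(order)
```

Without the seen set, the links that point back to earlier pages would keep the crawl going forever.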

About Myself

Hello, my name is Lavanya, and I’m from Chennai. I am a passionate writer and enthusiastic content maker. The most intractable problems always thrill me. I am currently pursuing my B. Tech in Chemical Engineering and have a strong interest in the fields of data engineering, machine learning, data science, and artificial intelligence, and I am constantly looking for ways to integrate these fields with other disciplines such as science and chemistry to further my research goals.

Linkedin URL: https://www.linkedin.com/in/lavanya-srinivas-949b5a16a/

Conclusion

I hope you found this blog post interesting! You should now be familiar with the Selenium API in Python and Beautiful Soup, as well as web crawling and scraping with Scrapy. In this article, we looked at how to scrape the web with Scrapy and how it can help you solve some of the most common web scraping problems.

ENDNOTES

1) If you have to perform repetitive operations like filling out forms or reviewing information behind a login form on a website that doesn’t have an API, Selenium may be a good choice.

2) If you’ve been doing web scraping more “manually” with tools like BeautifulSoup / Requests, it’s easy to see how Scrapy can help you save time and build more maintainable scrapers.

3) If you have any thoughts on scraping the web with Python, please share them in the comment section. To learn more about web scraping, read the upcoming article! Thank you.

Source: https://www.analyticsvidhya.com/blog/2021/10/a-detailed-guide-on-web-scraping-using-python-framework/
