[Part 1] – Understanding/Evaluating Web Scraping, Crawling, and Automation with NodeJS Libraries
This article is part of Open-Source Bolster Engineering and Research, which aims to evaluate the performance and workings of the various NodeJS libraries available for web scraping. In Part 1 of this blog, we will see how different NodeJS libraries can be used to implement web scraping. In subsequent blog posts, we will analyze various NodeJS libraries.
You can learn before beginning this article from the following link.
So, let’s begin here!
Web scraping is about extracting information from web pages. A website can consist of various types of information, including text, images, audio, videos, scripts, and forms. Before beginning, I would like to clarify the difference between crawling and scraping.
Crawling vs Scraping in Web
When we want to search for information, crawling is the way. When we want to extract information, scraping is the way. Web crawling therefore means moving through links or URLs, while web scraping means extracting information from a particular page or website.
Consider the following example: you want to find a person’s contact information from a website. Crawling can help find a specific page, like a contact page or about us page, and scraping can help get the contact information of the person.
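The contact-page example can be sketched in plain NodeJS without any library. This is only an illustration under stated assumptions: the `pages` object is a hypothetical in-memory stand-in for a real website, the helper names (`extractLinks`, `crawlFor`, `scrapeEmails`) are made up for this sketch, and the regular expressions are simplifications that a real project would replace with a proper HTML parser.

```javascript
// Hypothetical in-memory "website": URL -> HTML. A real crawler would fetch these pages.
const pages = {
  'https://example.com/': '<a href="/about">About us</a> <a href="/contact">Contact</a>',
  'https://example.com/about': '<p>We build things.</p> <a href="/">Home</a>',
  'https://example.com/contact': '<p>Reach Jane at jane@example.com</p>',
};

// Crawling helper: collect the absolute URLs of all links on a page.
function extractLinks(html, baseUrl) {
  const links = [];
  const hrefPattern = /<a\s[^>]*href="([^"]*)"/gi;
  for (const match of html.matchAll(hrefPattern)) {
    links.push(new URL(match[1], baseUrl).href);
  }
  return links;
}

// Crawling: breadth-first walk from startUrl until a URL satisfying isTarget is found.
function crawlFor(startUrl, isTarget, site) {
  const queue = [startUrl];
  const seen = new Set(queue);
  while (queue.length > 0) {
    const url = queue.shift();
    if (isTarget(url)) return url;
    const html = site[url];
    if (html === undefined) continue; // page not available in this sketch
    for (const link of extractLinks(html, url)) {
      if (!seen.has(link)) {
        seen.add(link);
        queue.push(link);
      }
    }
  }
  return null;
}

// Scraping: pull email addresses out of a single page (simplified pattern).
function scrapeEmails(html) {
  return html.match(/[\w.+-]+@[\w-]+(\.[\w-]+)+/g) || [];
}

const contactUrl = crawlFor('https://example.com/', (u) => u.includes('contact'), pages);
const emails = contactUrl ? scrapeEmails(pages[contactUrl]) : [];
console.log(contactUrl, emails);
```

Here the crawl step only decides which page to visit, and the scrape step only reads data from the page that was found, which mirrors the distinction drawn above.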
Have you heard of Web Automation?
When reading about web crawling and scraping, we often encounter the term “web automation”. Once scraping is carried out, we can automate tasks like form submission, data extraction, testing, and validation. We will discuss some web automation techniques in the upcoming articles.
We will use various NodeJS libraries to demonstrate a quick implementation of scraping. In this article, we will scrape the content of the title tag using each of them.
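As a baseline for what "scraping the title tag" means, here is a minimal sketch that uses no scraping library at all. It assumes Node.js 18 or later (for the global `fetch`), and the function names `extractTitle` and `fetchTitle` are hypothetical. Matching HTML with a regular expression is a deliberate simplification; the libraries discussed in this series parse or render the page properly.

```javascript
// Extract the text of the first <title> element from an HTML string.
// Regex matching is a simplification; real HTML is better handled by a parser.
function extractTitle(html) {
  const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  return match ? match[1].trim() : null;
}

// Fetch a page and return its title. Assumes Node.js 18+ (global fetch).
async function fetchTitle(url) {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return extractTitle(await response.text());
}

// Example usage (needs network access; the URL is a placeholder):
// fetchTitle('https://example.com').then(console.log).catch(console.error);
```

A plain HTTP request like this only sees the HTML the server returns, so it may miss titles that are set by client-side scripts. That is one reason browser-driving libraries are worth evaluating.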
As per its official guide, Playwright can be used either as part of the Playwright Test test runner or as a Playwright Library.
Playwright Test was created specifically to accommodate the needs of end-to-end testing. It does everything you would expect from a regular test runner and more. Playwright Test allows you to:
- Run tests across all browsers.
- Execute tests in parallel.
- Enjoy context isolation out of the box.
- Capture videos, screenshots, and other artifacts on failure.
- Integrate your POMs (Page Object Models) as extensible fixtures.
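For scraping rather than testing, the Playwright Library mode is the relevant one. The following is a minimal sketch, not necessarily the code the authors used: the function name `scrapeTitleWithPlaywright` is hypothetical, and the browser type is passed in as a parameter so the function itself has no hard dependency. It relies only on the `launch`, `newPage`, `goto`, `title`, and `close` calls from Playwright's API.

```javascript
// Open a page in a browser and return the content of its <title> tag.
// `browserType` is expected to be a Playwright browser type such as `chromium`.
async function scrapeTitleWithPlaywright(browserType, url) {
  const browser = await browserType.launch(); // headless by default
  try {
    const page = await browser.newPage();
    await page.goto(url);
    return await page.title();
  } finally {
    await browser.close(); // always release the browser, even on errors
  }
}

// Example usage (assumes `npm install playwright` and network access;
// the URL is a placeholder):
// const { chromium } = require('playwright');
// scrapeTitleWithPlaywright(chromium, 'https://example.com').then(console.log);
```

Passing `firefox` or `webkit` instead of `chromium` should run the same function against another browser engine.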
As per the official guide, Puppeteer is a Node library that provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default, but can be configured to run full (non-headless) Chrome or Chromium.
What can I do?
Most things that you can do manually in the browser can be done using Puppeteer!
Here are a few examples to get you started:
- Generate screenshots and PDFs of pages.
- Crawl a SPA (Single-Page Application) and generate pre-rendered content (i.e., “SSR” (Server-Side Rendering)).
- Automate form submission, UI testing, keyboard input, etc.
- Capture a timeline trace of your site to help diagnose performance issues.
- Test Chrome Extensions.
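As a rough illustration of the first item in the list, combined with the title extraction this article focuses on, here is a minimal Puppeteer sketch. The function name `captureWithPuppeteer` and the file name in the usage comment are hypothetical, and the `puppeteer` module is passed in as a parameter rather than imported, so treat this as a sketch under those assumptions.

```javascript
// Load a page, return its <title>, and optionally save a screenshot.
// `puppeteer` is expected to be the object exported by the `puppeteer` package.
async function captureWithPuppeteer(puppeteer, url, screenshotPath) {
  const browser = await puppeteer.launch(); // headless by default
  try {
    const page = await browser.newPage();
    await page.goto(url);
    const title = await page.title();
    if (screenshotPath) {
      // Writes an image of the rendered page to the given file path.
      await page.screenshot({ path: screenshotPath });
    }
    return title;
  } finally {
    await browser.close(); // always release the browser, even on errors
  }
}

// Example usage (assumes `npm install puppeteer` and network access;
// the URL and file name are placeholders):
// const puppeteer = require('puppeteer');
// captureWithPuppeteer(puppeteer, 'https://example.com', 'example.png').then(console.log);
```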
Other Scraping Libraries
During our quick run, we evaluated libraries that provide APIs to extract the title from the document at the requested URL. There are other libraries that you can try for scraping, such as Osmosis and X-ray, which are more equipped with testing components. There are also popular and advanced automation tools like Cypress and Selenium. The terms web crawling, scraping, and automation are often used interchangeably, but functionally they differ considerably.
There are also various paid scraping options. These provide dashboards and tools to scrape websites. A simple search will turn up multiple options.
Selection and comparison of Scraping Libraries
We will compare these libraries in upcoming articles. When considering using any scraping library it is important to consider the following points:
5) Active Community