Top 5 JavaScript web scraping libraries and frameworks [2021 updated]
Nowadays, web scraping is essential for countless professions, including researchers, debuggers, and testers. Here are the top JavaScript web scraping libraries and frameworks. Web scraping is not their only use case: you can also use these tools to automate tasks such as testing.
1. Puppeteer
Puppeteer is a Node library that provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default but can be configured to run full (non-headless) Chrome or Chromium.
Puppeteer is simply the best JavaScript web scraping framework I have used so far. It is on par with the popular Selenium, a browser automation tool with bindings for Python, Java, and other languages.
Main features of Puppeteer
- Headless browser API
- Generate screenshots and PDFs of pages
- SPA (Single-Page Application) and SSR (Server-Side Rendering) crawl support
- Measures rendering and load times with Chrome's performance analysis tools
- Chrome extension testing support
- Automates most user interactions with a website (clicks, form submissions, scrolling, etc.)
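To make a few of these features concrete, here is a minimal sketch that opens a page headlessly, reads its title, and saves a screenshot and a PDF. The function name `capturePage` and the `outputBase` parameter are my own, not part of Puppeteer, and the library is passed in as an argument (for example `require('puppeteer')`) so the snippet does not assume a particular installed version.

```javascript
// Sketch: open a page headlessly, read its title, save a screenshot and a PDF.
// `puppeteer` is supplied by the caller, e.g. require('puppeteer').
// `capturePage` and `outputBase` are illustrative names, not Puppeteer APIs.
async function capturePage(puppeteer, url, outputBase) {
  // launch() starts a headless browser by default.
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Wait for network activity to settle so client-rendered (SPA) content is present.
    await page.goto(url, { waitUntil: 'networkidle2' });
    const title = await page.title();
    await page.screenshot({ path: `${outputBase}.png`, fullPage: true });
    await page.pdf({ path: `${outputBase}.pdf`, format: 'A4' });
    return title;
  } finally {
    // Always close the browser, even if navigation or capture fails.
    await browser.close();
  }
}
```

With Puppeteer installed, a call could look like `capturePage(require('puppeteer'), 'https://example.com', 'example')`, which would write `example.png` and `example.pdf` next to the script.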
Puppeteer Pros | Puppeteer Cons |
✅ With its full-featured API, it covers a majority of use cases | ⛔ Only available for Chrome/Chromium browser |
✅ The best option for scraping JavaScript-heavy websites on Chrome | ⛔ Steep learning curve |
GitHub stars: 70k+ | npm downloads: 1,200,000+ (weekly)
2. Cheerio
Cheerio is a library that parses raw HTML and XML documents and allows you to use the syntax of jQuery while working with the downloaded data. You can write filter functions to fine-tune which data you want from your selectors.
Note – Cheerio parses markup and provides an API for traversing/manipulating the resulting data structure. It does not interpret the result as a web browser does. Specifically, it does not produce a visual rendering, apply CSS, load external resources, or execute JavaScript. If your use case requires any of this functionality, consider a browser automation tool such as Puppeteer instead.
Main features of Cheerio
- Cheerio works with a very simple, consistent DOM model
- Cheerio implements a subset of core jQuery
- Can parse nearly any HTML or XML document
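As a small illustration of the jQuery-style selectors and filter functions mentioned above, the sketch below pulls the non-empty `h2` headings out of an HTML string. The function name `extractHeadings` and the choice of `h2` are just assumptions for the example, and the Cheerio module is passed in (for example `require('cheerio')`) rather than imported at the top.

```javascript
// Sketch: pick non-empty headings out of raw HTML with jQuery-style selectors.
// `cheerio` is supplied by the caller, e.g. require('cheerio').
// `extractHeadings` is an illustrative name, not part of Cheerio.
function extractHeadings(cheerio, html) {
  // load() parses the markup and returns a jQuery-like query function.
  const $ = cheerio.load(html);
  return $('h2')
    // Filter function: keep only headings that contain some text.
    .filter((i, el) => $(el).text().trim().length > 0)
    // Map each remaining element to its trimmed text.
    .map((i, el) => $(el).text().trim())
    // get() converts the Cheerio collection into a plain array.
    .get();
}
```

With Cheerio installed, `extractHeadings(require('cheerio'), '<h2>First</h2><h2> </h2><h2>Second</h2>')` returns `['First', 'Second']`. Nothing here runs the page's scripts, so content rendered client-side will not show up.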
Cheerio Pros | Cheerio Cons |
✅ Very fast (preliminary end-to-end benchmarks suggest it is about 8x faster than JSDOM) | ⛔ Not suited to SPAs, since it does not execute JavaScript |
✅ Parsing, manipulating, and rendering documents is very efficient | ⛔ Does not produce a visual rendering |
GitHub stars: 23k+ | npm downloads: 3,500,000+ (weekly)
3. Apify SDK
Apify SDK simplifies the development of web crawlers, scrapers, data extractors, and web automation jobs. It provides tools to manage and automatically scale a pool of headless Chrome / Puppeteer instances, maintain queues of URLs to crawl, store crawling results on a local filesystem or in the cloud, rotate proxies, and much more.
Main features of Apify SDK
- Perform a deep crawl of an entire website using a persistent queue of URLs.
- Run your scraping code on a list of 100k URLs in a CSV file, without losing any data when your code crashes.
- Rotate proxies to hide your browser origin and keep user-like sessions.
- Disable browser fingerprinting protections used by websites.
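The idea behind a persistent URL queue is easier to see in code. The sketch below is not the Apify SDK's API or implementation. It is a hypothetical `PersistentUrlQueue` class built only on Node's `fs` module, showing one way a queue can survive a crash: the state is written to disk on every change, and a URL only leaves the pending list after its page has been processed.

```javascript
const fs = require('fs');

// Hypothetical helper, NOT part of the Apify SDK: a URL queue persisted as JSON.
class PersistentUrlQueue {
  constructor(filePath) {
    this.filePath = filePath;
    // Resume from the previous run if a state file already exists.
    this.state = fs.existsSync(filePath)
      ? JSON.parse(fs.readFileSync(filePath, 'utf8'))
      : { pending: [], handled: [] };
  }

  // Add a URL unless it is already queued or already handled (deduplication).
  add(url) {
    if (this.state.pending.includes(url) || this.state.handled.includes(url)) {
      return false;
    }
    this.state.pending.push(url);
    this.save();
    return true;
  }

  // Peek at the next URL without removing it, so a crash mid-request loses nothing.
  next() {
    return this.state.pending.length > 0 ? this.state.pending[0] : null;
  }

  // Move a URL to the handled list only after its page was processed.
  markHandled(url) {
    this.state.pending = this.state.pending.filter((u) => u !== url);
    if (!this.state.handled.includes(url)) this.state.handled.push(url);
    this.save();
  }

  // Write the whole state to disk after every change.
  save() {
    fs.writeFileSync(this.filePath, JSON.stringify(this.state));
  }
}
```

A crawler loop would call `next()`, fetch the page, `add()` any discovered links, and then `markHandled()`. If the process dies in between, the next run picks up the same URL again. A real SDK adds scaling, retries, and proxy handling on top of this basic idea.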
Apify SDK Pros | Apify SDK Cons |
✅ Built-in support for Puppeteer and Cheerio | ⛔ Lack of community support |
✅ Supports any type of website out of the box | ⛔ Steep learning curve |
GitHub stars: 2.5k+ | npm downloads: 5,000+ (weekly)
4. SimpleCrawler
Simplecrawler is designed to provide a basic, flexible, and robust API for crawling websites. It was written to archive, analyze, and search some very large websites and has happily chewed through hundreds of thousands of pages and written tens of gigabytes to disk without issue.
Main features of SimpleCrawler
- Provides a very simple event-driven API using EventEmitter
- Extremely configurable base for writing your own crawler
- Has a flexible queue system which can be frozen to disk and defrosted
- Provides basic statistics on network performance
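Here is a rough sketch of what the event-driven API looks like in practice. It collects the URL and size of every fetched resource and wraps the crawl in a Promise, since the library itself is callback and event based. The function name `crawlSite` and the specific setting values are my own choices for illustration, and the crawler constructor is passed in (for example `require('simplecrawler')`).

```javascript
// Sketch: crawl a site and collect the URL and size of every fetched resource.
// `Crawler` is supplied by the caller, e.g. require('simplecrawler').
// `crawlSite` is an illustrative name; the setting values are arbitrary examples.
function crawlSite(Crawler, startUrl) {
  return new Promise((resolve) => {
    const crawler = new Crawler(startUrl);
    const pages = [];

    // Configuration is done through plain properties on the crawler.
    crawler.interval = 500;      // milliseconds between requests
    crawler.maxConcurrency = 2;  // number of parallel requests
    crawler.maxDepth = 2;        // limit how far from the start URL links are followed

    // Fired once per successfully downloaded resource.
    crawler.on('fetchcomplete', (queueItem, responseBuffer) => {
      pages.push({ url: queueItem.url, bytes: responseBuffer.length });
    });

    // Fired when the queue is exhausted; resolving here gives callers a Promise.
    crawler.on('complete', () => resolve(pages));

    crawler.start();
  });
}
```

With simplecrawler installed, `crawlSite(require('simplecrawler'), 'https://example.com/').then(console.log)` would print the collected list once the crawl finishes.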
SimpleCrawler Pros | SimpleCrawler Cons |
✅ Very easy to start | ⛔ No JavaScript promise support |
✅ Highly configurable | ⛔ Only suitable for small projects |
GitHub stars: 2k+ | npm downloads: 10,000+ (weekly)
5. Playwright
Playwright is a Node library to automate multiple browsers with a single API. It enables cross-browser web automation that is ever-green, capable, reliable, and fast. Playwright was created to improve automated UI testing by eliminating flakiness, improving the speed of execution, and offering insights into the browser operation.
Playwright is mostly used to test web-based applications rather than for web scraping.
Main features of Playwright
- Single API to automate Chromium, Firefox and WebKit.
- API available for JavaScript/TypeScript, Python, C# (.NET), and Java, plus a community-maintained Go port.
- Intercept network activity.
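The single-API and network interception points can be sketched together. The example below loads the same URL in all three browser engines and aborts image requests along the way. The function name `titlesAcrossBrowsers` is my own, blocking images is just one example of interception, and the Playwright module is passed in (for example `require('playwright')`).

```javascript
// Sketch: load the same URL in Chromium, Firefox and WebKit, blocking images.
// `playwright` is supplied by the caller, e.g. require('playwright').
// `titlesAcrossBrowsers` is an illustrative name, not part of Playwright.
async function titlesAcrossBrowsers(playwright, url) {
  const titles = {};
  for (const name of ['chromium', 'firefox', 'webkit']) {
    // Each browser type exposes the same launch() API.
    const browser = await playwright[name].launch();
    try {
      const page = await browser.newPage();
      // Intercept network activity: abort image requests, let the rest through.
      await page.route('**/*', (route) =>
        route.request().resourceType() === 'image'
          ? route.abort()
          : route.continue()
      );
      await page.goto(url);
      titles[name] = await page.title();
    } finally {
      // Close each browser before moving on to the next engine.
      await browser.close();
    }
  }
  return titles;
}
```

With Playwright and its browsers installed, `titlesAcrossBrowsers(require('playwright'), 'https://example.com')` resolves to an object keyed by browser name.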
Playwright Pros | Playwright Cons |
✅ Cross-browser support | ⛔ Relies on patched builds of WebKit and Firefox (only the debugging protocols are patched, not the actual rendering engines) |
✅ Detailed documentation |
GitHub stars: 20k+ | npm downloads: 98,000+ (weekly)
Conclusion
In my opinion, Puppeteer is the best overall choice. If you need to get started quickly, I would recommend Cheerio. Each tool has its own advantages and disadvantages, so pick the one that fits your needs.