2024-09-30 Web Development

How to Download Images from Google Images Using Puppeteer and Node.js

By O Wolfson

In this article, we'll build a script that automates downloading images from Google Images using Puppeteer, Node.js, and a few helper functions. Puppeteer is a Node.js library that provides a high-level API for controlling Chrome or Chromium over the DevTools Protocol. It's commonly used for web scraping, automating web pages, and running headless browsers.

Prerequisites

To follow along, you should have Node.js and npm installed on your system. You also need to install Puppeteer and Axios by running the following command:

bash
npm install puppeteer axios

Project Structure

Here's a quick overview of the files involved in this project:

index.js: The main script that handles the image downloading process.
search-google-images.js: A helper module to perform the Google Images search based on user input.

The Main Script (`index.js`)

This script launches a Puppeteer browser instance, opens the Google Images search results page, collects the outbound links (any link whose URL contains "google" is discarded), visits each linked page, and downloads every image that is at least 800×600 pixels.

Step-by-Step Breakdown

Import Necessary Modules:

javascript
const puppeteer = require("puppeteer");
const fs = require("node:fs");
const path = require("node:path");
const axios = require("axios");
const searchGoogleImages = require("./search-google-images");

Function to Download Images:

The downloadImage function uses Axios to stream and save images to the local file system.

javascript
const { pipeline } = require("node:stream/promises");

const downloadImage = async (url, filepath) => {
  // Request the image first so a failed request does not leave an empty file behind.
  const response = await axios({
    url,
    method: "GET",
    responseType: "stream",
  });
  // pipeline resolves once the file is fully written and rejects if either
  // the download stream or the file stream fails.
  await pipeline(response.data, fs.createWriteStream(filepath));
};
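
To try the stream-to-file step without making a network request, here is a minimal sketch that swaps the Axios response body for an in-memory stream. The names `saveStream`, `fakeBody`, and `demoDir` are made up for this illustration and are not part of the script. It uses `pipeline` from `node:stream/promises`, which resolves once the file is fully written and rejects if either stream fails.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");
const { Readable } = require("node:stream");
const { pipeline } = require("node:stream/promises");

// Hypothetical helper: writes any readable stream to disk.
const saveStream = async (readable, filepath) => {
  await pipeline(readable, fs.createWriteStream(filepath));
  return filepath;
};

// Stand-in for `response.data`: an in-memory stream of fake bytes.
const fakeBody = () =>
  Readable.from([Buffer.from("fake "), Buffer.from("image bytes")]);

// Write into a throwaway temp directory so the project folder stays clean.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "download-demo-"));

saveStream(fakeBody(), path.join(demoDir, "sample.bin")).then((savedPath) => {
  console.log("Saved", fs.statSync(savedPath).size, "bytes to", savedPath);
});
```

Because the promise only settles after the write finishes, the caller can safely read or move the file as soon as the `await` returns.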

Ensure Directory Existence:

This utility function checks if a directory exists and creates it if it doesn't.

javascript
const ensureDirectoryExistence = (dir) => {
  if (!fs.existsSync(dir)) {
    fs.mkdirSync(dir, { recursive: true });
  }
};
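
As a quick check of how this behaves, the sketch below calls the function twice on a nested path inside a temporary directory. The path segments are arbitrary examples chosen for the demonstration. The first call creates every missing level thanks to `recursive: true`, and the second call does nothing because the directory already exists.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

const ensureDirectoryExistence = (dir) => {
  if (!fs.existsSync(dir)) {
    fs.mkdirSync(dir, { recursive: true });
  }
};

// Work inside a throwaway temp directory so nothing in the project is touched.
const root = fs.mkdtempSync(path.join(os.tmpdir(), "ensure-dir-demo-"));
const nested = path.join(root, "images", "2024", "cats"); // example path only

ensureDirectoryExistence(nested); // creates all three missing levels
ensureDirectoryExistence(nested); // second call is a no-op

console.log("Directory ready:", fs.statSync(nested).isDirectory());
```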

Main Function:

The main function launches the Puppeteer browser, opens the Google Images search results page, extracts the outbound link URLs and saves them to filtered_urls.txt, then visits each link and downloads the images that meet the minimum size.

javascript
(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();

  const url = await searchGoogleImages();
  await page.goto(url, { waitUntil: "networkidle2" });

  const urls = await page.evaluate(() => {
    const anchors = Array.from(document.querySelectorAll("a"));
    return anchors
      .map((anchor) => anchor.href)
      .filter((href) => href && !href.includes("google"));
  });

  console.log("Filtered URLs found:", urls);
  fs.writeFileSync("filtered_urls.txt", urls.join("\n"), "utf-8");
  console.log("Filtered URLs have been saved to filtered_urls.txt");

  const imagesDir = path.resolve(__dirname, "images");
  ensureDirectoryExistence(imagesDir);

  const MIN_WIDTH = 800;
  const MIN_HEIGHT = 600;

  for (let i = 0; i < urls.length; i++) {
    const imagePage = await browser.newPage();
    try {
      await imagePage.goto(urls[i], { waitUntil: "networkidle2" });
      const imageUrls = await imagePage.evaluate(
        (MIN_WIDTH, MIN_HEIGHT) => {
          const images = Array.from(document.querySelectorAll("img"));
          return images
            .filter(
              (img) =>
                img.naturalWidth >= MIN_WIDTH && img.naturalHeight >= MIN_HEIGHT
            )
            .map((img) => img.src)
            .filter((src) => src?.startsWith("http"));
        },
        MIN_WIDTH,
        MIN_HEIGHT
      );

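      for (const [j, imageUrl] of imageUrls.entries()) {
        // Some URLs end in "/" and have no file name; fall back to "image".
        // The page and image indexes keep same-named files from overwriting each other.
        const baseName = path.basename(new URL(imageUrl).pathname) || "image";
        const imageFilepath = path.resolve(imagesDir, `${i}-${j}-${baseName}`);
        await downloadImage(imageUrl, imageFilepath);
        console.log(`Downloaded: ${imageFilepath}`);
      }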
    } catch (error) {
      console.error(`Failed to process ${urls[i]}:`, error);
    } finally {
      await imagePage.close();
    }
  }

  await browser.close();
})();
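
The link filter in the script drops any URL that contains the text "google" anywhere, which also discards unrelated pages that merely mention the word in their path. One possible refinement, sketched below, compares the hostname instead. The helper name `isExternalLink` and the list of Google-owned domains are assumptions made for this example, and the list is illustrative rather than complete.

```javascript
// Hypothetical helper: keeps http(s) links whose host is not a Google domain.
const isExternalLink = (href) => {
  let parsed;
  try {
    parsed = new URL(href);
  } catch {
    return false; // not an absolute URL
  }
  // Skip javascript:, mailto:, and other non-web schemes.
  if (!/^https?:$/.test(parsed.protocol)) return false;

  const host = parsed.hostname.toLowerCase();
  // Illustrative list: google.com, www.google.co.uk, gstatic.com, and so on.
  const googleHost =
    /(^|\.)(google\.[a-z.]+|gstatic\.com|googleusercontent\.com)$/;
  return !googleHost.test(host);
};

const sampleLinks = [
  "https://www.google.com/preferences",
  "https://example.com/blog/google-analytics-tips",
  "https://photos.example.org/gallery/cats",
  "javascript:void(0)",
];

const kept = sampleLinks.filter(isExternalLink);
console.log(kept);
```

To use it inside the script, the function body would need to live inside the `page.evaluate` callback, since that callback runs in the browser and cannot see variables defined in Node.js.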

The Helper Module (`search-google-images.js`)

This module prompts the user for a search term, navigates to the Google Images search results page, and returns the final URL.

Step-by-Step Breakdown

Import Necessary Modules:

javascript
const puppeteer = require("puppeteer");
const readline = require("node:readline");

Function to Get User Input:

This function prompts the user for a search term.

javascript
const getUserInput = (query) => {
  const rl = readline.createInterface({
    input: process.stdin,
    output: process.stdout,
  });
  return new Promise((resolve) =>
    rl.question(query, (ans) => {
      rl.close();
      resolve(ans);
    })
  );
};
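
If you want to exercise the prompt without typing into a terminal, one option is to pass the input and output streams in as parameters. The sketch below is a variation on the function above, not the original: the name `ask`, the extra parameters, and the `trim()` call are all assumptions added for illustration. A `PassThrough` stream plays the role of the keyboard.

```javascript
const readline = require("node:readline");
const { PassThrough } = require("node:stream");

// Same idea as getUserInput, but the streams are parameters (an assumption
// made for this sketch) so a script or test can supply the answer.
const ask = (query, input = process.stdin, output = process.stdout) => {
  const rl = readline.createInterface({ input, output });
  return new Promise((resolve) =>
    rl.question(query, (ans) => {
      rl.close();
      resolve(ans.trim()); // drop stray spaces around the search term
    })
  );
};

// Simulate a user typing a search term followed by Enter.
const fakeStdin = new PassThrough();
const pending = ask("Enter the search term: ", fakeStdin, new PassThrough());
fakeStdin.write("  red pandas \n");

pending.then((term) => console.log("Search term:", JSON.stringify(term)));
```

Called with only the prompt text, `ask` falls back to `process.stdin` and `process.stdout` and behaves like an interactive prompt.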

Search Google Images:

This function launches a Puppeteer browser, navigates to the Google Images search results page based on the user input, and returns the final URL.

javascript
const searchGoogleImages = async () => {
  const searchTerm = await getUserInput("Enter the search term: ");

  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();

  const searchUrl = `https://www.google.com/search?tbm=isch&q=${encodeURIComponent(
    searchTerm
  )}`;
  await page.goto(searchUrl, { waitUntil: "networkidle2" });

  const finalUrl = page.url();
  console.log("Final URL:", finalUrl);

  await browser.close();

  return finalUrl;
};

module.exports = searchGoogleImages;
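
The search URL is assembled from a fixed template, so that step can be checked on its own without launching a browser. A minimal sketch follows; `buildImageSearchUrl` is a hypothetical helper name and does not appear in the module above.

```javascript
// Hypothetical helper: the URL-building step from searchGoogleImages on its own.
const buildImageSearchUrl = (searchTerm) =>
  `https://www.google.com/search?tbm=isch&q=${encodeURIComponent(searchTerm)}`;

const exampleUrl = buildImageSearchUrl("red pandas & cats");
console.log(exampleUrl);
// → https://www.google.com/search?tbm=isch&q=red%20pandas%20%26%20cats

// Parsing the URL back confirms the term survives the round trip.
console.log(new URL(exampleUrl).searchParams.get("q"));
// → red pandas & cats
```

`encodeURIComponent` matters here: without it, the `&` in the search term would be read as the start of a new query parameter.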

Conclusion

In this article, we've walked through a Node.js script that uses Puppeteer to run a Google Images search, follow the result links, and download the high-resolution images it finds on those pages. The script can be customized and extended to suit other web scraping and automation needs. The combination of Puppeteer and Node.js offers a powerful and flexible way to interact with web pages programmatically.

Feel free to experiment with the code and adapt it for your own projects! Happy coding!
