Creating your own image dataset using Google images

Suppose we need to identify the breed of dog in a picture, but we have no suitable pictures on our computer. What can we do? Perhaps the easiest approach is to open Google Chrome and search for images online.

As an example, let’s say we are interested in Doberman dogs. Just open Google Chrome and search for doberman pictures as shown below:

Perform a search for Doberman pictures: The search returned the following results:

Open the JavaScript console: You can find the JavaScript Console in Chrome in the top-right menu:

Click on More tools and then Developer tools:

Make sure that you select the Console tab, as follows:

Using JavaScript: Scroll down the results page until enough images have loaded for your use case. Once this is done, go back to the Console tab in Developer tools, and then copy and paste the following script:

// Load jQuery into the page from the JavaScript console
var script = document.createElement('script');
script.src = "https://ajax.googleapis.com/ajax/libs/jquery/2.2.0/jquery.min.js";
document.getElementsByTagName('head')[0].appendChild(script);
// Wait a moment for jQuery to finish loading, then collect the image URLs
var urls = $('.rg_di .rg_meta').map(function() { return JSON.parse($(this).text()).ou; });
// Write the URLs to a file, one per line
var textToSave = urls.toArray().join('\n');
var hiddenElement = document.createElement('a');
hiddenElement.href = 'data:attachment/text,' + encodeURI(textToSave);
hiddenElement.target = '_blank';
hiddenElement.download = 'urls.txt';
hiddenElement.click();

This code snippet collects all the image URLs and saves them to a file called urls.txt in your default Downloads directory.
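Before downloading anything, it can help to sanity-check the contents of urls.txt. The following is a minimal sketch, not part of the original script: the helper name clean_urls and the sample URLs are made up for illustration. It drops blank lines, entries that are not web URLs, and exact duplicates while preserving order.

```python
def clean_urls(text):
    """Return the unique http(s) URLs found in `text`, one per line, in order."""
    seen = set()
    urls = []
    for line in text.splitlines():
        url = line.strip()
        # Skip blank lines and anything that is not a web URL
        if not url.startswith(("http://", "https://")):
            continue
        # Keep only the first occurrence of each URL
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Hypothetical sample standing in for the contents of urls.txt
sample = (
    "https://example.com/a.jpg\n"
    "\n"
    "https://example.com/a.jpg\n"
    "https://example.com/b.png\n"
)
print(clean_urls(sample))
```

In practice the text would come from the downloaded file, for example with open("urls.txt").read().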

Use Python to download the images: Now, we will use Python to read the URLs of the images from urls.txt and download all the images into a folder:

This can be done with the following steps:

Open a Python notebook and copy and paste the following code to download the images:

# Start by importing the required packages
from imutils import paths
import argparse
import requests
import cv2
import os

After the imports, construct the argument parser and parse the command-line arguments:

ap = argparse.ArgumentParser()
ap.add_argument("-u", "--urls", required=True,
    help="path to file containing image URLs")
ap.add_argument("-o", "--output", required=True,
    help="path to output directory of images")
args = vars(ap.parse_args())
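To see what the parsed arguments look like without running the script from a terminal, the same parser can be given an explicit argument list. This is a small sketch for illustration: wrapping the parser in a build_parser function is an assumption made here, and the file and folder names are only examples.

```python
import argparse


def build_parser():
    """Build the same two-option parser used by the download script."""
    ap = argparse.ArgumentParser()
    ap.add_argument("-u", "--urls", required=True,
                    help="path to file containing image URLs")
    ap.add_argument("-o", "--output", required=True,
                    help="path to output directory of images")
    return ap


# Passing a list to parse_args bypasses sys.argv, which is handy in a notebook
args = vars(build_parser().parse_args(["--urls", "urls.txt", "--output", "Doberman"]))
print(args)  # {'urls': 'urls.txt', 'output': 'Doberman'}
```

Because vars() turns the parsed namespace into a dictionary, the rest of the script can read the values as args["urls"] and args["output"].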

The next step is to grab the list of URLs from the input file and initialize a counter for the number of images downloaded:

rows = open(args["urls"]).read().strip().split("\n")
total = 0
# Loop over the URLs
for url in rows:
    try:
        # Try to download the image
        r = requests.get(url, timeout=60)
        # Save the image to disk
        p = os.path.sep.join([args["output"], "{}.jpg".format(
            str(total).zfill(8))])
        f = open(p, "wb")
        f.write(r.content)
        f.close()
        # Update the counter
        print("[INFO] downloaded: {}".format(p))
        total += 1

Any exception thrown during the download needs to be handled so that the loop skips that URL and continues:

    except Exception:
        print("[INFO] error downloading {}...skipping".format(url))
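The download loop can also be expressed as a reusable function. The sketch below is one possible arrangement under stated assumptions, not the original script: the names output_path and save_images are hypothetical, the download itself is passed in as a fetch callable so the logic can be tried without network access, and the output folder is created if it is missing.

```python
import os


def output_path(output_dir, index):
    """Return the zero-padded .jpg path for the image at position `index`."""
    return os.path.join(output_dir, "{}.jpg".format(str(index).zfill(8)))


def save_images(urls, output_dir, fetch):
    """Save the bytes returned by fetch(url) for each URL, skipping failures.

    `fetch` is any callable that takes a URL and returns bytes or raises.
    Returns the list of file paths that were written.
    """
    os.makedirs(output_dir, exist_ok=True)  # create the folder if needed
    saved = []
    for url in urls:
        try:
            data = fetch(url)
        except Exception:
            print("[INFO] error downloading {}...skipping".format(url))
            continue
        # Number files by how many have been saved so far, as the script does
        path = output_path(output_dir, len(saved))
        with open(path, "wb") as f:
            f.write(data)
        print("[INFO] downloaded: {}".format(path))
        saved.append(path)
    return saved
```

With the requests package, fetch could be lambda u: requests.get(u, timeout=60).content, which mirrors the call used in the script.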

Next, loop over the paths of the downloaded images:

for imagePath in paths.list_images(args["output"]):

For each image, initialize a flag that records whether it should be deleted:

    delete = False

Then try to load the image with OpenCV:

    try:
        image = cv2.imread(imagePath)

If cv2.imread returns None, the image could not be loaded properly and should be deleted from disk:

        if image is None:
            delete = True

If OpenCV raises an exception while loading the image, the file is likely corrupt and should also be deleted:

    except Exception:
        print("Except")
        delete = True

Finally, check whether the image was flagged for deletion and, if so, remove it:

    if delete:
        print("[INFO] deleting {}".format(imagePath))
        os.remove(imagePath)
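The cleanup fragments above fit together as a single loop. As a rough sketch, the same logic can be written as a function: the name prune_unreadable is hypothetical, and the image loader is passed in as a parameter so the behavior can be checked without OpenCV installed.

```python
import os


def prune_unreadable(image_paths, load):
    """Delete every file that `load` cannot read and return the deleted paths.

    `load` is any callable that returns None or raises for a file it
    cannot decode, which is how cv2.imread is used in the script.
    """
    deleted = []
    for image_path in image_paths:
        try:
            unreadable = load(image_path) is None
        except Exception:
            # A loader error is treated the same as an unreadable file
            unreadable = True
        if unreadable:
            print("[INFO] deleting {}".format(image_path))
            os.remove(image_path)
            deleted.append(image_path)
    return deleted
```

With the packages imported by the script, a call might look like prune_unreadable(list(paths.list_images("Doberman")), cv2.imread), where wrapping the paths in list() collects them before any file is removed.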

With that complete, download the notebook as a Python file and name it image_download.py. Place the urls.txt file in the same folder as this Python file, so that the relative path urls.txt used in the command below resolves when you run it from that folder.
Next, execute the Python file from the command line as shown here (make sure your PATH variable includes your Python location, and create the Doberman folder first, since the script does not create it):

python image_download.py --urls urls.txt --output Doberman

Executing this command downloads the images to the folder named Doberman. Once it has completed, you should see the Doberman images that you viewed in Google Chrome, as shown in the following image:

Select the required folder for saving the images as shown:

That's it: we now have a folder full of Doberman images. The same method can be applied to create a folder for any other category that we may need.

The Google image results may include images that are not what you want. Browse through the downloaded folder and remove any unwanted images.
