Appendix C. Binary data and GridFS

For storing images, thumbnails, audio, and other binary files, many applications rely solely on the filesystem. Although filesystems provide fast access to files, filesystem storage can also lead to organizational chaos. Consider that most filesystems limit the number of files per directory. If you have millions of files to keep track of, you need to devise a strategy for organizing files into multiple directories. Another difficulty involves metadata. Because the file metadata is still stored in a database, performing an accurate backup of the files and their metadata can be incredibly complicated.

For certain use cases, it may make sense to store files in the database itself because doing so simplifies file organization and backup. In MongoDB, you can use the BSON binary type to store any kind of binary data. This data type corresponds to the RDBMS BLOB (binary large object) type, and it’s the basis for two flavors of binary object storage provided by MongoDB.

The first uses one document per file and is best for smaller binary objects. If you need to catalog a large number of thumbnails or binary MD5s, using single-document binary storage can make life much easier. On the other hand, you might want to store large images or audio files. In this case, GridFS, a MongoDB API for storing binary objects of any size, would be a better choice. In the next two sections, you’ll see complete examples of both storage techniques.

C.1. Simple binary storage

BSON includes a first-class type for binary data. You can use this type to store binary objects directly inside MongoDB documents. The only limit on object size is the document size limit itself, which is 16 MB since MongoDB v2.0. Because large documents like this can tax system resources, you’re encouraged to use GridFS for any binary objects you want to store that are larger than 1 MB.
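The size guidance above can be expressed as a small decision helper. This is only a sketch: the method name storage_strategy is hypothetical, and the 1 MB threshold is the recommendation from the text rather than a limit enforced by MongoDB.

```ruby
# Threshold taken from the text: above roughly 1 MB, GridFS is recommended.
# (The hard limit for a single document is 16 MB.)
GRIDFS_THRESHOLD_BYTES = 1024 * 1024

# Hypothetical helper: pick a storage technique for a payload of the given size.
def storage_strategy(byte_size)
  byte_size > GRIDFS_THRESHOLD_BYTES ? :gridfs : :single_document
end

puts storage_strategy(10 * 1024)   # a 10 KB thumbnail
puts storage_strategy(2_004_828)   # an image of roughly 2 MB
```

In a real application you would pass File.size(path) to such a helper before deciding which of the two techniques to use.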

We’ll look at two reasonable uses of binary object storage in single documents. First, you’ll see how to store an image thumbnail. Then, you’ll see how to store the accompanying MD5.

C.1.1. Storing a thumbnail

Imagine you need to store a collection of image thumbnails. The code is straightforward. First, you get the image’s filename, canyon-thumb.jpg, and then read the data into a local variable. Next, you wrap the raw binary data as a BSON binary object using the Ruby driver’s BSON::Binary constructor:

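require 'rubygems'
require 'mongo'

# Read the thumbnail as raw bytes; binread uses binary mode and closes the file
image_filename = File.join(File.dirname(__FILE__), "canyon-thumb.jpg")
image_data = File.binread(image_filename)

# Wrap the bytes in a BSON binary object (default subtype)
bson_image_data = BSON::Binary.new(image_data)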

All that remains is to build a simple document to contain the binary data and then insert it into the database:

doc = {"name" => "monument-thumb.jpg",
       "data" => bson_image_data }
@con = Mongo::Client.new(['localhost:27017'], :database => 'images')
@thumbnails = @con[:thumbnails]
result = @thumbnails.insert_one(doc)

To extract the binary data, fetch the document. In Ruby, the to_s method unpacks the data into a binary string, and you can use this to compare the saved data to the original:

@thumbnails.find({"name" => "monument-thumb.jpg"}).each do |doc|
  if image_data == doc["data"].to_s
    puts "Stored image is equal to the original file!"
  end
end

If you run the preceding script, you’ll see a message indicating that the two files are indeed the same.

C.1.2. Storing an MD5

It’s common to store a checksum as binary data, and this marks another potential use of the BSON binary type. Here’s how you can generate an MD5 of the thumbnail and add it to the document just stored:

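require 'digest/md5'

# The _id of the thumbnail document comes from the earlier insert_one result
@image_id = result.inserted_id

# digest returns the 16 raw bytes of the checksum (not the hex string)
md5 = Digest::MD5.file(image_filename).digest
bson_md5 = BSON::Binary.new(md5, :md5)
@thumbnails.update_one({:_id => @image_id}, {"$set" => {:md5 => bson_md5}})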

Note that when creating the BSON binary object, you tag the data with :md5. The subtype is an extra field on the BSON binary type that indicates what kind of binary data is being stored. This field is entirely optional, though, and has no effect on how the database stores or interprets the data.[1]

[1] This wasn’t always technically true. The deprecated default subtype of 2 indicated that the attached binary data also included four extra bytes to indicate the size, and this did affect a few database commands. The current default subtype is 0, and all subtypes now store the binary payload the same way. Subtype can therefore be seen as a kind of lightweight tag to be optionally used by application developers.
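To see why the checksum is stored as binary rather than as text, it helps to compare the two forms that Ruby's Digest::MD5 can produce. The following is a small sketch using only the standard library; the sample string merely stands in for the thumbnail's bytes.

```ruby
require 'digest/md5'

# Placeholder data standing in for the contents of the thumbnail file
data = "example bytes standing in for the thumbnail file"

raw = Digest::MD5.digest(data)      # raw checksum bytes, suitable for BSON binary
hex = Digest::MD5.hexdigest(data)   # the same checksum as a hexadecimal string

puts raw.bytesize                   # 16
puts hex.length                     # 32

# Both forms carry the same information
puts raw.unpack('H*').first == hex  # true
```

The raw form takes half the space of the hexadecimal form, which adds up when you catalog a large number of checksums.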

It’s easy to query for the document just stored, but do notice that you exclude the data field to keep the return document small and readable:

> use images
> db.thumbnails.findOne({}, {data: 0})
{
  "_id" : ObjectId("4d608614238d3b4ade000001"),
  "md5" : BinData(5,"K1ud3EUjT49wdMdkOGjbDg=="),
  "name" : "monument-thumb.jpg"
}

See that the MD5 field is clearly marked as binary data (briefly mentioned in Table 5.6) with the subtype and raw payload. Keep in mind that MongoDB sorts BinData first by the length or size of the data, second by the BSON one-byte subtype, and last by the data, performing a byte-by-byte comparison.
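The sort order just described can be imitated in plain Ruby. This is an illustrative sketch only: BinValue is a hypothetical stand-in for a BSON binary value, and the comparison runs in Ruby, not in the database.

```ruby
# Hypothetical stand-in for a BSON binary value: a subtype byte plus a payload
BinValue = Struct.new(:subtype, :data)

# Sort key mirroring the order described above:
# length first, then subtype, then the bytes themselves.
def bindata_sort_key(value)
  [value.data.bytesize, value.subtype, value.data.bytes]
end

values = [
  BinValue.new(0, "abc"),
  BinValue.new(5, "zz"),
  BinValue.new(0, "zz"),
  BinValue.new(0, "ab")
]

sorted = values.sort_by { |v| bindata_sort_key(v) }
sorted.each { |v| puts "#{v.data.bytesize} #{v.subtype} #{v.data}" }
```

Note that the three-byte value sorts last even though its subtype is 0 and its bytes begin with "a", because length is compared before anything else.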

C.2. GridFS

GridFS is a convention for storing files of arbitrary size in MongoDB. The GridFS specification is implemented by all the official drivers and by MongoDB’s mongofiles tool, ensuring consistent access across platforms. GridFS is useful for storing large binary objects in the database. It’s frequently fast enough to serve these objects as well, and the storage method is conducive to streaming.

The term GridFS may lead to confusion, so two clarifications are worth making right off the bat. The first is that GridFS isn’t an intrinsic feature of MongoDB. As mentioned, it’s a convention that all the official drivers (and some tools) use to manage large binary objects in the database. Second, it’s important to clarify that GridFS doesn’t have the rich semantics of bona fide filesystems. For instance, there’s no protocol for locking and concurrency, and this limits the GridFS interface to simple put, get, and delete operations. This means that if you want to update a file, you need to delete it and then put the new version.
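The delete-then-put pattern for updating a file might look like the following sketch. The method name replace_gridfs_file is hypothetical, and bucket is assumed to behave like the Ruby driver's GridFS bucket (the object returned by calling fs on a database), responding to find, delete, and upload_from_stream.

```ruby
# Sketch: "update" a GridFS file by deleting every stored file with the given
# filename and then uploading the new content under that name.
# Assumption: `bucket` responds to find, delete, and upload_from_stream.
def replace_gridfs_file(bucket, filename, io)
  bucket.find(:filename => filename).each do |file_doc|
    bucket.delete(file_doc[:_id])   # removes the metadata and its chunks
  end
  bucket.upload_from_stream(filename, io)   # returns the new file's ID
end

# Possible usage, assuming a connected client as in the examples that follow:
#   File.open("canyon.jpg", "rb") do |io|
#     replace_gridfs_file(client.database.fs, "canyon.jpg", io)
#   end
```

Because the two steps aren't atomic, a reader may briefly find no file under that name; uploading first and deleting the older versions afterward is an alternative if that matters for your application.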

GridFS works by dividing a large file into small, 255 KB chunks and then storing each chunk as a separate document—versions prior to MongoDB v2.4.10 use 256 KB chunks. By default, these chunks are stored in a collection called fs.chunks. Once the chunks are written, the file’s metadata is stored in a single document in another collection called fs.files. Figure C.1 contains a simplistic illustration of this process applied to a theoretical 1 MB file called canyon.jpg. Note that the use of the term chunks in the context of GridFS isn’t related to the use of the term chunks in the context of sharding.

Figure C.1. Storing a file with GridFS using 256 KB chunks on a MongoDB server prior to v2.4.10

That should be enough theory to use GridFS. Next we’ll see GridFS in practice through the Ruby GridFS API and the mongofiles utility.

C.2.1. GridFS in Ruby

Earlier you stored a small image thumbnail. The thumbnail took up only 10 KB and was thus ideal for keeping in a single document. The original image is almost 2 MB in size, and is therefore much more appropriate for GridFS storage. Here you’ll store the original using the Ruby driver’s GridFS API. First, you connect to the database and then call fs on the database object to get its GridFS bucket; files written through this object are stored in that database.

Next, you open the original image file, canyon.jpg, for reading. The most basic GridFS interface uses methods to upload and download a file. Here you use the upload_from_stream method, which takes a filename and an IO object, such as a file pointer. You pass in the file pointer and the data is written to the database.

The method returns the file’s unique object ID. This example uses the latest Ruby MongoDB driver:

require 'rubygems'
require 'mongo'

client = Mongo::Client.new([ '127.0.0.1:27017' ], :database => 'images')
fs = client.database.fs

# Open in binary mode; the block form closes the file once the upload finishes
file_id = File.open("canyon.jpg", "rb") do |file|
  fs.upload_from_stream("canyon.jpg", file)
end

As stated, GridFS uses two collections for storing file data. The first, normally called fs.files, keeps each file’s metadata. The second collection, fs.chunks, stores one or more chunks of binary data for each file. Let’s briefly examine these from the shell.

Switch to the images database, and query the fs.files collection for the file you just stored. You’ll see its metadata:

> use images
> db.fs.files.find({filename: "canyon.jpg"}).pretty()
{
  "_id" : ObjectId("5612e19a530a6919ed000001"),
  "chunkSize" : 261120,
  "uploadDate" : ISODate("2015-10-05T20:46:18.849Z"),
  "contentType" : "binary/octet-stream",
  "filename" : "canyon.jpg",
  "length" : 281,
  "md5" : "597d619c415a4db144732aed24b6ff0b"
}

These are the minimum required attributes for every GridFS file. Most are self-explanatory. The length field records the file’s size in bytes, and chunkSize records the size of each chunk: the 261120 bytes shown here is 255 KB, whereas a file stored with the 256 KB chunks used prior to MongoDB v2.4.10 would show 262144. You’ll also notice an MD5. The GridFS spec requires a checksum to ensure that the stored file is the same as the original.

Each chunk stores the object ID of its file in a field called files_id. Thus you can easily count the number of chunks this file uses, substituting the _id of your own file in the query:

> db.fs.chunks.count({"files_id" : ObjectId("4d606588238d3b4471000001")})
8

Given the chunk size and the total file size, eight chunks are exactly what you should expect. The contents of the chunks themselves are easy to see, too. As earlier, you’ll want to exclude the data to keep the output readable. This query returns the first of the eight chunks, as indicated by the value of n:
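The arithmetic behind that count is a ceiling division of the file length by the chunk size. A quick check in Ruby, using the 2,004,828-byte size that this appendix reports for canyon.jpg (chunk_count is a hypothetical helper name):

```ruby
# Number of chunks needed: the file length divided by the chunk size, rounded up
def chunk_count(length, chunk_size)
  (length + chunk_size - 1) / chunk_size
end

length = 2_004_828   # size of canyon.jpg in bytes

puts chunk_count(length, 255 * 1024)   # 8
puts chunk_count(length, 256 * 1024)   # 8
```

The result is eight chunks with either chunk size; the last chunk is only partially filled.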

> db.fs.chunks.findOne({files_id: ObjectId("4d606588238d3b4471000001")},
          {data: 0})
{
  "_id" : ObjectId("4d606588238d3b4471000002"),
  "n" : 0,
  "files_id" : ObjectId("4d606588238d3b4471000001")
}
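The n field is what lets a reader put the chunks back in order, and the stored MD5 is what lets it confirm the result. The following sketch shows the idea with plain hashes standing in for chunk documents; reassemble is a hypothetical helper, not part of the driver, which handles this for you.

```ruby
require 'digest/md5'

# Hypothetical helper: rebuild a file from chunk documents by ordering them
# on "n" and concatenating their data.
def reassemble(chunks)
  chunks.sort_by { |c| c["n"] }.map { |c| c["data"] }.join
end

original = "GridFS stores files as ordered chunks"

# Deliberately out of order, as an unsorted query might return them
chunks = [
  { "n" => 2, "data" => original.byteslice(20..-1) },
  { "n" => 0, "data" => original.byteslice(0, 10) },
  { "n" => 1, "data" => original.byteslice(10, 10) }
]

rebuilt = reassemble(chunks)
stored_md5 = Digest::MD5.hexdigest(original)   # stands in for the md5 in fs.files

puts Digest::MD5.hexdigest(rebuilt) == stored_md5   # true
```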

Reading GridFS files is as easy as writing them. In the following example, you create a text file on the fly, give it a name, and store it using GridFS. You then look it up in the database with find_one(), which returns a Mongo::Grid::File object. From that object you take the file ID and use it to download the text file from the database, saving it locally as perfectCopy:

require 'rubygems'
require 'mongo'
include Mongo

$client = Mongo::Client.new([ '127.0.0.1:27017' ], :database => 'garden')
fs = $client.database.fs

# Create a text file from raw data and store it
file = Mongo::Grid::File.new('I am a NEW file', :filename => 'aFile.txt')
fs.insert_one(file)

# Look the file up again by name
file_obj = fs.find_one(:filename => 'aFile.txt')
file_id = file_obj.id

# Download it; the block form closes the output file when done
File.open('perfectCopy', 'wb') do |file_to_write|
  fs.download_to_stream(file_id, file_to_write)
end

You can then verify for yourself that perfectCopy is a text file with the correct data in it:

$ cat perfectCopy
I am a NEW file

That’s the basics of reading and writing GridFS files from a driver. The various GridFS APIs vary slightly, but with the foregoing examples and the basic knowledge of how GridFS works, you should have no trouble making sense of your driver’s docs. At the time of writing, the latest Ruby MongoDB Driver is v2.1.1.

C.2.2. GridFS with mongofiles

The MongoDB distribution includes a handy utility called mongofiles for listing, putting, getting, and deleting GridFS files using the command line. For example, you can list the GridFS files in the images database:

$ mongofiles -d images list
connected to: 127.0.0.1
canyon.jpg  2004828

You can also easily add files. Here’s how you can add a copy of the image, saved locally as canyon-copy.jpg:

$ mongofiles -d images put canyon-copy.jpg
connected to: 127.0.0.1
added file: { _id: ObjectId('4d61783326758d4e6727228f'),
              filename: "canyon-copy.jpg",
              chunkSize: 262144, uploadDate: new Date(1298233395296),
              md5: "9725ad463b646ccbd287be87cb9b1f6e", length: 2004828 }

You can again list the files to verify that the copy was written:

$ mongofiles -d images list
connected to: 127.0.0.1
canyon.jpg  2004828
canyon-copy.jpg  2004828

mongofiles supports a number of options, and you can view them with the --help parameter:

$ mongofiles --help
Usage:
  mongofiles <options> <command> <filename or _id>

Manipulate gridfs files using the command line.

Possible commands include:
    list      - list all files; 'filename' is an optional prefix which listed
                filenames must begin with
    search    - search all files; 'filename' is a substring which listed
                filenames must contain
    put       - add a file with filename 'filename'
    get       - get a file with filename 'filename'
    get_id    - get a file with the given '_id'
    delete    - delete all files with filename 'filename'
    delete_id - delete a file with the given '_id'

See http://docs.mongodb.org/manual/reference/program/mongofiles/ for more information.

general options:
      --help                     print usage
      --version                  print the tool version and exit

verbosity options:
  -v, --verbose      more detailed log output (include multiple times for more
                     verbosity, e.g. -vvvvv)
      --quiet        hide all log output

connection options:
  -h, --host=         mongodb host to connect to (setname/host1,host2 for
                      replica sets)
      --port=         server port (can also use --host hostname:port)

authentication options:
  -u, --username=                username for authentication
  -p, --password=                password for authentication

      --authenticationDatabase=  database that holds the user's credentials
      --authenticationMechanism= authentication mechanism to use

storage options:
  -d, --db=                      database to use (default is 'test')
  -l, --local=                   local filename for put|get
  -t, --type=                    content/MIME type for put (optional)
  -r, --replace                  remove other files with same name after put
      --prefix=                  GridFS prefix to use (default is 'fs')
      --writeConcern=    write concern options e.g. --writeConcern majority,
                         --writeConcern '{w: 3, wtimeout: 500, fsync: true, j:
                         true}' (defaults to 'majority')