Monday, August 17, 2020

MongoDB interview

 Hi Friends,


In this post, I am sharing interview questions asked on MongoDB.

This interview was conducted for a Technical Architect position, at around 13 years of experience.


Question 1:

What is the maximum size of a document in MongoDB?

Answer:

The maximum size of a document in MongoDB is 16 MB.


Question 2:

If a file to be stored is larger than the BSON document size limit of 16 MB, how does MongoDB store that file?

Answer:

For that purpose, we can use GridFS. GridFS is a specification for storing and retrieving files that exceed the BSON document size limit of 16 MB.


GridFS does not support multi-document transactions.


Instead of storing a file in a single document, GridFS divides the file into parts, or chunks [not related to chunks in MongoDB sharding], and stores each chunk as a separate document. By default, GridFS uses a chunk size of 255 KB; that is, GridFS divides a file into chunks of 255 KB, with the exception of the last chunk.

The last chunk is only as large as necessary.

Similarly, files that are no larger than the chunk size have only a final chunk, using only as much space as needed plus some additional metadata.
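The chunk arithmetic above can be sketched in plain JavaScript (a toy calculation, not driver code; the 255 KB default comes from the GridFS specification):

```javascript
// Default GridFS chunk size: 255 KB.
const CHUNK_SIZE = 255 * 1024;

// Given a file length in bytes, return the number of chunks and
// the size of the final chunk (which is only as large as necessary).
function chunkLayout(fileLength, chunkSize = CHUNK_SIZE) {
  if (fileLength === 0) return { chunkCount: 0, lastChunkSize: 0 };
  const chunkCount = Math.ceil(fileLength / chunkSize);
  const remainder = fileLength % chunkSize;
  return { chunkCount, lastChunkSize: remainder === 0 ? chunkSize : remainder };
}

// A 1 MB file splits into five chunks: four full 255 KB chunks
// plus a small 4 KB final chunk.
console.log(chunkLayout(1024 * 1024)); // { chunkCount: 5, lastChunkSize: 4096 }
```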

GridFS uses two collections to store files. One collection stores the file chunks, and the other stores file metadata.

When we query GridFS for a file, the driver will reassemble the chunks as needed. We can perform range queries on files stored through GridFS. We can also access information from arbitrary sections of files, such as to "skip" to the middle of a video or audio file.
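The "skip to the middle" behaviour falls out of the fixed chunk size: given any byte offset, the driver can compute which chunk holds it. A rough sketch of that arithmetic (illustrative only; real drivers do this internally):

```javascript
const CHUNK_SIZE = 255 * 1024; // default GridFS chunk size

// For a requested byte offset, find the chunk sequence number (the "n"
// field) and the position inside that chunk where reading should start.
function locateOffset(offset, chunkSize = CHUNK_SIZE) {
  return {
    n: Math.floor(offset / chunkSize),   // which chunk document to fetch
    offsetInChunk: offset % chunkSize,   // where to start within its data
  };
}

// Seeking to byte 1,000,000 of a video lands in chunk 3.
console.log(locateOffset(1000000)); // { n: 3, offsetInChunk: 216640 }
```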

GridFS is useful not only for storing files that exceed 16MB but also for storing any files for which we want access without having to load the entire file into memory.


When to use GridFS?


In MongoDB, use GridFS for storing files larger than 16MB.

In some situations, storing large files may be more efficient in a MongoDB database than on a system-level file system.

  • If the file system limits the number of files in a directory, we can use GridFS to store as many files as needed.
  • When we want to access information from portions of large files without having to load whole files into memory, we can use GridFS to recall sections of files without reading the entire file into memory.
  • When we want to keep our files and metadata automatically synced and deployed across a number of systems and facilities, we can use GridFS. When using geographically distributed replica sets, MongoDB can distribute files and their metadata automatically to a number of MongoDB instances and facilities.

 

Do not use GridFS if we need to update the content of an entire file atomically. As an alternative, we can store multiple versions of each file and specify the current version of the file in the metadata. We can update the metadata field that indicates "latest" status in an atomic update after uploading the new version of the file, and later remove previous versions if needed.

Furthermore, if our files are all smaller than the 16 MB BSON document size limit, consider storing each file in a single document instead of using GridFS. We may use the BinData data type to store the binary data.
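In the mongo shell, the single-document alternative looks like the sketch below (the attachments collection name and the payload are made up for illustration; BinData takes a subtype and a base64 string):

```javascript
// Store a small file (< 16 MB) directly as binary data in one document,
// with no GridFS involved.
db.attachments.insertOne({
    filename: "logo.png",
    contentType: "image/png",
    data: BinData(0, "iVBORw0KGgo=")   // truncated base64 payload, for illustration
});
```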


Use GridFS:

To store and retrieve files using GridFS, use either of the following:

  • A MongoDB driver.
  • The mongofiles command-line tool.
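For example, with the mongofiles tool against a local server (the database name files is an assumption):

```shell
# Upload a file into GridFS (creates fs.files / fs.chunks documents).
mongofiles --db=files put logo.png

# List the files stored in the bucket, then download one back to disk.
mongofiles --db=files list
mongofiles --db=files get logo.png
```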


GridFS Collections:

GridFS stores files in two collections:

  • chunks stores the binary chunks.
  • files stores the file's metadata.


GridFS places the collections in a common bucket by prefixing each with the bucket name. By default, GridFS uses two collections with a bucket named fs:

  • fs.files
  • fs.chunks


We can choose a different bucket name, as well as create multiple buckets in a single database. The full collection name, which includes the bucket name, is subject to the namespace length limit.


The chunks collection:

Each document in the chunks collection represents a distinct chunk of a file as represented in GridFS. 

Documents in this collection have the following form:

{
    "_id" : <ObjectId>,
    "files_id" : <ObjectId>,
    "n" : <num>,
    "data" : <binary>
}

A document from the chunks collection contains the following fields:

chunks._id  : The unique ObjectId of the chunk.

chunks.files_id : The _id of the parent document, as specified in the files collection.

chunks.n : The sequence number of the chunk. GridFS numbers all chunks, starting with 0.

chunks.data : The chunk's payload as a BSON binary type. 
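Putting these fields together, splitting a payload into chunk-shaped documents can be sketched as follows (toy code; a real driver also generates an ObjectId _id per chunk and stores data as a BSON binary value):

```javascript
// Split a Buffer into documents shaped like GridFS chunks:
// { files_id, n, data }.
function toChunkDocs(filesId, buffer, chunkSize = 255 * 1024) {
  const docs = [];
  for (let n = 0; n * chunkSize < buffer.length; n++) {
    docs.push({
      files_id: filesId,   // the _id of the parent files-collection document
      n,                   // sequence number, starting at 0
      data: buffer.slice(n * chunkSize, (n + 1) * chunkSize),
    });
  }
  return docs;
}

const docs = toChunkDocs("file-1", Buffer.alloc(300 * 1024)); // 300 KB payload
console.log(docs.length, docs[1].data.length); // 2 chunks; the last is 45 KB
```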


The files collection:

Each document in the files collection represents a file in GridFS.

{
    "_id" : <ObjectId>,
    "length" : <num>,
    "chunkSize" : <num>,
    "uploadDate" : <timestamp>,
    "md5" : <hash>,
    "filename" : <string>,
    "contentType" : <string>,
    "aliases" : <string array>,
    "metadata" : <any>
}

Documents in the files collection contain some or all of the following fields:

files._id : The unique identifier for this document. The _id is of the data type we chose for the original document. The default type for MongoDB documents is BSON ObjectId.

files.length :  The size of the document in bytes.

files.chunkSize : The size of each chunk in bytes. GridFS divides the document into chunks of size chunkSize, except for the last, which is only as large as needed. The default size is 255 kilobytes (KB).

files.uploadDate : The date the document was first stored by GridFS. This value has the Date type.

files.md5 : It is deprecated. 

files.filename :  It is optional. A human-readable name for the GridFS file.

files.contentType : It is also deprecated.

files.aliases : It is also deprecated. An optional array of alias strings.

files.metadata : It is optional. The metadata field may be of any data type and can hold any additional information we want to store. If we wish to add additional arbitrary fields to documents in the files collection, add them to an object in the metadata field.
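As an illustration, the non-deprecated fields of a files document can be assembled like this (a sketch only; a real driver generates the _id as an ObjectId and writes the document for us):

```javascript
// Build a files-collection style metadata document for a payload.
function makeFilesDoc(filename, payload, metadata = {}) {
  return {
    length: payload.length,   // total file size in bytes
    chunkSize: 255 * 1024,    // default chunk size
    uploadDate: new Date(),   // when GridFS first stored the file
    filename,                 // optional human-readable name
    metadata,                 // arbitrary application data
  };
}

const doc = makeFilesDoc("report.pdf", Buffer.alloc(1024), { owner: "alice" });
console.log(doc.length, doc.filename); // 1024 'report.pdf'
```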


Question 3:

What are GridFS indexes?

Answer:

GridFS uses indexes on each of the chunks and files collections for efficiency. Drivers that conform to the GridFS specification automatically create these indexes for convenience. We can also create any additional indexes as desired to suit our application's needs.

The chunks index:

GridFS uses a unique, compound index on the chunks collection using the files_id and n fields.

This allows for efficient retrieval of chunks, as demonstrated in the following example:

db.fs.chunks.find({ files_id: myFileID}).sort({n:1})

Drivers that conform to the GridFS specification will automatically ensure that this index exists before read and write operations. 

If this index does not exist, we can issue the following operation to create it using the mongo shell:

db.fs.chunks.createIndex({files_id: 1, n : 1}, {unique : true});


The files index:

GridFS uses an index on the files collection using the filename and uploadDate fields. This index allows for efficient retrieval of files, as shown in the example:

db.fs.files.find({filename : myFileName}).sort({uploadDate : 1});


How do we shard GridFS?

There are two collections to consider with GridFS: files and chunks.

chunks collection : To shard the chunks collection, use either { files_id : 1, n : 1 } or { files_id : 1 } as the shard key index. files_id is an ObjectId and changes monotonically.

For MongoDB drivers that do not run filemd5 to verify successful upload, we can use hashed sharding for the chunks collection.

If the MongoDB driver runs filemd5, we cannot use hashed sharding.


files collection:

The files collection is small and only contains metadata. None of the required keys for GridFS lend themselves to an even distribution in a sharded environment. Leaving files unsharded allows all the file metadata documents to live on the primary shard.

    

That's all for this post.

Thanks for reading!!

