Indexing in Practice using MongoDB

This article is based on MongoDB in Action ( Buy from Amazon), to be published on July, 2011. It is being reproduced here by permission from Manning Publications. Manning publishes MEAP (Manning Early Access Program,) eBooks and pBooks. MEAPs are sold exclusively through Manning.com. All pBook purchases include free PDF, mobi and epub. When mobile formats become available all customers will be contacted and upgraded. Visit Manning.com for more information. [ Use promotional code ‘java40beat’ and get 40% discount on eBooks and pBooks ]

Indexing in Practice

Introduction

Indexes are enormously important. With the right indexes in place, MongoDB can use its hardware efficiently and serve your application’s queries quickly. With the wrong indexes, you’ll see the exact opposite effect: slow queries and poorly utilized hardware. It stands to reason, then, that anyone wanting to use MongoDB effectively and make the best use of hardware resources must understand indexing.

We’re going to look at some refinements on the kinds of indexes that can be created in MongoDB. We’ll then proceed to some of the niceties of administering those indexes.

Index types

All indexes in MongoDB use the same underlying data structure, but several kinds of indexes are nevertheless permitted. In particular, unique, sparse, and multikey indexes are frequently used, and here we describe them in some detail.

Unique indexes

To create a unique index, specify the unique option:

1	db.users.ensureIndex({username: 1}, {unique: true})

Unique indexes ensure that all index entries are unique. Thus, if we try to insert a document into the collection with an already-indexed username, users that insert will fail with the following exception:

1	E11000 duplicate key error index: gardening.users.$username_1 dup key: { : "kbanker" }

If using a driver, this exception will be caught only if you perform the insert using your driver’s safe mode. If you need a unique index on a collection, it’s usually best to create the index before inserting any data. If you create the index in advance, you guarantee the uniqueness constraint from the start. Creating a unique index on a collection that already contains data, you run the risk of failure since it’s entirely possible that duplicate keys will exist in the collection. When duplicate keys exist, the index creation fails.

If you do find yourself needing to create a unique index on an established collection, you have a couple of options. The first is to repeatedly attempt to create the unique index and use the failure messages to intelligently remove the documents with duplicate keys. But if the data isn’t so important, you can also instruct the database to drop documents with duplicate keys automatically using the dropDups option. So, to take an example, if our users collection already contains data and if we don’t care that documents with duplicate keys are removed, then we can issue the index creation command like this:

1	db.users.ensureIndex({username: 1}, {unique: true, dropDups: true})

Sparse indexes

Indexes are dense by default. This means that, for every document, in an indexed collection, there will be a corresponding entry in the index even if the document lacks the indexed key. Suppose you have a products collection in an e-commerce data model, and imagine that you’ve built an index on the product attribute category_ids. Now, suppose that there are a few products that haven’t been assigned to any categories. For each of these category-less products, there will still exist a null entry in the category_ids index. We can query for those null values:

1	db.products.find({category_ids: null})

When searching for all products lacking a category, the query optimizer will still be able to use the category_ids index to locate those products. When the number of null values in the index is small, this is still a good thing because this query can use the index.

But, there are two cases where a dense index is undesirable. The first is when you want a unique index on a field that doesn’t appear in every document in the collection. For instance, we definitely want a unique index on every product’s sku field. But, suppose that, for some reason, products are entered into the system before a sku is assigned. If we have a unique index on sku and we attempt to insert more than one product without a sku, then those inserts will fail because there will already be an entry in the index where sku is null. This is, therefore, a case where a dense index doesn’t serve our purpose. What we want instead is a sparse index.

In a sparse index, only those documents having a non-null value for the indexed attribute will appear. If we want to create a sparse index in MongoDB, all we have to do is specify {sparse: true}. So, for example, we can create a unique, sparse index on sku like so:

1	db.products.ensureIndex({sku: 1}, {unique: true, sparse: true})

There’s another case where a sparse index is desirable: when a large number of documents in a collection don’t contain the indexed key. For example, suppose we allowed anonymous reviews on our e-commerce site. In this case, half the reviews might lack a user_id field and, if that field were indexed, then half the entries in that index would be null. This would be pretty inefficient for two reasons. First, it would increase the size of the index. Second, it would require updates to the index when adding and removing documents with null user_id fields.

If we rarely (or never) expected queries on anonymous reviews, we might elect to build a sparse index on user_id. Again, setting the sparse option is simple:

1	db.reviews.ensureIndex({user_id: 1}, {sparse: true})

Now, only those reviews linked to a user via the user_id field would be indexed.

Multikey indexes

Indexing fields whose values are arrays is made possible by what’s known as a multikey index, which allows multiple entries in the index to reference the same document. This makes sense if we take a simple example. Suppose we have a product document with a few tags like this:

1
2
3

{ name: "Wheelbarrow",
		tags: ["tools", "gardening", "soil"]
	}

If we create an index on tags, then each value in this document’s tags array will appear in the index. This means that a query on any one of these array values can use the index to locate the document. This is the idea behind a multikey index: multiple index entries, or keys, end up referencing the same document.

Multikey indexes are always enabled in MongoDB. Anytime an indexed field contains an array, each array value will be given its own entry in the index. The intelligent use of multikey indexes is essential to proper MongoDB schema design.

Index administration

When it comes to administering indexes in MongoDB, there may be some gaps in your understanding. Here we’ll examine index creation and deletion in detail and address questions surrounding compaction and backups.

Creating and deleting indexes

If you’ve created a few indexes, there should be no mysteries surrounding the index creation syntax. Simply call one of the index creation helper methods, either in the shell or with your language of choice and a document defining the new index will be placed into the special system.indexes collection.

While it’s usually easier to use a helper method to create an index, you can insert an index specification manually (this is, after all, what the helper methods do). You just need to be sure you’ve specified the minimum set of keys: ns, key, and name. ns is the namespace, key is the field or combination of fields to index, and name is a name used to refer to the index. Any additional options, like sparse, can also be specified here. So to have an example, let’s create a sparse index on our users collection:

1 2	spec = {ns: "green.users", key: {‘addresses.zip': 1}, name: ‘zip'} db.system.indexes.insert(spec, true)

If no errors are returned on insert, then the index now exists and we can query the system.indexes collection to prove it.

1
2
3

db.system.indexes.find()
	{ "_id" : ObjectId("4d2205c4051f853d46447e95"), "ns" : "green.users",
	"key" : { "addresses.zip" : 1 }, "name" : "zip", "v" : 0 }

If you’re running MongoDB v1.8 or later, you’ll see that an extra key, v, has been added. This version field allows for future changes in the internal index format but should be of little concern to application developers.

To delete an index, you might think that all you need do is remove the index document from system.indexes, but this operation is prohibited. Instead, you can delete indexes using the database command deleteIndexes. As with index creation, there are helpers for deleting indexes but, if you want to run the command itself, you can do that too. It takes as its argument a document containing the collection name and either the name of the index to drop or “*” to drop all indexes. To manually drop the indexes we just created, issue the command like so:

1 2	use green db.runCommand({deleteIndexes: "users", index: "zip"})

That’s the basics of creating and deleting indexes. For what to expect when an index is created, read on.

Building indexes

Most of the time, you’ll want to declare your indexes before putting your application in production. This allows indexes to be built incrementally, as the data is inserted. But there are two cases where you might choose to build an index after the fact. The first case is when you need to import a lot of data before switching to production. For instance, you might be migrating an application to MongoDB and need to seed the database with user information from a data warehouse. You could certainly create the indexes on your user data in advance, but doing so after you’ve imported the data will ensure an ideally balanced index from the start. Furthermore, the net time updating will be minimized. The second and more obvious case of creating indexes on existing data sets is when you have to optimize for new queries.

Regardless of why you’re creating new indexes, the process isn’t always pleasing. For large data sets, building an index can take hours, even days. You can see its progress in the MongoDB logs. Let’s take an example from a data set that we’ll be using in the next section. First, we declare an index to be built:

1	db.values.ensureIndex({open: 1, close: 1})

The index builds in two steps. In the first step, the values to be indexed are sorted. A sorted data set makes for a much more efficient insertion into the B-tree. Notice that the progress of the sort is indicated by the ratio of the number of documents sorted to the total number of documents:

Tue Jan 4 09:58:17 [conn1] building new index on { open: 1.0, close: 1.0 } for stocks.values
	1000000/4308303 23%
	2000000/4308303 46%
	3000000/4308303 69%
	4000000/4308303 92%
	Tue Jan 4 09:59:13 [conn1] external sort used : 5 files in 55 secs

For step two, the sorted values are inserted into the index. Progress is indicated in the same way, and, when complete, the time it took to complete the index build is indicated as the insert time into system.indexes:

1200300/4308303 27%
	2227900/4308303 51%
	2837100/4308303 65%
	3278100/4308303 76%
	3783300/4308303 87%
	4075500/4308303 94%
	Tue Jan 4 10:00:16 [conn1] done building bottom layer, going to commit
	Tue Jan 4 10:00:16 [conn1] done for 4308303 records 118.942secs
	Tue Jan 4 10:00:16 [conn1] insert stocks.system.indexes 118942ms

In addition to examining the MongoDB log, you can check the index build progress by running the currentOp() command:

> db.currentOp()
	{
		"inprog" : [
		{
			"opid" : 58,
			"active" : true,
			"lockType" : "write",
			"waitingForLock" : false,
			"secs_running" : 55,
			"op" : "insert",
			"ns" : "stocks.system.indexes",
			"query" : {
			},
			"client" : "127.0.0.1:53421",
			"desc" : "conn",
			"msg" : "index: (1/3) external sort 3999999/4308303 92%"
		}
		]
	}

WARNING: Be careful declaring indexes

Because it’s so easy to declare indexes, it’s also quite easy to inadvertently trigger an index build. If the data set is large enough, then the build will take a long time. And, in a production situation, this can be a nightmare since there’s no easy way to kill an index build. If this ever happens to you, you’ll have to fail over to a secondary node—if you have one. But the most prudent advice is to treat an index build as a kind of database migration and ensure that your application code never declares indexes automatically.

BACKGROUND INDEXING

If you’re running in production and simply can’t afford to halt access to the database, you can specify that an index be built in the background. Although the index build will still take a write lock, the job will pause to allow other readers and writers to access the database. If your application typically exerts a heavy load on MongoDB, then a background index build will certainly degrade performance, but this may be acceptable under certain circumstances. For example, if you know that the index can be built within a time window where application traffic is at a minimum, background indexing in this case might be a good choice. To build an index in the background, simply specify {background: true when you declare the index. The previous index can be built in the background like so:

1	db.values.ensureIndex({open: 1, close: 1}, {background: true})

OFFLINE INDEXING

If your production data set is too large to be indexed within a few hours, then a more sophisticated plan will be required. This will usually involve taking a replica node offline, building the index on that node by itself, and then allowing the node to catch up with the master replica. This does presume that your replication oplog is large enough to prevent the offline node from becoming stale during the index build.

Backups

Because indexes are hard to build, you definitely want to back them up. Unfortunately, not all backup methods include indexes. For instance, you might be tempted to use mongodump and mongorestore, but these utilities handle documents only. This means that, when you run mongorestore, all the indexes defined for any collections you’ve backed up will be recreated. As always, if your data set is large, the time it takes to build all indexes may be unacceptable.

Consequently, if you want your backups to include indexes, then you’ll want to opt for backing up the MongoDB data files themselves.

Compaction

If your application heavily updates existing data or performs a lot of large deletions, then you may end up with a highly fragmented index. The primary symptom of this is an index size much larger than you’d expect for the given data size, but you may also see certain performance problems. In these cases, you may want to consider rebuilding one or more indexes. You can do this by recreating individual indexes or by running the reIndex command, which will rebuild all indexes in a given collection.

1	db.values.reIndex();

Be careful about reindexing: the command will take out a write lock for the duration of the rebuild, rendering your MongoDB instance unusable for that time. This is best done offline.

Summary

For many developers, indexes are a topic shrouded in mystery. This need not be the case. The purpose of this article was to help you develop a mental model for thinking clearly about indexes. We discussed unique, sparse, and multikey indexes, and provided a number of pointers on index administration.