Here at Fresh Relevance, we process high volumes of real-time data.
Whilst building our product, which provides real-time emails to recover abandoned carts for eCommerce sites, we've found a few things:
We take high-volume streams of incoming event data from our clients' shopping cart systems. Initially, we insert these events into a collection, where they wait to be processed. Once they're processed, we have no further use for them. This left us with a few options for clearing up the temporary data: remove the documents one by one, drop the whole collection, or drop the whole database.
I created a set of tests on GitHub that create a database with 100,000 documents (~1.5GB) in it, then attempt to get rid of them.
Note that the remove test carries some overhead from looping through each ObjectID to call remove against it. However, you'd be doing something similar in real life.
The test is fairly artificial: all the data items are the same shape and contain the same size of data for consistency. However, it does show that, for larger collections or databases, it can often be more efficient to drop the whole collection or database than to remove the items within it.
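To make the three clean-up approaches concrete, here is a minimal sketch in Python. It is not the benchmark code itself. It assumes `collection`, `db` and `client` behave like PyMongo's `Collection`, `Database` and `MongoClient` objects, and the function names are made up for illustration.

```python
def remove_one_by_one(collection):
    """Delete every document individually, one call per ObjectID."""
    # Fetch only the _id of each document, then issue one delete per id.
    ids = [doc["_id"] for doc in collection.find({}, {"_id": 1})]
    for _id in ids:
        collection.delete_one({"_id": _id})
    return len(ids)


def drop_whole_collection(db, name):
    """Discard a collection, documents and indexes, in a single call."""
    db.drop_collection(name)


def drop_whole_database(client, name):
    """Discard a database and every collection inside it in a single call."""
    client.drop_database(name)
```

The first function does work proportional to the number of documents, while the other two are a single request each, which is roughly why dropping tends to win as the data grows.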
Our system is architected so we can drop all the collections in a database at once, so it's more efficient for us to drop the whole database and all the collections within it. This allows us to rotate out old data at low cost.
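One way to picture this kind of rotation is a database per day, with anything older than a retention window dropped wholesale. The sketch below is an assumption for illustration only: the `events_YYYYMMDD` naming scheme, the `PREFIX` constant and both function names are hypothetical, not a description of our actual layout.

```python
from datetime import date, datetime, timedelta

PREFIX = "events_"  # hypothetical naming scheme: events_YYYYMMDD


def database_name_for(day):
    """Name of the database that holds one day's temporary data."""
    return PREFIX + day.strftime("%Y%m%d")


def expired_databases(existing_names, today, keep_days):
    """Return the dated databases that fall outside the retention window."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in existing_names:
        if not name.startswith(PREFIX):
            continue  # leave unrelated databases alone
        try:
            day = datetime.strptime(name[len(PREFIX):], "%Y%m%d").date()
        except ValueError:
            continue  # prefix matches but the suffix is not a date
        if day <= cutoff:
            expired.append(name)
    return sorted(expired)
```

With PyMongo, the names could come from `client.list_database_names()` and each expired name could be passed to `client.drop_database(name)`.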
One of the main attractions of a NoSQL database like MongoDB is the lack of pre-defined schemas. This works well for us, as we have different versions of our scripts pushing in different shapes of data from different cart systems, with different fields present.
Thankfully the DB needs to know very little about the data (bar any indexes) and our code just needs to cope gracefully with missing fields.
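As a small illustration of coping gracefully with missing fields, here is a sketch in Python. The event fields (`email`, `items`, `price`, `qty`) and the function name are hypothetical, chosen only to show the pattern of reading with defaults rather than assuming a field is present.

```python
def summarise_cart_event(event):
    """Build a summary from an event document, tolerating absent fields."""
    items = event.get("items") or []  # missing or null -> treat as an empty cart
    return {
        "email": event.get("email"),  # None when the script did not send it
        "item_count": len(items),
        # A missing price counts as 0 and a missing quantity as 1.
        "total": sum(item.get("price", 0) * item.get("qty", 1) for item in items),
    }
```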
At the same time, having a schema can be a very useful safety net. What happens if one of our data sources is missing fields? What if there's a typo in a field name somewhere? Our data is doomed!?!
We wrote and use a schema validator based on https://github.com/JamesCropcho/variety/ to allow us to run tests on our scripts, then check the shape of the data against a set of expectations.
Fork it here: http://dhendo.github.com/node-mongodb-schema-validator/
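To give a feel for the idea, here is a minimal sketch of checking document shapes against a set of expectations. It is written in Python for illustration and is not the validator linked above, nor its API. The function names and the `{path: type name}` expectation format are assumptions.

```python
def field_types(doc, prefix=""):
    """Map each dotted field path in a document to its Python type name."""
    found = {}
    for key, value in doc.items():
        path = prefix + key
        found[path] = type(value).__name__
        if isinstance(value, dict):
            # Recurse into sub-documents so nested fields get dotted paths.
            found.update(field_types(value, path + "."))
    return found


def check_shape(docs, expected):
    """Compare documents against {path: type name}; return a list of problems."""
    problems = []
    for index, doc in enumerate(docs):
        actual = field_types(doc)
        for path, type_name in expected.items():
            if path not in actual:
                problems.append(f"doc {index}: missing field '{path}'")
            elif actual[path] != type_name:
                problems.append(
                    f"doc {index}: '{path}' is {actual[path]}, expected {type_name}"
                )
        for path in actual:
            if path not in expected:
                # Unexpected paths are often how a typo in a field name shows up.
                problems.append(f"doc {index}: unexpected field '{path}'")
    return problems
```

Run against the output of a test script, an empty list means every document matched the expectations. A misspelt field shows up twice: once as a missing field and once as an unexpected one.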