Here at Fresh Relevance, we process high volumes of real-time data.
Whilst building our product, which sends real-time emails to recover abandoned carts for eCommerce sites, we've learned a few things about working with MongoDB:
Drop, don't delete
We take high-volume streams of incoming event data from our clients' shopping cart systems. Initially, we insert these events into a collection, where they wait to be processed. Once they're processed, we have no further use for them. This left us with some options for how to clear up the temporary data:
- Remove each record after we've processed it
- Remove all records at the end
- Drop the collection at the end
- Drop the database at the end
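As a rough illustration, the four options above might look like this with a pymongo-style driver. This is a minimal sketch, not our production code: the function names and the `process` callback are hypothetical, and the functions simply call methods on whatever client, database or collection object they are handed.

```python
# Sketch of the four cleanup options, assuming pymongo-style objects.
# All function names and the `process` callback are hypothetical.

def remove_each(collection, process):
    """Option 1: remove each record right after processing it."""
    for doc in collection.find():
        process(doc)
        collection.delete_one({"_id": doc["_id"]})


def remove_all(collection):
    """Option 2: remove every record with a single call at the end."""
    collection.delete_many({})


def drop_collection(db, coll_name):
    """Option 3: drop the whole collection at the end."""
    db.drop_collection(coll_name)


def drop_database(client, db_name):
    """Option 4: drop the whole database (and every collection in it)."""
    client.drop_database(db_name)
```

With a real server you would pass in a `pymongo.MongoClient`, one of its databases, or one of its collections.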
I created a set of tests on GitHub that create a database with 100,000 documents (~1.5 GB) in it, then attempt to get rid of them.
Note that there's some overhead in the per-record removal test due to looping through each ObjectID to call remove against it. However, you'd be doing something similar in real life.
The test is fairly artificial: all the data items are the same shape and contain the same size of data for consistency. However, it does show that, for larger collections and databases, it can often be more efficient to drop the whole collection or database than to remove the items within it.
Our system is architected so that all the collections in a database can be discarded at the same time, which makes it more efficient for us to drop the whole database than to drop each collection within it. This allows us to rotate out old data at low cost.
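One way such a rotation could work is to write each day's events to its own database and drop any database older than a retention window. The sketch below assumes a hypothetical `events_YYYYMMDD` naming scheme and a pymongo-style client; the names, the seven-day default and the scheme itself are illustrative assumptions, not a description of our actual setup.

```python
from datetime import date, timedelta

PREFIX = "events_"  # hypothetical naming scheme: events_YYYYMMDD


def db_name_for(day):
    """Name of the (hypothetical) per-day database for a given date."""
    return PREFIX + day.strftime("%Y%m%d")


def expired_db_names(names, today, keep_days):
    """Return the per-day database names older than the retention window."""
    cutoff = db_name_for(today - timedelta(days=keep_days))
    expired = []
    for name in names:
        suffix = name[len(PREFIX):]
        # Only consider names that match the scheme exactly.
        if name.startswith(PREFIX) and len(suffix) == 8 and suffix.isdigit():
            if name < cutoff:  # fixed-width dates sort chronologically
                expired.append(name)
    return sorted(expired)


def rotate(client, today, keep_days=7):
    """Drop every expired per-day database using a pymongo-style client."""
    for name in expired_db_names(client.list_database_names(), today, keep_days):
        client.drop_database(name)
```

Keeping the name filtering in a pure function means the retention logic can be tested without a running server.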
Schemaless is Great
One of the main attractions of a NoSQL database like Mongo is the lack of pre-defined schemas. This works well for us, as we have different versions of our scripts pushing in different shapes of data from different cart systems, with different fields present.
Thankfully, the DB needs to know very little about the data (bar any indexes), and our code just needs to cope gracefully with missing fields.
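Coping gracefully with missing fields mostly means supplying defaults instead of assuming a key is present. Here is a small sketch; the field names (`items`, `price`, `qty`) are made up for illustration and are not our real event format.

```python
def cart_total(event):
    """Total value of a cart event, tolerating missing fields.

    Field names here are hypothetical examples.
    """
    items = event.get("items") or []  # missing or null -> empty cart
    return sum(
        float(item.get("price", 0)) * int(item.get("qty", 1))
        for item in items
    )
```

An event with no `items` field at all yields a total of zero rather than raising a `KeyError`.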
Schemas are Great
At the same time, having a schema can be a very useful safety net. What happens if one of our data sources is missing fields? What if there's a typo in a field name somewhere? Our data is doomed!
We wrote and use a schema validator based on https://github.com/JamesCropcho/variety/ to allow us to run tests on our scripts, then check the shape of the data against a set of expectations.
Fork it here: http://dhendo.github.com/node-mongodb-schema-validator/
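To give a flavour of what checking the shape of the data against a set of expectations involves, here is a minimal sketch. It is not the validator's actual API: `check_shape`, the expectation format and the example field names are all assumptions made for illustration.

```python
def check_shape(docs, expected):
    """Compare documents against expected fields and types.

    `expected` maps field name -> Python type. Returns a list of
    human-readable problems; an empty list means every document matched.
    """
    problems = []
    for i, doc in enumerate(docs):
        for field, typ in expected.items():
            if field not in doc:
                problems.append(f"doc {i}: missing field '{field}'")
            elif not isinstance(doc[field], typ):
                problems.append(f"doc {i}: field '{field}' is not {typ.__name__}")
        for field in doc:
            # Unexpected fields are often typos of expected ones.
            if field not in expected and field != "_id":
                problems.append(f"doc {i}: unexpected field '{field}'")
    return problems
```

Run against the output of a test script, a typo such as `emial` shows up twice: once as a missing `email` and once as an unexpected field.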