Compare two large sets, title

I recently had to find in one of our big database if a large set a users did a specific action. With no time frame. That means that potentially millions of ids must be compared to dozens of millions actions, and growing.

This beginning to be small big data…

MongoDB

Our visitors database is managed by a MongoDB replica set. Of course, my first attempt was to build my query with $or:

db.users.count({ $or : [ { "uuid" : 123 }, { "uuid" : 124 } ... ] });

But I was disappointed by the performance, the query could take more 30 minutes… This is really not what we ask to a noSql database! And, as the data is growing every day and the query will certainly be run more than once a day, it was not acceptable.

The Solution

After a lot of tries to improve it I was stuck, and the solution came from an amazing answer on StackOverflow from Pascal Bugnion : $in.

I was not a big fan of $in in MongoDB, because one of my script performed very bad when I was including a $in, even on an indexed field.

And indeed, as benchmarked by Pascal, my query which took more that 30 Minutes, in less than 9.46 seconds. This is pretty spectacular and more what I expect from MongoDB (note that the Mongo is version 3 but I did not use the new WiredTiger -yet).

30+ minutes to < 10 seconds

So the new query looks more like:

db.users.count({ "uuid" : { $in : [ 123 , 124 , 125 ... ] } });

If you think you can improve that, do not hesitate!

Links