Writing a “count distinct” query in RethinkDB

Writing queries to compute aggregates in RethinkDB is fairly simple; not so for “count distinct” style queries.

For this example we want to write a ReQL query that looks like this:

select count(distinct IP), DbID from logs;

The first step is to group the data by your aggregation field, and add a tracking object. We’ll use this map for counting later. Note here that it is tempted to declare this as a nested object inside the return – for me this did not work.

In addition, RethinkDB internally creates it’s own Javascript objects, which is what allows you to execute a series of functions on them. You can’t do “var o = {}”, or it will fail later.

r.db('performance')
 .table('query_timings')
 .group('DbId')
 .map(function(row) {
    var IP = row('IP');
    
    var o = r.object(IP, 1);
   
    return {
      DbId: row('DbId'),
      IP: IP,
      o
    }
  })

To do the counting, we find our previous counting map and merge it (duplicate keys get overridden – that’s where the distinct comes in). Once complete, you can also return the list of distinct values if you like. Again, resist the temptation to think that because this is Javascript you can use existing Javascript code to merge objects (e.g. extend/apply or the “hasOwnProperty” method) – these will not work.

  .reduce(function(left, right) {
    var qL = left('o');
    var qR = right('o');
  
    var m = qL.merge(qR);
      
    return {
      DbId: left('DbId'),
      IPs: m.keys(),
      found: m.keys().count(),
      o: m
    }
  })

Once you have this done, you’ll have objects with two keys, “group” and “reduction”. You can fix this by ungrouping the results like so:

  .ungroup()
  .map(function(group) {
    return {
        'DbId': group('group'),
        'found': group('reduction')('found')
    }
  })

And after a lot of work, you get your data back:

[
  {
    "DbId":  "db027ceeffa00749a2b9e6a11f0003c13c" ,
    "found": 2
  }
]