In a previous article I proposed a thought experiment: one might model SQL execution plans in Javascript, and use this to design a database that supports explain plans. Such a system might allow new query languages to be built, or algorithms to be dropped into a data processing application (for instance, one might want to be able to do "create index" on a LINQ query that actually runs against in-memory data).
In this article I will explore the problem from a different angle: a data structure that can handle massive amounts of data while keeping only a small portion of it in memory at a time.
In both R and Python/Pandas there is rich support for an object called a DataFrame, which bears a strong resemblance to a database table. If you load files into these, you will eventually hit one large enough to run out of memory, even on a machine with plenty of RAM. Once that happens you have to resort to other techniques – it would be preferable to keep the programming model and have the tools handle the data volume gracefully.
First, it is valuable to have a fairly large database table to test against. Google n-grams provides a good example: these are word and phrase counts from Google Books (scanned books), grouped by year. There are many thousands of these files – I chose the file of two-word phrases starting with "th", which is 30 GB of text.
The following Unix command splits the file into small blocks – I determined the block size experimentally, aiming for files averaging around 64 KB, as this is a plausible value for disk block sizes in a relational database.
split -l 2500 -a 4 googlebooks-eng-all-2gram-20120701-th
This generates loads of files, named sequentially: xaaaa, xaaab, xaaac, xaaad, and so on.
Any one of these files can be loaded trivially into an R data frame, like so, whereas the full file would be far too large for this:
data <- read.table("xaaaa", sep="\t",
                   col.names=c("word", "year", "times", "books"))
class(data)
[1] "data.frame"
Year ranges run from 1800 to 2008. Here is an example of a few rows:
phrase        year   # times   # books
The stupid    1800         6         6
The stupid    1801        13        13
The stupid    1802        11        11
The stupid    1803        11         8
The stupid    1804        12        12
The stupid    1805         3         3
The stupid    1806         6         6
The stupid    1807         6         6
The stupid    1808         3         3
The stupid    1809        11        11
The stupid    1810        17        17
The stupid    1811        17        16
The stupid    1812         4         4
The stupid    1813         7         5
You can also run queries on this data, although this obviously only searches the loaded block.
subset(data, select=c(word, year))
2489 THE GROWL 1885
2490 THE GROWL 1887
2491 THE GROWL 1903
2492 THE GROWL 1908
2493 THE GROWL 1952
2494 THE GROWL 1956
2495 THE GROWL 1958
subset(data, select=c(word, year), subset=(year == 1968))
word year
140 THE ED_NOUN 1968
296 THE EFFORT_NOUN 1968
356 THE EMBROIDERY_NOUN 1968
535 THE ENCYCLOPAEDIA 1968
A typical filesystem stores blocks in a B+ tree for quick insertion and lookup. These trees are usually short and wide, which reduces the number of hops needed to find a record, and the leaf nodes link from one to the next, which supports scanning ranges of the tree.
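As a rough sketch of the node layout (illustrative only – this is not the structure of the library used below), interior nodes only route lookups by key, while the leaves hold the records and point sideways at the next leaf:

// Interior nodes route searches; leaf nodes hold records and link to the
// following leaf, which is what makes range scans cheap.
var interior = { keys: [100, 200], children: [ /* child nodes */ ] };
var leaf = {
    keys: [1, 2, 3],
    records: ["row 1", "row 2", "row 3"],
    nextLeaf: null    // set to the next leaf in key order
};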
I found a workable Javascript implementation of a B+ tree, which I’ve used for demonstration in this article.
The API for this implementation is pretty easy to follow. It has an interesting property: when you do a "seek", the tree stays at that location until the next seek, so you can traverse along the bottom of the tree if you wish.
t = new tree(3)
t.insert(1, "data1")
t.insert(2, "data2")
t.seek(2)
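After a seek, the stored record can be read back through recnum – I'm inferring this from how recnum is used later in this article, rather than from any documentation:

t.recnum    // "data2" - the value inserted under the key we just sought
t.seek(1)
t.recnum    // "data1"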
To populate this tree, we need to add our block numbers – this tree requires integers for the keys. Since the block files created above have suffixes "aaaa", "aaab", and so on, each suffix is effectively a base-26 number, so we can convert it to an integer:
function location_int(file) {
    // Treat the tail of the suffix as a base-26 number ('a' = 0 ... 'z' = 25),
    // working from the least significant (rightmost) character upwards.
    var block = file.substring(1);
    var power = 1;
    var result = 0;
    for (var i = 0; i < block.length; i++) {
        result += power * (block.charCodeAt(block.length - i - 1) -
                           'a'.charCodeAt(0));
        power *= 26;
    }
    return result;
}
undefined
location_int('aaab')
1
location_int('aaac')
2
location_int('aaad')
3
location_int('aabe')
30
location_int('aaba')
26
location_int('aqym')
11452
It’s pretty easy to go the other way too:
function getFile(file_num) {
    // toString(26) produces digits 0-9 and a-p; remap them onto a-z
    var res = file_num.toString(26);
    var filename = '';
    for (var i = 0; i < res.length; i++) {
        var code = res.charCodeAt(i);
        if (code >= 97 && code <= 122)
            filename += String.fromCharCode(code + 10);      // 'a'-'p' (10-25) -> 'k'-'z'
        else
            filename += String.fromCharCode(code - 48 + 97); // '0'-'9' (0-9) -> 'a'-'j'
    }
    // pad to four characters and restore split's 'x' prefix
    while (filename.length < 4) {
        filename = 'a' + filename;
    }
    return 'x' + filename;
}
getFile(0)
"xaaaa"
getFile(14522)
"xavmo"
getFile(11452)
"xaqym"
Now that we’ve defined these relationships, we can just insert all the numbers in the tree, like so:
for (var i = 1; i < 14521; i++) {
    (function() {
        var msg = "gary" + i;   // capture a fresh value for each node's closure
        t.insert(i, { getData: function() { return msg; } });
    })();
}
t.seek(1000)
t.recnum.getData()
At first glance this seems a bit stupid - we've invented the world's most complicated array. But we've stumbled across a neat database feature by accident: index-organized tables. This table happens to be pre-sorted by word and then by year, which gives certain queries a performance advantage - for instance, a "count(distinct word)" can be computed without maintaining a large set of words in memory.
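As a sketch of why the sort order helps (countDistinctWords is a made-up helper, and rows is assumed to be the parsed block format used later in this article, with the phrase in column 0): because equal words sit next to each other, a distinct count only has to remember the previous word, not every word seen.

// Count distinct words in rows already sorted by word - no set needed,
// since all occurrences of a word are adjacent.
function countDistinctWords(rows) {
    var count = 0;
    var previous = null;
    for (var i = 0; i < rows.length; i++) {
        var word = rows[i][0];      // column 0 holds the phrase
        if (word !== previous) {
            count++;
            previous = word;
        }
    }
    return count;
}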
In a relational database, storage doesn't necessarily keep rows in any particular order. Each insert may fill a slot in an existing page, so several sequential inserts can scatter data across random locations. Interestingly, it is possible to influence paging in Oracle during bulk operations: "TRUNCATE TABLE" immediately deallocates all of a table's pages, skipping rollback and never inspecting their contents, while "INSERT /*+ append */" writes to fresh pages, skipping the step of looking for empty slots in existing ones (good for a single massive insert, and very slow for many small inserts).
Note that the objects at each tree node hold no data to begin with - when a node is accessed, its block will be loaded into memory. Since I'm doing this in a browser, I'm going to load a block using Ajax. Once a block has been loaded, the getData method replaces itself with a memoized copy of the data, which helps subsequent queries (this is why a query often runs faster the second time you run it, and how one query can affect the performance of subsequent queries):
function ajaxData() {
    // 'this' is the tree node holding the block name; fetch its file synchronously
    var node = this;
    var data = null;
    $.ajax({
        url: "blocks/" + node.block,
        async: false
    }).done(function(blockData) {
        data = blockData.split("\n");
        for (var i = 0; i < data.length; i++) {
            data[i] = data[i].split("\t");
        }
    });
    node.getData = function() {
        return data; // memoized!
    };
    return data;
}
This solves the problems of loading and caching data blocks, but we're still going to run out of RAM if we run a long query - we need a way to kick blocks out of the system, and revert to the 'unloaded' state.
var pages = new Array(10)   // fixed pool of ten in-memory pages
var pageIdx = 0
function mallocData() {
    var newPage = malloc(this)
    return newPage.parent.getData(); // now points at ajaxData, which triggers the fetch
}
function malloc(parent) {
    pageIdx = (pageIdx + 1) % pages.length;
    var page = pages[pageIdx];
    if (!page) {
        page = pages[pageIdx] = {};        // first use of this slot
    }
    if (page.parent) {
        page.parent.getData = mallocData   // evict: the old node reverts to 'unloaded'
    }
    page.parent = parent;
    parent.getData = ajaxData
    return page;
}
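To make the lifecycle concrete, here is a hypothetical walk-through of a single node (the node literal is illustrative; the state transitions follow the functions above):

// A node starts out 'unloaded', with getData pointing at mallocData.
var node = { block: "xaaab", getData: mallocData };

node.getData();   // claims a page from the pool, swaps in ajaxData, fetches the block
node.getData();   // memoized - returns the cached rows without another request

// After ten other nodes have claimed pages, this slot gets recycled:
// malloc resets node.getData back to mallocData, the cached rows become
// unreachable, and the garbage collector can free them.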
Now, we can combine this with the tree.
function buildTree() {
    var t = new tree(3);
    for (var i = 1; i < 14521; i++) {
        var block = getFile(i);
        t.insert(i, { block: block, getData: mallocData });
    }
    return t;
}
t = buildTree()
t.seek(1000)
t.recnum
Object {block: "xabmm", getData: function}
block: "xabmm"
t.recnum.getData()
Array[2501]
[0 … 99]
0: Array[4]
0: "thoroughly disrupts"
1: "2008"
2: "10"
3: "9"
This shows the first lines of the 1000th file. If you then touch eleven different blocks, the first block is kicked out of the pool and eventually freed by the garbage collector (the pool of ten pages is set low for testing purposes - in practice you would allow far more blocks to stay in memory).
This, I suggest, is a good implementation of a DataFrame for large tables - it supports "show me the first 100 rows" queries common in investigation, and scales well to paging through troublesomely large data sets.
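For example, a "first n rows" read against the tree built above might look like the sketch below (firstRows is a made-up helper; it just walks blocks in key order until it has collected enough rows):

// Read the first n rows by paging blocks in, in key order.
function firstRows(t, n) {
    var rows = [];
    var blockNum = 1;               // keys start at 1 in buildTree above
    while (rows.length < n) {
        t.seek(blockNum++);
        rows = rows.concat(t.recnum.getData());
    }
    return rows.slice(0, n);
}

firstRows(t, 100);   // touches only the first block; the rest stay on disk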
One could loop over this table to generate an index - a mapping of entries like {"the dog": "xamym"} - which tells you where to start looking for a specific record; once this index was built, lookups would be quite fast. Properly implemented, such indexes could be used behind the R or Pandas DataFrame APIs (or a system like LINQ) to support dramatic performance improvements, without re-architecting a system that crumbles under load.
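A minimal sketch of building such an index (buildWordIndex is a name I'm inventing here; it only uses the tree and block layout defined above, with the phrase in column 0 of each row):

// Scan every block once, recording the first block in which each phrase appears.
function buildWordIndex(t, blockCount) {
    var index = {};
    for (var i = 1; i < blockCount; i++) {
        t.seek(i);
        var file = t.recnum.block;
        var rows = t.recnum.getData();    // pages the block in (and may evict another)
        for (var j = 0; j < rows.length; j++) {
            var word = rows[j][0];
            if (!(word in index)) {
                index[word] = file;       // e.g. {"the dog": "xamym"}
            }
        }
    }
    return index;
}

var wordIndex = buildWordIndex(t, 14521);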