In building a website scraper with Chrome and Node.js, I made mistakes that led to connection leaks. In this application, the scraper runs in the browser and connects to a Node.js server, which saves the data to a database.
Once you know what these issues look like they are easy to spot, but otherwise they are difficult to diagnose without inspecting the code. The database driver, node-postgres, provides two modes of connection management: you either use the built-in connection pooling or manage connections on your own. As with connection management in any language, the distinction can be subtle, even though the results are drastically different. If you use the node-postgres callback API, the pooled connection is released automatically once its work completes:
var pg = require('pg');
var conString = "tcp://postgres:1234@localhost/postgres";

pg.connect(conString, function(err, client) {
  if (err) return console.error('could not connect', err);
  // the pooled client is released automatically when its query queue drains
  client.query('SELECT NOW() as "when"', function(err, result) {
    if (err) return console.error('query failed', err);
  });
});
On the other hand, if you construct a connection object yourself, you are responsible for closing it:
var client = new pg.Client(conString);
client.connect();

var query = client.query("...");

// you must keep track of the client and close it yourself
query.on('end', function() {
  client.end();
});
If you refactor code from one style to the other, you must take care that every code path still releases its connection. The key to finding the issue is to eliminate servers from the mix one at a time. There are numerous limited resources that can run out in this environment: database connections, network TCP connections, TCP connections per process, memory, and per-process open file handles. Some require OS configuration, while others are exhausted by application defects.
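As an illustration, here is a minimal sketch of the leak pattern such a refactoring can introduce (the handler and query are hypothetical): the error path returns early, so the hand-managed client is never closed.

// hypothetical request handler illustrating the leak pattern
function handleRequest(req, res) {
  var client = new pg.Client(conString);
  client.connect();

  client.query("SELECT ...", function(err, result) {
    if (err) {
      res.end('error');
      return;              // early return: client.end() is never called
    }
    res.end(JSON.stringify(result.rows));
    client.end();          // only the success path releases the connection
  });
}

Under load, every failed request strands one connection, until the pool or the database runs out.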
To narrow the issue down, I used Apache Bench (ab), installed through XAMPP, to eliminate the front end (browser/parser). It's a little tricky to send POST requests correctly with ab, so I proxied the test requests through Fiddler, which is a little tricky in itself. The biggest win was installing pgbouncer on the Node.js server. On its own, this eliminates most of the memory issues (even with connection leaks!), despite node-postgres having its own connection pool. The real value, though, is the log message pgbouncer emits when instantiating a connection fails: you can clearly see whether the failure was the connection from the application to pgbouncer, or the connection out from pgbouncer to the database.
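For reference, an ab invocation along these lines replays POST traffic through a local Fiddler proxy (the endpoint, payload file, and concurrency numbers here are hypothetical; 8888 is Fiddler's default listening port):

ab -n 5000 -c 10 -p payload.json -T 'application/json' \
   -X 127.0.0.1:8888 http://localhost:3000/scrape

And a minimal pgbouncer.ini sketch for this setup, assuming Postgres on its default port and pgbouncer listening on 6432; the application's conString then points at port 6432 instead of 5432:

[databases]
postgres = host=127.0.0.1 port=5432 dbname=postgres

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = session
max_client_conn = 100
default_pool_size = 20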
I wish the API tracked where connections were created and then logged any unreleased sessions when the request ended. This is fairly easy to set up in Java/Tomcat with a finalizer. In theory, the web server should be able to reclaim memory easily after a request finishes, but in a single-threaded server like Node.js, anything a request leaks lives on in the shared process. This is likely a significant architectural downside of the model. On the other hand, it forces you to fix every leak, which may be better in the long term.
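A rough sketch of that kind of tracking in Node, capturing a stack trace when each client is created and periodically reporting any that were never released (this wrapper is my own invention, not part of node-postgres):

var pg = require('pg');

var open = {};    // id -> stack trace captured at creation time
var nextId = 0;

function trackedClient(conString) {
  var client = new pg.Client(conString);
  var id = nextId++;
  open[id] = new Error('connection opened here').stack;

  var realEnd = client.end.bind(client);
  client.end = function() {
    delete open[id];          // released cleanly, stop tracking it
    return realEnd();
  };
  return client;
}

// periodically report connections that were never released
setInterval(function() {
  Object.keys(open).forEach(function(id) {
    console.warn('unreleased connection:\n' + open[id]);
  });
}, 60000);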
To detect this type of issue, my script needs to run for a while. My application typically died after parsing about 5,000 pages. After fixing the leaks, it makes it to at least 15,000 pages and uses about 80% less memory: it goes from being memory-bound to CPU-bound.
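During those long runs, a simple periodic heap log is enough to tell a leak's steady climb from normal churn; a sketch (the interval is arbitrary):

// log memory usage periodically during a long scraping run
setInterval(function() {
  var mem = process.memoryUsage();
  console.log('rss=' + (mem.rss / 1048576).toFixed(1) + 'MB, ' +
              'heapUsed=' + (mem.heapUsed / 1048576).toFixed(1) + 'MB');
}, 30000);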