Hacking Customer Service with Python

Customer support questions come in many forms, some of which are difficult to answer:

  • Why does this screen show different results than it did a week ago?
  • Why does screen A show ten records when screen B shows five?
  • And, a favorite, what don’t you know about this problem?

Answering these questions in a satisfactory way is challenging, and a few large tech companies have avoided this challenge by not providing customer service. SQL was built to map human questions to a data model, and while it doesn’t make the above questions easy, it’s embedded in many existing systems, and it makes sense to work with what we have.

These questions tend to arise when an application applies many rules to achieve it’s results – for instance, Google claims to apply over two hundred factors in it’s ranking algorithms.

If your product uses a relational database, you can get a lot of mileage from extracting SQL queries from a log, finding what where clauses are included, then documenting the result – a process that might be executed by a helpdesk with heavy technical skills.

Consider, however, that the technical community has made great advances in our ability to deal with data, so we ought to be able to automate some of this away.

Consider the following hypothetical query, which searches for files visible to a particular user, with names matching a typed search. While there are better ways to build a system that does this, many do work this way:

SELECT *
FROM files
WHERE
  file_name like '%test%' AND
  parent_folder = 'test_folder' AND 
  ( owner_user = 'gsieling' OR
    file_groups IN 
      ( SELECT name  
        FROM groups 
        WHERE member_user = 'gsieling' ) )

In this example, files may be hidden because of any combination of three filters. These are straightforward, and should be pretty simple to explain to a customer (note that we’re not explaining why). Rather than filtering rows out, it is helpful to transform this query into an equivalent form, which returns each row and whether it should have been included:

SELECT 
   (case when file_name like '%test%' then 1 else 0 end) as name_condition,
   (case when parent_folder = 'test_folder' then 1 else 0 end) as folder_condition,
   (case when owner_user = 'gsieling' then 1 else 0 end) as owner_condition,
   (case when file_groups IN 
      ( SELECT name  
        FROM groups 
        WHERE member_user = 'gsieling' ) then 1 else 0 end) as group_condition
FROM files

Often the first form of these queries are generated by an application layer, but the second form is something I’ve not seen used.

SQL is a complex language, but fortunately there are libraries that handle parsing, which fits this problem well, especially if you’re using an ORM, since the queries will follow a pattern.

A Python library called sql-parse can tokenize and tag queries without validating them, which is a perfect fit.

You can parse a query simply:

stmt = sqlparse.parse(sql)[0]

If we could use this result to find each section of the WHERE clause, we could trivially turn these into CASE statements:

def generate_case(token, index):
  return '(case when (' + token + ') then 1 else 0 end) as a' + str(index)

Sqlparse returns parts of the query as a parse tree, with each object marked with attributes identifying the role a token plays in the query. We need to split the query into the part that is in the WHERE clause, and the part that isn’t. Having done that, we’ll reformat the conditions and place them in the SELECT (those with SQL experience will likely object: we’re not handing nested queries, common table expressions, or conditions on JOINs, but consider that this is a thought experiment)

def remove_where(stmt):
  return [x for x in stmt.flatten() \
     if not x.within(sql.Where)]

def get_where(stmt):
  results = [x for x in stmt.tokens \
     if type(x) is sql.Where]

  if len(results) > 0:
    return results[0]
  else: 
    return None

The tricky part is extracting all the conditions – to do this we can recurse through the entire WHERE clause, pulling out each combination of conditions.

def get_clauses(tokens):
  found = []
  for token in tokens:    
    if (token.is_group() and type(token) is sql.Parenthesis):
      found.append(token.to_unicode())
      found.extend(get_clauses(token.tokens))
    elif (token.is_group()):
      found.append(token.to_unicode())

  return found

To put this back together, we need to find the SELECT within the query. This demonstrates the biggest risk of this type of process: there are many ways to write the same query, so the only way this works is if queries are generated consistently by our application.

def replace_select(tokens, toSelect):
  foundSelect = False
  foundFrom = False
  addedSelect = False

  resultSql = ''
  for token in tokens:
    if token.value.lower() == 'from':
      foundFrom = True

    if not foundSelect or foundFrom:
      resultSql = resultSql + token.to_unicode()
    else:
      if not addedSelect:
        resultSql = resultSql + ' ' + toSelect + ' '
        addedSelect = True

    if token.value.lower() == 'select':
      foundSelect = True

  return resultSql

We now have all we need to re-assemble a new query – the full source and test cases are available on Github.

You might ask, what is the point of all this?

In less than eighty lines of Python, we’ve turned entries in a production log into a powerful tool for answering a client’s tickets.

Not only that, we’ve turned a tedious research task into an interesting engineering challenge, and increased the scope of what a helpdesk can achieve.