Recursion in the wild: tweet filter recommendations

A recursive function is a function defined in terms of itself. How can a function call itself and not lead to an infinite loop? As long as the recursive function eventually reaches a point where it returns a value that is not an invocation of itself, all prior invocations of the function can resolve to some value based on that value, and all is well.

For example, we might have a recursive function that returns the nth number in the Fibonacci sequence.

def fib(n):
    if n < 0:
        raise ValueError('n must be greater than or equal to 0')
    elif n == 0 or n == 1:
        return n
    else:
        return fib(n - 1) + fib(n - 2)

By definition, the 0th Fibonacci number is 0, and the first Fibonacci number is 1. The subsequent Fibonacci numbers equal the sum of the previous two Fibonacci numbers. So in the case where n = 2, our recursively defined function returns fib(1) + fib(0), which is 1 + 0 = 1. All greater Fibonacci numbers can similarly reference the smaller Fibonacci numbers, which derive from the base cases of 1 and 0. Because our function only calls itself with an integer decremented by 1 or 2, we are guaranteed to reach these base cases, and so our function will always resolve.

Recursion is often useful for searching a tree or any other data structure with arbitrary depth. Along these lines, here’s a real-world use of recursion: wynno uses recursion in making tweet filter recommendations based on a user’s votes about the tweets they like and dislike.

To recommend filters to a user, wynno considers all the tweets a user has voted on and looks for combinations of features which consistently elicit like or dislike votes. It’s this combination of features that introduces an opportunity for recursion.

Let’s define a function find_winning_combos that returns all combinations of tweet features that pass a test: the user must have voted on at least min_tweets number of tweets possessing the feature combination, and the user always liked or always disliked tweets containing that combo.

import numpy as np
from sklearn.feature_extraction import DictVectorizer

def find_winning_combos(tweets, votes, min_tweets):
    """ tweets is a list of dictionaries whose keys are binary indicators of the
    features each tweet contains, e.g. contains_hashtag, contains_picture, is_retweet,
    etc.  votes is a list containing 1's and 0's, corresponding to whether the
    user likes (1) or dislikes (0) each tweet in tweets. """

    # vectorize tweets
    vec = DictVectorizer()
    tweets_vector = vec.fit_transform(tweets).toarray()

    # create list of feature names
    feature_names = vec.get_feature_names()

    # append vote on to each vectorized tweet
    tweets_vector_with_votes = np.hstack((tweets_vector, np.atleast_2d(np.array(votes)).T))
        # tweets_vector is a numpy array, so we need to use numpy methods
        # Cf. http://stackoverflow.com/questions/5064822/how-to-add-items-into-a-numpy-array

    results = []

    def recursive_search(vector, combo_so_far):
        """ For each feature contained by at least min_tweets tweets,
            check if that feature corresponds to a 100% or 0% voting percentage.
            If it does, add that feature to results (a closure var).
            Also, search all tweets with that current feature likewise,
            i.e. for other features contained by at least min_tweets tweets and
            which have a 100% or 0% voting percentage.

            Note each row/tweet in vector must have the vote on the tweet as its last element. """

        # in each column except the vote column (i.e. for each feature),
        # count number of rows with non-zero elements
        column_tallies = (vector[:, 0:-1] != 0).sum(0) # Cf. http://stackoverflow.com/a/3797190
        for index, tally in enumerate(column_tallies):
            if tally >= min_tweets:
                sum_votes = np.sum(vector[vector[:, index] != 0][:, -1], axis=0)
                like_pct = sum_votes * 1.0 / tally
                if like_pct == 0 or like_pct == 1:
                    winning_combo = combo_so_far[:]
                    winning_combo.append(feature_names[index])
                    results.append({
                        'features': winning_combo,
                        'like_pct': like_pct,
                        'num_votes': tally,
                    })

                    # create a new subarray to be used in the recursive search

                    # it contains only rows (i.e. tweets) containing the current feature
                    sub = vector[vector[:, index] != 0]

                    # replace the current feature/column with 0's
                    # so that we don't double-count it / get stuck in an infinite loop
                    sub[:, index] = 0
                        # note that we don't want to actually remove this column, because
                        # then index wouldn't point us correctly within feature_names

                        # note also that we do want to consider columns < index because
                        # it's possible that when we consider only the tweets
                        # possessing the current feature, previous features
                        # which didn't pass the test will pass
                            # this will result in duplicate results with merely
                            # different ordering of features--in production env,
                            # we could prune these in various ways

                    recursive_search(sub, winning_combo)

    recursive_search(tweets_vector_with_votes, [])
    return results

Let’s walk through how this works. First we take a list of tweet dictionaries (tweets) and a list of votes on those tweets (votes), and we use scikit-learn to create a numpy array where the rows are tweets, and the columns are binary indicators (1 or 0) of the presence of a feature, and the last column contains the user’s vote on the tweet.

Then we define a recursive_search function which is going to actually find the combinations of features that pass our test. Inside this function, we first tally up for each feature how many tweets possess it. If a tally is greater than or equal to min_tweets, we calculate like_pct, the percentage of votes on the tweets which were likes. If the user always liked the tweets or always disliked them (that is, if like_pct is 1 or 0), the feature is a winner; it’s obviously informative about the user’s taste in tweets. So we add it to our results closure variable. But in the interest of making the most specific filter recommendations possible, we want to find whether this winning feature co-occurs with any others. So we apply the same search process to a sort of sub-array of the current one, containing only the tweets possessing the current feature. Recursion!

The results of find_winning_combos can then easily be parsed into recommendations for tweet filters. And if we want to prioritize our recommendations by the amount of information supporting them (i.e. the number of votes), and prioritize more specific filters over cruder filters, we could sort these results like so:

sorted_results = sorted(results, key=lambda k: (-1 * k['num_votes'], -1 * len(k['features'])))