Recursion in the wild: tweet filter recommendations

A recursive function is a function defined in terms of itself. How can a function call itself and not lead to an infinite loop? As long as the recursive function eventually reaches a point where it returns a value that is not an invocation of itself, all prior invocations of the function can resolve to some value based on that value, and all is well.

For example, we might have a recursive function that returns the nth number in the Fibonacci sequence.

def fib(n):
    if n < 0:
        raise ValueError('n must be greater than or equal to 0')
    elif n == 0 or n == 1:
        return n
        return fib(n - 1) + fib(n - 2)

By definition, the 0th Fibonacci number is 0, and the first Fibonacci number is 1. The subsequent Fibonacci numbers equal the sum of the previous two Fibonacci numbers. So in the case where n = 2, our recursively defined function returns fib(1) + fib(0), which is 1 + 0 = 1. All greater Fibonacci numbers can similarly reference the smaller Fibonacci numbers, which derive from the base cases of 1 and 0. Because our function only calls itself with an integer decremented by 1 or 2, we are guaranteed to reach these base cases, and so our function will always resolve.

Recursion is often useful for searching a tree or any other data structure with arbitrary depth. Along these lines, here’s a real-world use of recursion: wynno uses recursion in making tweet filter recommendations based on a user’s votes about the tweets they like and dislike.

To recommend filters to a user, wynno considers all the tweets a user has voted on and looks for combinations of features which consistently elicit like or dislike votes. It’s this combination of features that introduces an opportunity for recursion.

Let’s define a function find_winning_combos that returns all combinations of tweet features that pass a test: the user must have voted on at least min_tweets number of tweets possessing the feature combination, and the user always liked or always disliked tweets containing that combo.

import numpy as np
from sklearn.feature_extraction import DictVectorizer

def find_winning_combos(tweets, votes, min_tweets):
    """ tweets is a list of dictionaries whose keys are binary indicators of the
    features each tweet contains, e.g. contains_hashtag, contains_picture, is_retweet,
    etc.  votes is a list containing 1's and 0's, corresponding to whether the
    user likes (1) or dislikes (0) each tweet in tweets. """

    # vectorize tweets
    vec = DictVectorizer()
    tweets_vector = vec.fit_transform(tweets).toarray()

    # create list of feature names
    feature_names = vec.get_feature_names()

    # append vote on to each vectorized tweet
    tweets_vector_with_votes = np.hstack((tweets_vector, np.atleast_2d(np.array(votes)).T))
        # tweets_vector is a numpy array, so we need to use numpy methods
        # Cf.

    results = []

    def recursive_search(vector, combo_so_far):
        """ For each feature contained by at least min_tweets tweets,
            check if that feature corresponds to a 100% or 0% voting percentage.
            If it does, add that feature to results (a closure var).
            Also, search all tweets with that current feature likewise,
            i.e. for other features contained by at least min_tweets tweets and
            which have a 100% or 0% voting percentage.

            Note each row/tweet in vector must have the vote on the tweet as its last element. """

        # in each column except the vote column (i.e. for each feature),
        # count number of rows with non-zero elements
        column_tallies = (vector[:, 0:-1] != 0).sum(0) # Cf.
        for index, tally in enumerate(column_tallies):
            if tally >= min_tweets:
                sum_votes = np.sum(vector[vector[:, index] != 0][:, -1], axis=0)
                like_pct = sum_votes * 1.0 / tally
                if like_pct == 0 or like_pct == 1:
                    winning_combo = combo_so_far[:]
                        'features': winning_combo,
                        'like_pct': like_pct,
                        'num_votes': tally,

                    # create a new subarray to be used in the recursive search

                    # it contains only rows (i.e. tweets) containing the current feature
                    sub = vector[vector[:, index] != 0]

                    # replace the current feature/column with 0's
                    # so that we don't double-count it / get stuck in an infinite loop
                    sub[:, index] = 0
                        # note that we don't want to actually remove this column, because
                        # then index wouldn't point us correctly within feature_names

                        # note also that we do want to consider columns < index because
                        # it's possible that when we consider only the tweets
                        # possessing the current feature, previous features
                        # which didn't pass the test will pass
                            # this will result in duplicate results with merely
                            # different ordering of features--in production env,
                            # we could prune these in various ways

                    recursive_search(sub, winning_combo)

    recursive_search(tweets_vector_with_votes, [])
    return results

Let’s walk through how this works. First we take a list of tweet dictionaries (tweets) and a list of votes on those tweets (votes), and we use scikit-learn to create a numpy array where the rows are tweets, and the columns are binary indicators (1 or 0) of the presence of a feature, and the last column contains the user’s vote on the tweet.

Then we define a recursive_search function which is going to actually find the combinations of features that pass our test. Inside this function, we first tally up for each feature how many tweets possess it. If a tally is greater than or equal to min_tweets, we calculate like_pct, the percentage of votes on the tweets which were likes. If the user always liked the tweets or always disliked them (that is, if like_pct is 1 or 0), the feature is a winner; it’s obviously informative about the user’s taste in tweets. So we add it to our results closure variable. But in the interest of making the most specific filter recommendations possible, we want to find whether this winning feature co-occurs with any others. So we apply the same search process to a sort of sub-array of the current one, containing only the tweets possessing the current feature. Recursion!

The results of find_winning_combos can then easily be parsed into recommendations for tweet filters. And if we want to prioritize our recommendations by the amount of information supporting them (i.e. the number of votes), and prioritize more specific filters over cruder filters, we could sort these results like so:

sorted_results = sorted(results, key=lambda k: (-1 * k['num_votes'], -1 * len(k['features'])))

Using Angular and Redis to rate-limit requests to the Twitter API

Twitter’s API limits an application to 15 requests on behalf of the same Twitter user in any 15-minute period. If your application allows an end-user to trigger calls to the Twitter API—say, for instance, your app is a Twitter client like wynno that allows users to scroll their timeline indefinitely into the past—you need to ensure your users can’t cause your application to exceed this rate limit.


You can and should implement some logic on the client side of your application that prevents regular users from making too many requests. In the case of an Angular app like wynno, we can keep track of our requests to Twitter by creating a service:

  .factory('TwitterService', ['$q', '$http', function($q, $http) {
    var service = {
      lastGet: null,
      getTweets: function() {
        var d = $q.defer();
        var timeSinceLastGet = service.lastGet ?
            new Date().getTime() - service.lastGet :
        if (timeSinceLastGet && timeSinceLastGet < 60000) {
          d.reject('Please try again in ' +
            Math.ceil((60000 - timeSinceLastGet) / 1000) +
            ' seconds.');
        } else {
          $http({ method: 'POST', url: '/my_server_endpoint', data: {} })
            .success(function(tweetData, status) {
              service.lastGet = new Date().getTime();
            .error(function(reason, status) {
              // Conceivably, if reason informed us that error occurred after server
              // successfully called Twitter API, we would want to update
              // service.lastGet here too.
        return d.promise;

    return service;

The code here implements a simple guard against making more than one request per minute: Whenever getTweets is called, we check to see if lastGet is from less than 60 seconds ago. If it is, we’re finished, and resolve the promise with a rejection. If it isn’t, we fire off a request to our server, which in turn will make a request to the Twitter API and send us back the tweet data it gets. When we receive this tweet data, we resolve the promise with it and update lastGet.

An alternative approach, which would accommodate up to 15 requests at any interval in a 15-minute period, would be to keep track of the timestamp of the 15th-to-last request, and check each time getTweets is called whether that timestamp is from less than 15 minutes ago.


We also need to implement some kind of rate-limiting logic on the server side. Just as in other aspects of a web application, like user authentication and input validation, client-side enforcement is not sufficient. You can’t rely on requests to your server always following the logic of your client-side application. A malicious user could easily curl 16 requests in rapid succession to the server endpoint used by getTweets. Or in the case of wynno, which fetches new tweets from Twitter when the app first loads up, simply refreshing the browser more than 15 times in 15 minutes would do the trick.

So our server needs to keep track of the requests it makes to Twitter on behalf of each user. Note that that is on behalf of each Twitter user, not simply each user session of our app. If the same person used our app on a desktop and a phone, that would be two different sessions, but Twitter’s API would issue the same user token for each. That token is what we provide in our calls to the API to identify the user whose tweet data we want. So our rate-limiting solution must work independently of user sessions. We can’t, in other words, simply refer to a lastGet timestamp stored in a session.

This is a perfect use case for a key-value store like Redis. To implement a simple one-request-per-minute guard, we can use an id or username for each Twitter user as the key, and store a timestamp of the last call to the Twitter API on behalf of that user as the value. Whenever we’re about to make a call to the Twitter API, we’ll lookup the timestamp for the user and check if it’s more than 60 seconds old.

Actually, we don’t even need to store the timestamp, because Redis allows us to set an expiration time for a key. If we set our key to expire after 60 seconds, the existence of the key itself will tell us whether or not to allow the call to Twitter. If a given key exists in the Redis database, there must have been a call made to Twitter for that user in the last 60 seconds, so we don’t allow another. (Of course, if we want to be able to tell the user how much longer they have to wait, then we would want to store the timestamp.) The advantage of using expiring keys here is that it keeps our Redis database small. We’ll only have as many keys as we have users who’ve called the Twitter API in the last minute. This should make our queries faster than if we kept a key for every user who’d ever called Twitter.

Here’s what the code would look like on an Express server:

var redis = require('redis');
var config = {
  port: 6379, // Redis' default
  host: "",
  dbNum: 1

var client = redis.createClient(config.port,;, function() { /* ... */ }); // If we're using Redis
// to store sessions in database 0, then it makes sense to keep track of Twitter
// calls separately in database 1.

var setRateLimiter;
exports.setRateLimiter = setRateLimiter = function(userId, callback) {
  // Creates a key which expires in 60 seconds containing the current time.

  client.setex(userId, 60, new Date().getTime().toString(), function(err, reply) {
    if (err) {
    } else {

exports.checkRateLimiting = function(userId, callback) {
  client.get(userId, function(err, reply) {
    var lastGet = parseInt(reply, 10); // reply is a string or null
    if (err) {
    } else if (lastGet) {
      var timeSinceLastGet = new Date().getTime() - lastGet;
      callback('Please try again in ' +
        Math.ceil((60000 - timeSinceLastGet) / 1000) +
        ' seconds.');
    } else {
      // Proceed with call to Twitter after setting a new rate limiter.
      // Note that we want to set this rate limiter BEFORE we make the call to Twitter
      // as well as AFTER, so that user cannot trigger >15 calls to Twitter API
      // in period before response to first call is received.

      setRateLimiter(userId, callback);

If we wanted to allow up to 15 requests at any interval within a 15-minute period, rather than just one per minute, we would store an array of the timestamps of (up to) the last 15 requests, setting the key expiration to 15 minutes after each update of the array.

Introducing wynno, continued

(UPDATE: wynno has been discontinued and is no longer active at This post remains as a reference.)

Filters are simple, effective, and the essence of wynno. But there is more complexity to wynno that you can take advantage of if you want to:

  1. wynno can suggest filters for you to adopt.
  2. wynno can even create a customized filter just for you that will predict which tweets you won’t like and mute them. We call this automatic wynnoing.

wynno can do these things once you give it a bit of training, to recognize the tweets you like and dislike. wynno places vote buttons next to tweets that aren’t handled by any of your filters.

Yea Nay

When you vote on tweets, you’re telling wynno, “I like tweets like this and want to see this kind of tweet in the future,” or the opposite. It’s wynno’s job to figure out what you mean exactly by ‘like this’ and ‘this kind of tweet’. Because wynno needs a fair amount of data to figure that out, filter suggestions are currently offered after every 100 votes.

Automatic wynnoing requires yet more data to work well; 200 votes are necessary before it can be activated. We’ll discuss automatic wynnoing in more depth in another post. For now, suffice it to say that the algorithm behind automatic wynnoing is conservative: we recognize that for most users, it’s more important not to mute any tweets that you do like than to find and mute all the tweets that you don’t like.

One last thing, about privacy: your activity on wynno is between you and wynno. The filters you create, your votes on tweets—none of these is visible to other users. wynno’s goal is to improve the experience of consuming your Twitter timeline, by reducing the amount of time spent reading tweets that aren’t valuable to you. It helps you turn the cacophony of chirping birds into something more like a symphony. It is not a social network and does not create another public, online presence for you to manage. If you have thoughts on that, or any other aspect of wynno, we’d love to hear from you:


Introducing wynno

(UPDATE: wynno has been discontinued and is no longer active at This post remains as a reference.)

wynno is a tool for seeing the tweets in your Twitter timeline that interest you, and skipping the rest.

The more people you follow on Twitter, the more you probably find yourself sifting through tweets that don’t interest you to find the ones that do. That’s the problem that wynno wants to solve.

It’s not a problem that can be solved by unfollowing people. Unfollowing someone is too crude a solution if you like some of a person’s tweets, and don’t want to miss out on those.

wynno enables a finer approach: you can apply specific filters to your Twitter timeline. Filters are simply rules you make about the kinds of tweets you want to hear or mute (a tweet is, after all, a bird noise, so we thought the language of hearing made sense).

Jean-Fran├žois Millet - Le vanneur

Maybe you always want to hear tweets that contain a particular hashtag. Maybe for your work it’s important to keep up with breaking news, and so you always want to hear tweets that contain a keyword like ‘BREAKING’. Or maybe you couldn’t care less about breaking news and want to mute those tweets. Maybe you never want to see Foursquare updates or links to Facebook from that former coworker of yours, whom you don’t really know anyway but nevertheless don’t want to unfollow because that would sow seeds of awkwardness. Maybe you always want to hear tweets by that erudite friend of yours whenever she tweets quotations. But maybe you’ve seen enough baby pictures from her at this point.

You can make filters for all of these. All tweets caught by a ‘hear’ filter or which aren’t caught by any filter are put in a timeline called Good Stuff. This is the timeline you see first when you load wynno. All tweets caught by a ‘mute’ filter get moved to another timeline called The Rest. You can look at The Rest as much as you want to, or not at all.

That’s the simple innovation of wynno: splitting your Twitter timeline in two, according to the rules you make.

We’d love for you to send feedback to Thanks for wynnoing!

Ian Hinsdale

Communicating between node.js and Python

Can a node.js server call Python functions? Yes, yes it can.

In a recent project, I needed to do some natural language processing. Since I was running a node server, one solution would have been to use the natural node module and do the computation entirely in Javascript. Natural looks promising, but I preferred to use the more established Natural Language Toolkit (nltk) Python library, which has some excellent, free documentation in the form of Natural Language Processing with Python.

How can node and Python talk to each other? The answer depends on whether the node and Python processes are being run on the same machine or separate machines.

For node and Python on the same computer, there are two solutions: (1) node could talk to Python via Unix domain sockets, or (2) node could call the Python functions as child processes.

But running node and Python on the same machine isn’t a scalable solution: as the natural language processing in Python becomes more and more processor- and memory-intensive, those resources would be diverted away from serving web requests. The node server’s response time would suffer and users would go elsewhere.

The scalable solution is to have a separate machine dedicated to the Python work. The node server can talk to this machine via TCP sockets. Rather than work with sockets directly and deal with problems like what happens when a packet gets dropped, I chose to use a messaging framework which takes care of that messy work. ZeroMQ is a messaging framework which has been described as “sockets on steroids”. There is some debate over just how much of an upgrade ZeroMQ is over sockets. I have nothing to add to that debate, but it seems generally agreed that there is an advantage in the robustness of its queuing for a variety of messaging patterns.

So ZeroMQ is a framework for exchanging messages, with implementations for 40+ languages including Javascript and Python. But my goal was not just to exchange messages between my node server and a Python listener, but for the node server to actually invoke Python code and get the result back. Fortunately there is ZeroRPC: a library for node.js and Python built on top of ZeroMQ that enables remote procedure calls (RPC).

Using ZeroRPC is a breeze. Following the steps in their Hello, World example, we can see that on the Python side of things, all we have to do is run a Python script which creates a ZeroRPC server object out of a Python class (in this case the HelloRPC class), and set that server to listen on port 4242 for incoming connections from any IP.

import zerorpc

class HelloRPC(object):
    def hello(self, name):
        return "Hello, %s" % name

s = zerorpc.Server(HelloRPC())

On our node server, we create a ZeroRPC client object which connects to the listening Python server and can invoke any of the functions in the class with which the server was created. In this case, we invoke the hello function and pass along the "World!" string as its argument.

var zerorpc = require("zerorpc");

var client = new zerorpc.Client();

client.invoke("hello", "World!", function(error, res, more) {

Hello, World! gets returned to our node server.

In this example, the Python server was created on the same machine as the node server, so we connected to it at localhost ( If we were running Python on a different machine, as I was doing in my caracol project, communicating with that machine is as simple as replacing that address.

And there you have it: node.js and Python communicating in a service-oriented architecture. Brilliant!

A final thing to note: ZeroRPC depends on libevent and ZeroMQ, so you’ll need to install those before installing ZeroRPC. Here is a shell script for Ubuntu 12.04 LTS which installs everything you need. For Mac OS X, you can install the system dependencies using Homebrew.


# Simple script for installing ZeroRPC on Ubuntu 12.04 LTS

# System dependencies

# First install ZeroMQ
sudo apt-get install libzmq-dev

# Next install libevent, an event notification library required by zerorpc
sudo apt-get install libevent

# Python dependencies

# Now install pyzmq: Python bindings for ZeroMQ
# If you don't already have pip installed:
sudo apt-get install python-setuptools
sudo apt-get install python-pip
sudo pip install pyzmq

# Now we can install ZeroRPC
sudo pip install zerorpc

# Node.js dependencies

# Just install the ZeroRPC node module
sudo npm install -g zerorpc