Hiding the details and building a better API
One of the things I really like is code evolution: seeing how code evolves over time and how nice solutions get built along the way.
At Plurk we have some scripts that aggregate data and do some analysis. These kinds of scripts must not put too much overhead on the database server, which means that the selects must be inexpensive.
Running the following query:
SELECT * FROM users;
will be expensive, at least with MySQL and MySQLdb. To speed it up, one needs to use limits and offsets:
SELECT * FROM users LIMIT 0, 10;
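As a rough sketch of what that chunked fetching looks like with plain MySQLdb (the connection details and the fixed page size here are made-up placeholders, not our actual setup):

import MySQLdb

# Made-up connection values, just for illustration
conn = MySQLdb.connect(host='localhost', user='app', passwd='secret', db='plurk')
cursor = conn.cursor()

offset = 0
limit = 10
while True:
    # Fetch one small page at a time instead of the whole table
    cursor.execute('SELECT * FROM users LIMIT %s, %s', (offset, limit))
    rows = cursor.fetchall()
    if not rows:
        break
    for row in rows:
        print row
    offset += limit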
But once you start using limits and offsets the code gets ugly, and a typical solution looks like this:
def _selectUsers(offset=0, limit=350):
    if offset:
        return db().select('users', limit=limit, where=['id > %s', offset])
    return db().select('users', limit=limit)

offset = None
while True:
    users = _selectUsers(offset)
    if len(users) == 0:
        break
    for user in users:
        offset = user.id
        print user
And if you do this often, this boilerplate code makes things really ugly. A first improvement can be made by extracting things into functions:
def _selectUsers(offset=0, limit=350):
    if offset:
        return db().select('users', limit=limit, where=['id > %s', offset])
    return db().select('users', limit=limit)

def selectUsers():
    offset = None
    while True:
        users = _selectUsers(offset)
        if len(users) == 0:
            break
        for user in users:
            offset = user.id
            yield user
    return

for user in selectUsers():
    print user
With this approach you can then build different functions that aggregate different tables / databases, but still, it's tiresome to write new aggregators.
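To make the pain concrete: aggregating the visits table in the same style would mean copying the same pair of functions almost verbatim (a sketch, not our actual code):

# Copy-pasted aggregator for the visits table -- nearly identical
# to selectUsers() above, which is exactly what gets tiresome.
def _selectVisits(offset=0, limit=350):
    if offset:
        return db().select('visits', limit=limit, where=['id > %s', offset])
    return db().select('visits', limit=limit)

def selectVisits():
    offset = None
    while True:
        visits = _selectVisits(offset)
        if len(visits) == 0:
            break
        for visit in visits:
            offset = visit.id
            yield visit
    return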
So the first evolution is going from copy-paste to aggregators that can be reused. The next step is solving the problem generally, so that new aggregators can be created with minimal code.
The generic and final approach looks like this:
def aggregate(table, aggregate_col='id', limit=350, **kw):
    offset = None
    kw['limit'] = limit

    while True:
        # After the first batch, only fetch rows past the last seen id
        if offset:
            kw['where'] = ['%s > %s' % (aggregate_col, '%s'), offset]

        rows = db().select(table, **kw)
        if len(rows) == 0:
            break

        for row in rows:
            offset = row[aggregate_col]
            yield row

    return
Not only does this save time when creating new aggregators, the generic approach also makes it trivial to aggregate over different columns and add extra select conditions, for example:
# Aggregate over users table
for user in aggregate('users'):
    print user

# Aggregate over visits table
for user in aggregate('visits'):
    print user

# Aggregate over all females
for user in aggregate('users', gender=0):
    print user
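If you want to see the batching in action without touching a real database, here is a throwaway sketch that stubs out db(); the stub, its data, and its select() signature are invented for illustration and are not our real db layer:

# Fake db layer, just enough to exercise aggregate() -- it ignores the
# table name and any extra filter kwargs.
_USERS = [{'id': i, 'gender': i % 2} for i in range(1, 1001)]

class _FakeDb(object):
    def select(self, table, limit=350, where=None, **kw):
        rows = _USERS
        if where:
            # aggregate() passes where as ['id > %s', last_seen_id]
            rows = [r for r in rows if r['id'] > where[1]]
        return rows[:limit]

def db():
    return _FakeDb()

count = 0
for row in aggregate('users'):
    count += 1
print count  # 1000 rows, fetched in batches of at most 350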
So the general lesson here is that one should aim to build generic code, but probably only once it's actually needed.