RecSys 2013

November 6th, 2013 1 comment

AcmRecsysIt has been 3 years since I attended my last RecSys and last week I went again to that conference in Hong Kong. I came back home with lots of useful insights although to get those gems I had to go through lots of boring to death presentations. Hey guys, presenting to an audience doesn’t mean cut and pasting from your latex article directly into your beamer presentation. You have to think about how to communicate to an audience. And don’t tell me it’s about algorithms and complex mathematics, so it was Harald Steck’s from Netflix and it was very pedagogical. Next year make sure you think in your audience before reporting your research.

This is my personal and biased review of the conference. It’s mainly what I found interesting and what it could be easily deployed at my current employer, Softonic.

On Saturday we had an excellent tutorial about Learning To Rank for Recommender Systems by Alex, Linas and Shi. Here I’m completely biased because I’ve worked with them and I know how good they are, but this tutorial was really interesting and full of stuff. I recommend it for a very good introduction to this complex topic.

calle-hong-kong-smallFor me the most interesting session of the conference was the workshop on Large Scale Recommenders. There was an interesting presentation about how to use Multi-armed bandits for recommending videos in a popular Dutch multimedia site. This talk was full of practical information about the technological stack for deploying bandits at scale. Some guys from eBay where presenting how do they recommend items in a domain where users are open to sell whatever they want and describe it free form. They were creating fine grained clusters based mainly on the search queries. The idea was to create semantically consistent cluster of items. They identified different use cases pre-purchase and post-purchase and it seemed that this cluster solution was working very well. For us in the app domain, this also makes sense, when browsing you are interested in similar applications but after a download you are interested in complementary apps not apps in the same niche. Aapo Kyrölä gave two excellent talks about GraphChi and about Personalized PageRank for Big Data. Read them. Funny comparisons about how much it took for his laptop against giant Hadoop clusters. Main take aways, “Before doing anything, think twice” and “Don’t follow the hype, even GraphChi’s hype”.

Although I was in different sessions there were other presentations during the weekend that generated Twitter action, like the keynote in the News Recommender Systems workshop and the tutorial on people recommendations from Linkedin.

From the Monday sessions I liked two papers trying to introduce diversity in the recommendations. The first Trading-off Among Accuracy, Similarity, Diversity, and Long-tail: A Graph-based Recommendation Approach from Baidu where you can model your costs for multi targeted objectives and a random walk in the graph provides the recommendations. The second paper was Nonlinear Latent Factorization by Embedding Multiple User Interests from Google. Here the user is described as a set of latent vectors each one describing one of the user’s interests.

On Tuesday, the Microsoft’s paper, Xbox Movies Recommendations: Variational Bayes Matrix Factorization with Embedded Feature Selection from Noam Koenigstein was very interesting and I think we could use a similar approach for recommending software at Softonic. Here a probabilistic graphical model was proposed for an scenario of binary implicit feedback enriched with metadata in the form of tags. From this same author I discovered in the Large Scale Recommenders workshop the work he has been doing on using Matrix Factorization results in real time. This paper was the perfect companion to the poster he presented about Item-Oriented recommendations. In the afternoon there was the paper Rating Support Interfaces to Improve User Experience and Recommender Accuracy that focused on improving UI for recommenders for helping in eliciting true preferences. Very interesting.  Also of interest were Pairwise learning in recommendations of communities from Linkedin and Which App Will You Use Next that proposed a recommender system that predicts which app has higher probability of being used next based on your past behavior. It would be very interesting to pilot a widget for Moba.

The last day of the conference, had the Industry Session with lots of gems, like Huawei App Store using Deep Neural Networks for recommendations so they don’t spend time doing feature engineering. Alibaba told that Diversity and Complementary are worse for CTR but better for conversion, so watch your KPIs. Tencent showed all the metadata for users and items they use and the distributed hybrid algorithm they use to deliver recommendations. For us in the western world, used to the Facebooks and Googles, seeing the technology the Chinese companies are using/developing was a big surprise.

neon_hong_kong-smallThe best paper award was A Fast Parallel SGD for Matrix Factorization in Shared Memory Systems, hardcore algorithmic approach but far from my day to day routine. I’ve mentioned before the  paper about Rating-prediction and Ranking from Netflix that addresses the selection bias (you are more likely to rate highly/low valued items but nothing in between) as a related sub-problem. By modeling the decision to rate you gain insight on all the unknown ratings.

The last session about Scalability was the most interesting with all the papers in my ToRead list. Using maximum coverage to optimize recommendation systems used well known algorithms of combinatorial optimization in order to select the set of items with maximum probability of conversion. In Efficient top-n recommendation for very large scale binary rated datasets they presented a large scale matrix factorization algorithm for implicit datasets. This guys presented the extension to the algorithm that won the Million Song contest in Kaggle so this paper must be read carefully. In Distributed Matrix Factorization with MapReduce using a series of Broadcast-Joins it was exposed a distributed Alternating Least Squares (ALS) approach to Matrix Factorization that is going to be included in Mahout. They also taught me about a toolkit for generating synthetic data, Myriad,  something I have to research more carefully.

And that was it!! I had a very good time and glad to meet lot of people I admire in person. Hopefully see you next year!!

Introduction to the Minhash algorithm

March 11th, 2013 4 comments

Say that you have a collection of N objects and you want to want to compare each object against each other. Maybe you are interested in the distance between objects in an Euclidean Space or just the number of attributes or features they share. If you need to perform each comparison, you’ll need to perform \(O(N^2)\) and that’s a problem when your N is in the order of the millions or tens of millions, likely the order of your user or item company’s database. Is there a way to perform less comparisons when we aren’t interested in the whole set of comparison but just in finding the set of closest items under a given measure? Yes, the Minhash Algorithm.

In this post we are going to be focused in the Jackard Coefficient for determining the closeness between two objects. Each of the N object will have a set of m features where usually is very large. You can think of the objects in N as the members of a social network or Amazon customers, and the set of features that describe them all the friends they have in the social network or all the books they previously purchased in the second case. Comparing two users would imply comparing their feature sets using the Jackard Coefficient formula:

$$JackardCoeff(A,B)=\frac{|A\cap B|}{|A\cup B|}$$

Understanding the Minhash Algorithm

All the MinHash functionality relies on this mathematical property

$$P(\min\pi(A)=\min\pi(B)) = \frac{|A\cap B|}{|A\cup B|}$$

Where \(\pi\) is a function that creates a permutation of its set argument. This reads as follow, Given a random function over the orderings of the elements of a universe U to which A and B belongs, the probability that the minimum element on both sets coincides is equal to the Jackard Coefficient defined as the number of elements in the intersection of the sets over the number of elements in the union.

What does this mean? Let’s break down the constituent parts. The function \(\pi\) is a function that transforms the natural ordering of a set into a new ordering. There are lots of different functions that create different orders for a given set. Let’s see this with an example.

Say that our set is the set of vowels in the alphabet \(V={a,e,i,o,u}\). A natural ordering of the elements would be \(a<e<i<o<u\). Another different ordering could be a function \(\pi_X\) that given the set of vowels V generates the following ordering \(o<e<i<u<a\). For convenience and since the ordering is a Total Order, we can map the elements of the set to the natural numbers taking the position of the element in the ordering. For the natural ordering of the vowels, \(\pi_{NAT}\), we have that

a 1
e 2
i 3
o 4
u 5

and with the other ordering \(\pi_X\) the vowels map to

a 5
e 2
i 3
o 1
u 4

Thus given an ordering, if we apply it to a subset of the universe of objects we get their positions in the new ordering. For instance:

So going back to the original formula, the probability that the minimal value of the transformation of each set, say \(A=\{e,i,o\}\) and \(B=\{a,e\}\), according to a new orderding, \(\pi_X\), is equal to the Jackard Coefficient. In this case we can compute the Jackard Coefficient exactly and it is:
$$\frac{|\{e,i,o\}\cap\{a,e\}|}{|\{e,i,o\}\cup\{a,e\}|} =\frac{|\{e\}|}{|\{a,e,i,o\}|} = \frac{1}{4} $$

This number means that whenever we pick a transformation \(\pi\) from the set of all possible transformations \(\Pi\) the probability that the smaller number is the same in both sets is equal to 0.25. Think that we are talking here about a probability and that it is not a rule. If the permutation choosen at random was the previously defined function \(\pi_X\) then we would have, \(\min\pi_X(A)=\min\{2,3,1\}=1\) that is not the same value as \(\min\pi_X(B)=\min\{5,2\}=2\).

Hashing functions

Defining a random ordering for all the elements of a universe is costly. We have to maintain in memory the tables we saw for the vowels and that can not be practical when the universe of objects is large. So instead of maintaining the explicit random permutation we use the implicit permutation that hashing functions with good properties give us. The good properties we need is that the probability of collision in the universe is low because a random permutation is a 1 to 1 relationship between elements and positions and thus collisions in the hashing function would break this property.

From now on, I’m going to assume that every universe we need to work with is smaller than the range of integers and thus restrict all the derivations to integer since every set and ordering could be mapped to the natural integer ordering (\([0..2^{32}]\)).

Possible hashing functions for integers could be:

  • Use the FNV-1 hashing function of the string representation of your integer. This has the advantage that naturally deals with hashing sets of integers by means of concatenating their string  representations.
  • Use a pseudo-random number generator by picking a large prime number, say \(p=2147483647\) and a set of coefficients \(\{a_i,b_i\}\) such our hashing functions are defined as \(\pi_i = (a_i x + b_i) \%  p \)

In this post we will use the second kind of functions expresed in this Haskell code:

p :: Int

coefs :: [Int]
coefs = unfoldr (Just . randomR (0,p)) $ mkStdGen 0

hashFuncs :: [Int->Int]
hashFuncs = go coefs
    go theCoefs = let (a:b:[],rest) = splitAt 2 theCoefs
                  in (\x -> (a*x+b) `mod` p) : go rest

So, given a hashing function, which is the minimum hash over a given set? This code gives you that result

minHash :: (Int->Int) -> [Int] -> Int
minHash f = foldl' hashAndMin maxBound
    hashAndMin a b = min a (f b)


OK, now you want to compare two sets, A and B to see how similar they are. The similarity measure here will be the number of items they both share, the Jackard coefficient. The time complexity of the set intersection is \(O(n+m)\) if you first build a Hash Map and then query it with the second set for existence. If both sets are already sorted you can compare them in \(O(n)\) \(O(m+n\log n+ m\log m)\).

Can you compare them faster? Yes, but only if you want to trade accuracy. The property in the probability formula at the beginning of the post states that with probability the Jackard coefficient the min hashes of the two sets will be equal. That means that if we take r different hashing functions (permutations) and we name s the Jackard coefficient between the tow sets, the probability that all k hashes of each set are equal is \(s^r\). Since this is a collection of Bernoulli trials, the distribution of having k min hashes equal in both sets follows a Binomial distribution with density probability function equal to:

$$f(k;r,s)= {r \choose k}s^k(1-s)^{r-k}$$

It is known that a random variable distributed according to the binomial distribution, \(X \sim B(r,s)\) has \(E[X]=rs\), so a good estimator of \(s\) would be to divide the number of matches by the number of hashing functions. This would give you the estimated value of the Jackard coefficient. But before comparing two fingerprints we have to compute them:

shingle :: Int -> [Int] -> [Int]
shingle c nums = [minHash f nums | f <- take c hashFuncs]

The shingle function gives us a set of c fingerprints of our original big set of numbers.

So let’s see it in action. Imagine that we want to compute the Jackard coefficient between the bag of letters in the phrase Humpty Dumpty sat on a wall and Humpty Dumpty had a great fall. First let’s compute the Jackard coefficient according to the definition

import Data.List
import System.Random
import Data.Char
import Data.List

phrase1="Humpty Dumpty sat on a wall"
phrase2="Humpty Dumpty had a great fall"
mkSet = nub . filter isAlpha . map toLower
-- Set 1 is the collection of letters in the first phrase
-- "humptydsaonw"
set1 = map ord $ mkSet phrase1 -- "humptydsaonwl"
-- Set 2 is the collection of letters in the second phrase
-- "humptydagrefl"
set2 = map ord $ mkSet phrase2 -- "humptydagrefl"
jackard = a/b
    a = fromIntegral $  length (set1 `intersect` set2) ::Double
    b = fromIntegral $  length (set1 `union` set2) ::Double
p :: Int

coefs :: [Int]
coefs = unfoldr (Just . randomR (0,p)) $ mkStdGen 0

hashFuncs :: [Int->Int]
hashFuncs = go coefs
    go theCoefs = let (a:b:[],rest) = splitAt 2 theCoefs
                  in (\x -> (a*x+b) `mod` p) : go rest

minHash :: (Int->Int) -> [Int] -> Int
minHash f = foldl' hashAndMin maxBound
    hashAndMin a b = min a (f b)

shingle :: Int -> [Int] -> [Int]
shingle c nums = [minHash f nums | f <- take c hashFuncs]

shingles1 = map (\x -> shingle x set1) [1..]
shingles2 = map (\x -> shingle x set2) [1..]

jackard2 = map approxJackard shingles
    shingles = zip shingles1 shingles2
    approxJackard :: ([Int],[Int]) -> Double
    approxJackard (as,bs) = let pairs = zip as bs
                                matches = filter (\(a,b) -> a == b) pairs
                                num = fromIntegral $length matches :: Double
                                den = fromIntegral $ length as :: Double
                            in num/den

If we plot jackard2 we have an approximation based on the number of shingles:
Approximating Jackard Coefficient


OK, let me summarize a little bit.

  1. You came here because you have a problem of comparing and object against N different objects and N is very large. 
  2. The comparison you want to perform is by computing intersections between the attributes of objects and compute how many are equal.
  3. You want a cheap procedure for computing 2)

So the solution is to transform each object into a set of different values, and then perform the Jackard coefficient on the reduced set of values. For instance, each object could be the set of friends in a social network, and the value c could be for instance just 10. Here we reduce the computation of the intersection from the hundreds to just 10 values. Your friends got summarized in a set of 10 values.

Can we do better? Yes. Here we reduced the number of comparison from the cardinality of the number of features of the object to the small number c. But we still have to perform N comparisons against the N individuals in the population. Wouldn’t it be nice to just pick the individuals we think are more likely to have a high Jackard Coefficient? That’s possible with the use of bands but since this post is already very long, I’ll leave that for another post.


  1. Minwise independent permutations The paper with the math that demonstrates the link between the Jackard Coefficient and the probability.
  2. Chapter 3 of the excellent book “Minning massive datasets”
Categories: Data Mining Tags: ,

Training Gradient Boosting Trees with Python

November 5th, 2012 7 comments

I’ve been doing some data mining lately and specially looking into Gradient Boosting Trees since it is claimed that this is one of the techniques with best performance out of the box. In order to have a better understanding of the technique I’ve reproduced the example of section 10.14.1 California Housing in the book The Elements of Statistical Learning. Each point of this dataset represents the house value of a property with some attributes of that house. You can get the data and the description of those attributes from here.

I know that the whole exercise here can be easily done with the R package gbm but I wanted to do the exercise using Python. Since learning several languages well enough is difficult and time consuming I would prefer to stick all my data analysis to Python instead doing it in R, even with R being superior on some cases. But having only one language for doing all your scripting, systems programming and prototyping PLUS your data analysis is a good reason for me. Your upfront effort of learning the language, setting up your tools and editors, etc. is done only once instead of twice.

Data Preparation and Model Fitting

The first thing to do is to load the data into a Pandas dataframe

import numpy as np
import pandas as pd

columnNames = ['HouseVal','MedInc','HouseAge','AveRooms',

df = pd.read_csv('cadata.txt',skiprows=27, sep='\s+',names=columnNames)

Now we have to split the datasets into training and validation. The training data will be used to generate the trees that will constitute the final averaged model.

import random

X = df[df.columns - ['HouseVal']]
Y = df['HouseVal']
rows = random.sample(df.index, int(len(df)*.80))
x_train, y_train = X.ix[rows],Y.ix[rows]
x_test,y_test  = X.drop(rows),Y.drop(rows)

We then fit a Gradient Tree Boosting model to the data using the scikit-learn package. We will use 500 trees with each tree having a depth of 6 levels. In order to get results similar to those in the book we also use the Huber loss function.

from sklearn.metrics import mean_squared_error,r2_score
from sklearn.ensemble import GradientBoostingRegressor
params = {'n_estimators': 500, 'max_depth': 6,
        'learn_rate': 0.1, 'loss': 'huber','alpha':0.95}
clf = GradientBoostingRegressor(**params).fit(x_train, y_train)

For me, the Mean Squared Error wasn’t much informative and used instead the R2 coefficient of determination. This measure is a number indicating how well a variable is able to predict the other. Values close to 0 means poor prediction and values close to 1 means perfect prediction. In the book, they claim a 0.84 against a 0.86 reported in the paper that created the dataset using a highly tuned algorithm. I’m getting a good 0.83 without much tunning of the parameters so it’s a good out of the box technique.

mse = mean_squared_error(y_test, clf.predict(x_test))
r2 = r2_score(y_test, clf.predict(x_test))

print("MSE: %.4f" % mse)
print("R2: %.4f" % r2)

Data Analysis

Let’s plot how does the training and testing error behave

import matplotlib.pyplot as plt

# compute test set deviance
test_score = np.zeros((params['n_estimators'],), dtype=np.float64)

for i, y_pred in enumerate(clf.staged_decision_function(x_test)):
    test_score[i] = clf.loss_(y_test, y_pred)

plt.figure(figsize=(12, 6))
plt.subplot(1, 1, 1)
plt.plot(np.arange(params['n_estimators']) + 1, clf.train_score_, 'b-',
                label='Training Set Deviance')
plt.plot(np.arange(params['n_estimators']) + 1, test_score, 'r-',
                label='Test Set Deviance')
plt.legend(loc='upper right')
plt.xlabel('Boosting Iterations')

As you can see in the previous graph, although the train error keeps going down as we add more trees to our model, the test error remains more or less constant and doesn’t incur in overfitting. This is mainly due to the shrinkage parameter and one of the good features of this algorithm.

When doing data mining as important as finding a good model is being able to interpret it, because based on that analysis and interpretation preemptive actions can be performed. Although base trees are easily interpretable when you are adding several of those trees interpretation is more difficult. You usually rely on some measures of the predictive power of each feature. Let’s plot feature importance in predicting the House Value.

feature_importance = clf.feature_importances_
# make importances relative to max importance
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5
plt.figure(figsize=(12, 6))
plt.subplot(1, 1, 1)
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, X.columns[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Variable Importance')

Once variable importance has been identified we could try to investigate how those variables interact between them. For instance, we can plot the dependence of the target variable with another variable has been averaged over the values of the other variables not being taken into consideration. Some variables present a clear monotonic dependence with the target value, while others seem not very related to the target variable even when they ranked high in the previous plot. This could be signaling an interaction between variables that could be further studied.

from sklearn.ensemble.partial_dependence import plot_partial_dependence

fig, axs = plot_partial_dependence(clf, x_train,

Partial Dependence
The last step performed was to explore the capabilities of the Python libraries when plotting data in a map. Here we are plotting the predicted House Value in California using Latitude and Longitude as the axis for plotting this data in the map.

from mpl_toolkits.basemap import Basemap
predDf = pd.DataFrame(x_test.copy())
predDf['y_pred'] = clf.predict(x_test)

def california_map(ax=None, lllat=31.5,urlat=42.5,
    m = Basemap(ax=ax, projection='stere',
                lon_0=(urlon + lllon) / 2,
                lat_0=(urlat + lllat) / 2,
                llcrnrlat=lllat, urcrnrlat=urlat,
                llcrnrlon=lllon, urcrnrlon=urlon,
    return m

m= california_map()
predDf = predDf.sort('y_pred') # Useful for plotting
x,y = m(predDf['Longitud'],predDf['Latitude'])
serieA = (np.array(predDf['y_pred']) - predDf['y_pred'].min())/(predDf['y_pred'].max()-predDf['y_pred'].min())
# z =
z = np.array(predDf['y_pred'])/1000
c = m.colorbar(location='right')
c.set_label("House Value (Thousands of $)")


This blog post was written using Pylit as the tool for doing Literate Programming. The code can be seen here. I think that literate programming is way superior to other software methodologies like TDD when coding algorithms for data analysis. The main problem I find right now is the lack of proper tooling for really taking advantage of literate programming, but this is a technique that I’m definitely going to research deeper.

Categories: Data Mining Tags: , , ,

Thread pool in C++

May 23rd, 2012 No comments

I needed to convert user ids spread across a lot of files into a fixed range [0..N] where N was the total number of Ids in the dataset. First I though that since those files came from a Hadoop cluster I should write a MR job to do the task. But since recoding ids needs a “central authority” giving unique ids without collision, MapReduce wasn’t an option because MR thinks about each record as independent of the rest of records, so coordinating the assignment of ids is both difficult and unnatural in MapReduce. The naïve approach is to create a counter, loop through all the ids and whenever an id is not in the dictionary, use the counter as the new translation. See the pseudocode

int counter = 0;
for id in ids:
    if id not in dict:
        dict[id] = counter
    print dict[i]

But then you are wasting precious cores that could help you. My machine has eight cores, so for a task that runs in aprox 60 minutes, so after investing time in going beyond the naïve approach, I’m able to lower it to 10 minutes. That means that I can run tests 6 times faster, it will pay off.

Lookup table

So the first thing to do is to create a thread safe Hash Map. Most of the time the access will be for reading a value and less frequently for updating the data structure (in my problem I perform 250 reads for each write) so this scenario is ideal for a Readers-writer lock. I use the Boost Thread library with its boost::shared_mutex for getting the multiple access functionality. The class is something like this

using namespace boost;
class LookupTable {
        typedef std::unordered_map<int,unsigned int> HashMap;
        HashMap dict;
        unsigned int counter;
        shared_mutex m_mutex; 
        unsigned int translate(int number){
            boost::shared_lock<boost::shared_mutex> lck(m_mutex);
            unsigned int result;
            HashMap::iterator elem = dict.find(key);
            if( elem == dict.end()){
               boost::upgrade_lock<boost::shared_mutex> uLck(m_mutex);
               boost::upgrade_to_unique_lock<boost::shared_mutex> uuLck(uLck);
               dict[key] = counter;
               result = counter;
            } else {
               result = elem->second;
            return result;

Thread pool

Once we have the thread safe data structure in place, we want to create a thread pool were we can send tasks. The thread pool will be responsible to assign each task to the next free thread. The reason for having a thread pool instead of spawning as many threads as tasks is two fold, first, the amount of work I can do is bounded by the speed at which I’m able to read from disk, so throwing more threads doesn’t seem to help here. Second, since all the threads must query the lookup table, if there are lots of them the synchronization (mutex locking and unlocking) could become heavier than the work per thread becoming a bottleneck.

Boost provides a nice thread pool by using the Boost::Asio library. I came to this pattern of usage by reading this and this but it happens that they are wrong in a subtle detail. As they are written, they only run one task per thread and then the io_service is stopped. After scratching my head for a couple of hours I reread the official documentation and the solution is explained at them botom of the page. So the key issue is to destroy explicitly the work variable that we created for the io_service to not end too early. To accomplish that just embed the work in a smart pointer std::auto_ptr and reset it when necessary, the reset will call the work destructor.

So the main program would be something like this

// Thread pool
asio::io_service io_service;
boost::thread_group threads;
auto_ptr<asio::io_service::work> work(new asio::io_service::work(io_service)); 

// Spawn enough worker threads
int cores_number = boost::thread::hardware_concurrency(); 
for (std::size_t i = 0; i < cores_number; ++i){
    threads.create_thread(boost::bind(&asio::io_service::run, &io_service));
// Post the tasks to the io_service
for(vector<string>::iterator it=filenames.begin();it!=filenames.end();it++){

and the code for the worker (sketched)

struct translator {
    translator(string filename, LookupTable& dict)
        : m_filename(filename),m_dict(dict)
    void operator()(){
Categories: Programming Tags: , ,

Parser combinators in Scala

February 23rd, 2012 1 comment

Lately, I’ve been writing some matrix factorization code for a video recommender system. I’m a big fan of Test Driven Development but it is a technique that is difficult for me to apply to the numerical or algorithmic domain. In order to make sure that the code I write is correct what I’m doing is the Oracle Test Pattern (I’m sure there is a correct name for this but I’m not able to find it right now). The idea is to have a reference implementation that solves the exact problem you are trying to solve. In my case, that reference implementation does not exists but I’m writing a straightforward unoptimized version of the script in Python using the Numpy libraries. That will be my ground truth when comparing results with the highly optimized parallel Scala version of the algorithm.

The problem

So the problem is that I’m working interactively coding the algorithm in an iPython session and I’m getting results of this kind

In [4]: x
matrix([[ 0.03893565, -0.35836827, -2.06492572,  1.49773613, -1.01988835,
         -0.20590096, -0.19658741],
        [ 0.43055155,  1.07532444,  0.89299596, -1.070371  , -0.24015718,
          0.04521229, -1.39209522],
        [-0.4482701 ,  0.15201451, -1.42824771,  1.13859559,  0.66432642,
          0.51184435,  0.52637519],
        [-0.26518471, -1.14331753, -1.15492029, -0.27501194,  1.73750282,
         -1.4118682 ,  0.14701005],
        [-1.6577536 , -0.0781593 , -0.01558478,  0.67277257, -0.07249647,
          0.70946581, -0.82349608]])

and I would like to just copy and paste this string and use it as the expected value in my Scala tests. That would involve to remove “matrix”, all the “[” and “]”, substitute them for their Array() equivalents and put an F at the end of each number denoting that it is a Float. Too much work.

The solution

If there is an area where functional languages shine is in creating DSLs. More specifically creating an internal DSL that looks more like an external DSL. In Scala you can use the Parser Combinator libraries that are part of the Scala’s core libraries. Such parser will be something like

class PyMatrixParser extends JavaTokenParsers {
  def matrix:Parser[Array[Array[Float]]] = "matrix([" ~> repsep(row,",") <~ "])" ^^ (_.toArray)
  def row:Parser[Array[Float]] = "["~> repsep(floatingPointNumber,",") <~ "]" ^^ (

using this parser is then just a matter of

val parser = new PyMatrixParser
val scalaMatrix = parser.parseAll(parser.matrix, theStringMatrix).get

quite good result for just 2 lines of code defining the parser.

This is the quick overview for those not familiar with Scala’s syntax

  • Every String that is found where a Scala Parser was meant to be, is automatically transformed into a constant parser that matches that exact string. So for instance,“matrix([“gets converted into a parser by one of the implicits in the parser combinators libraries.
  • rep(row,”,”) takes two parsers as parameters and means that parser #1 will be repeated 0 or more times interleaved by parser #2
  • The parser combinators “~>” and “<~" denote that the parser pointed by the arrow must be keep as the result of the parsing while the parser pointed by the tail must be discarded. This is helpful for combining two parser where one of them is just ornamental.
  • floatingPointNumber is a parser provided by the library that parses float point string representations
  • Each parser returns either the parsed string or a more complex Scala structure, like for instance, a list of strings in the case of rep(parser1,parser2). Those results are sent to a parser transformator (the ^^ operator) that works on the parser results and generates the true parsing result. In the example, first we create an array of Floats, and then an Array of Arrays of Floats that represent my matrix.

Really cool feature that has spared me a lot of grunt work by just writing two lines of code.

Categories: Programming Tags: ,

Scala project template (SBT 0.11.2)

February 15th, 2012 No comments

Some months ago I wrote a post about setting up a Scala project with SBT. SBT has evolved and the new version differs from the 0.7.7 instructions. There is a Github repository with the updated version for SBT 0.11.2.

  1. Download the sbt jar and create the script as it is explained here
  2. Create the project folder
    mkdir sbt-project-template
    cd sbt-project-template
  3. Create the project definition with the build.sbt file that must reside in the project’s root directory
    name := "TestProject"
    version := "1.0-SNAPSHOT"
    organization := "com.example"
    scalaVersion := "2.9.1"
    libraryDependencies += "org.scalatest" %% "scalatest" % "1.7.1" % "test"

    Blank lines in between are important. Don’t remove them.

  4. Let’s write some classes and tests in order to check that everything is OK. First create the class under test in src/main/scala/Calc.scala
    package com.example
    object Calc {
        def add(x:Int, y:Int) : Int = x + y
  5. Now, create a test in src/test/scala/CaltTest.scala that checks that the class is working as expected
    package com.example.test
    import com.example.Calc
    import org.scalatest.Suite
    class CalcTest extends Suite {
        def testAddition() {
            assert(4 === Calc.add(2,2))
  6. Now just check that tests pass by doing
    sbt clean test

    you should see something like this:

    [info] Loading project definition from /home/cebrian/GIT/sbt-project-template/project
    [info] Set current project to TestProject (in build file:/home/cebrian/GIT/sbt-project-template/)
    [success] Total time: 0 s, completed 14-feb-2012 23:29:30
    [info] Updating {file:/home/cebrian/GIT/sbt-project-template/}default-4639fd...
    [info] Resolving org.scala-lang#scala-library;2.9.1 ...
    [info] Resolving org.scalatest#scalatest_2.9.1;1.7.1 ...
    [info] Done updating.
    [info] Compiling 1 Scala source to /home/cebrian/GIT/sbt-project-template/target/scala-2.9.1/classes...
    [info] Compiling 1 Scala source to /home/cebrian/GIT/sbt-project-template/target/scala-2.9.1/test-classes...
    [info] CalcTest:
    [info] - testAddition
    [info] Passed: : Total 1, Failed 0, Errors 0, Passed 1, Skipped 0
    [success] Total time: 7 s, completed 14-feb-2012 23:29:37

Simpler than before.

Categories: Programming Tags: , ,

CSV parser in C++ with Boost and Template Metaprogramming

December 28th, 2011 No comments

One common situation when analyzing big chunks of data is to parse big CSV files with a record structure on each line. That is, all lines conform to a fixed schema where each line represents a record and each record has several columns of different types. Your objective is to parse and fill a data structure like this

struct foo {
    int a;
    std::string b;
    double c;

for each record. All the information needed for such parsing is in this data structure so I wondered whether I could write a program when you pass that type information and the parsing is done automatically for you. This is my first attempt at getting a flexible and fast CSV parser in C++. It is hosted in Github.


When parsing CSV files the common usage pattern is to iterate through each line and perform a given action on each record. No need for a big CSV container or something similar, so the best approach is to write a class that acts as an iterator. When derreferencing the iterator it will return the provided data structure filled with parsed data from the current line.

An iterator is associated to a container but in this case, it will be constructed accepting an std::istream representing the CSV file. By accepting this istream we will be able to parse strings using the std::stringstream class, regular files, or compressed files using the Boost Iostream library.

The iterator must have all the common operators for it to interoperate with the STL algorithms seamlessly. The empty constructor will mark the iterator’s end-of-range position that coincides with the end of file. A typical use case will be to have some code like this:

#include <fstream>
#include <boost/tuple/tuple.hpp>
#include <csv_iterator.hpp>

using namespace boost::tuples;
typedef tuple<int,std::string,double> record;

void myFunc(const record& value){
    // Your code here

int main(int argc, const char *argv[])
    // Example of csv_iterator usage
    std::ifstream in("myCsvFile.csv");
    csv::iterator<record> it(in);

    // Use the iterator in your algorithms.
    std::for_each(it, csv::iterator<record>(), myFunc);

    return 0;

for parsing CSV files like this:



For obtaining the values on each line, the library uses the Boost Tokenizer library. From a string you create a tokenizer that splits the input string by the character delimiter that by default is the comma. It also takes care of escaping characters. Accessing the different tokens is granted by the tokenizer by providing a token iterator.

using namespace boost;
typedef tokenizer<escaped_list_separator<char> > Tokens;
Tokens tokens(currentLine);
// Tokens can be accessed using tokens.begin() and tokens.end() iterators

Once we have the strings representing different values we have to parse and convert them into a type. Bad data format exceptions happen here and can be spotted earlier. For parsing using an unified approach the library uses the Boost Lexical Cast Library. This library provides a uniform and portable interface for doing text conversions

int myInt = boost::lexical_cast<int>("5");
double myDouble = boost::lexical_cast<double>("3.14");

For the record data structure the library uses the Boost Tuple Library that provides a Plain “old” Data Structure for storing the record fields. Moreover this class provides some template infrastructure that helps in the metaprogramming trick that follows.

Template metaprogamming

For our library the number of columns and the datatype is implicit in the record type. Our algorithm for parsing is basic, depending on the field type, parse the Nth string accordingly, and assign the result to the Nth field. Repeat for all the record fields. This parametric polymorphism must be combined with the dynamic access of the string tokenization. The former can be obtained in C++ using Template metaprogramming. The code that makes the trick is this one

template<class Tuple, int N >
    struct helper {
        static void fill(Tuple& tuple, strIt& it, strIt& end){
            using namespace boost::tuples;
            typedef typename element<length<Tuple>::value-N-1,Tuple>::type value_type;
            get<length<Tuple>::value-N-1>(tuple) = boost::lexical_cast<value_type>(it->c_str());

template<class Tuple>
    struct helper<Tuple, 0> {
        static void fill(Tuple& tuple, strIt& it, strIt& end){
            using namespace boost::tuples;
            typedef typename boost::tuples::element<length<Tuple>::value-1,Tuple>::type value_type;
            boost::tuples::get<length<Tuple>::value-1>(tuple) = boost::lexical_cast<value_type>(it->c_str());

template<class Tuple>
    struct filler {
       static void fill(Tuple& tuple, strIt&& it,strIt&& end){
            helper<Tuple, boost::tuples::length<Tuple>::value-1>::fill(tuple,it,end);

Yes, C++ syntax sucks!! But basically what we are doing here is a common pattern of functional programming, tail recursion. We define structures that contain static functions. The filler structure just initializes the recursive call by instantiating an instance of the helper paremeterized structure setting the length recursion to the number of fields in the tuple. This structure has to be specialized for the 0 value in order for the recursion to stop. All this functionality is done behind the curtain by the compiler (compilation time increases when using template metaprogramming) but the generated code will be something very similar to this pseudocode:

boost::tuples::get<0>(tuple) = (casting_here)*it; 
boost::tuples::get<1>(tuple) = (casting_here)*it; 
boost::tuples::get<2>(tuple) = (casting_here)*it; 

almost identical to the code we would have written by hand. The compiler is the one who knows how many fields our structure has at compile time and generates as many efficient instructions as needed.


I wanted to stretch my programming muscles by coding a generic, flexible and fast CSV parser in C++ using template metaprogramming. The generic and flexible parts of the task have been accomplished but not the performance objectives. Although the library is fast enough for most tasks, it isn’t in a scenario of Big Data parsing big files. The tokenizer iterator is incurring a performance penalty each time I try to derreference it, since it creates and allocates memory for a std::string each time we invoke *it. This memory allocation is a performance drain doing useless work because we need this information only for parsing and getting a value, but the string is discarded thereafter. It would be better to perform an in-place parsing using the string allocated when reading the lines of the file. To that end it will be enlightening in the future to try this same exercise with more performant string libraries like the C++ String Toolkit and see the differences in performance.

Categories: Programming Tags: , , ,

Implementing LDA

November 28th, 2011 No comments

Lately I was playing with Latent Dirichlet Allocation (LDA) for a project at work. If for whatever reason you need to implement such algorithm perhaps you will save some time reading the walkthrough I did.

First you must be sure that LDA is the algorithm you are looking for. From a corpus of documents you will get K lists with words from your documents in them with a number assigned to each word denoting the relevance of the given word in the lists. Each list represents a topic, and that would be your topic description, no fancy words like “Computers”, “Biology” or “Life meaning”, just a set of words that a human must interpret. You could always assign a single name by picking the most prominent word in the list or treating the list as a valued vector and comparing it against a canonical topic description. So take a look at the first examples in this presentation and get inspired.

OK so you need some code to test how this method behaves with your particular data. The first thing to try is the topicmodels package from the R   statistical software package. This can give you an idea of the method and try to use it in a more serious Java application by means of the Mallet library.

But say that you need to create your own implementation because Java horrifies you or because you need a parallel version or whatever the reason. The first thing you have to do is to choose the inference method of your model between variational methods or gibbs sampling. This post will give you some ideas for picking the right method for your particular problem. The original papers picked the variational approach but I went through the Gibbs sampling method because I found this paper where all the mathematical derivations are nailed down. That way I was able to fully understand the method and at the same time being sure that my implementation was right and sound.  If you need more guidance, take a look at this simple implementation for getting an idea of the main functions and data structures you’ll have to code.

Once you have your code written you will have to check whether it is correct or not. The example in this paper using pixel positions and pixel intensities instead of words and word counts is very illustrating and will show visually the correctness of your implementation. Once you have your algorithm up and running perhaps you want to scale it up to more machines, so you could benefit from reading this paper  and taking also a look at this blog post from Alex Smola and their distributed implementation of LDA on Github.

Happy coding!!!

Getting started with JAGS for Bayesian Modelling

September 13th, 2011 8 comments

In the past if you wanted to do some Bayesian Modelling using a Gibbs sampler you had to rely on Winbugs but it isn’t open source and it only runs in Windows. The JAGS project started as a full feature alternative for the Linux platform. Here are some instructions for getting started

First install the required dependencies. In my Ubuntu 11.04, is something like:

sudo apt-get install gfortran liblapack-dev libltdl-dev r-base-core

For Ubuntu 11.04 you have to install JAGS from sources, but it seems that this version will be packaged in the next Ubuntu release. Download the software from here and install.

tar xvfz JAGS-3.1.0.tar.gz
cd JAGS-3.1.0
sudo make install

Now fire R and install the interface package rjags

$ R
> install.packages("rjags",dependencies=TRUE)

Now let’s do something interesting (although pretty simple). Let’s assume we have a stream of 1s and 0s with an unknown proportion of each one. From R we can generate such distribution with the command

> points <- floor(runif(1000)+.4)

that generates a distribution with roughly 40% of 1s and 60% of 0s. So, our stream consists of a sequence of 0s and 1s generated using the uniform(phi) distribution where the phi parameter equals 0.4.

If we don’t know this parameter and we try to learn it, we can assume that this parameter has prior uniform in the range [0,1] and thus the model that describes this scenario in the Winbugs language is

    phi ~ dunif(0,1);
    y ~ dbin(phi,N);

In this model N and y are known, so we provide this information in order to estimate our unknown parameter phi. We create the model and query the resulting parameter distribution:

> result <- list(y=sum(points), N=1000)
> result
[1] 393

[1] 1000

> library(rjags)
Loading required package: coda
Loading required package: lattice
linking to JAGS 3.1.0
module basemod loaded
module bugs loaded
> myModel <- jags.model("model.dat", result)
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
   Graph Size: 5

Initializing model

> update(myModel, 1000)
  |**************************************************| 100%
> x <- jags.samples(myModel, c('phi'), 1000)
  |**************************************************| 100%
> x
[1] 0.3932681

Marginalizing over: iteration(1000),chain(1) 


So the inferred value of phi is 0.3932. One interesting thing in Bayesian statistics is that it does not estimate points, but probabilistic distributions over the parameters. We can see how the phi parameter was estimated by examining the Monte Carlo Chain and the distribution of the generated values during the simulation

> chain <- as.mcmc.list(x$phi)
> plot(chain)

Where we can see that the values for phi in the chain were centered around the 0.4, the true parameter value.

Categories: Data Mining Tags: , , ,

My working environment with Xmonad

September 5th, 2011 6 comments

Fire a terminal, fire another terminal and tail some logs, open your browser and point to the web page you are developing, open your IDE and open three or four tabs with the code you suspect is causing the bug, put some breakpoints, alt-tab to the first terminal start your system under test, connect your debugging IDE to your system, perform some operations to your browser, catch the breakpoint, switch back and forth the code tabs, etc… This daily routine common to most developers, involves grabbing the mouse, arranging some windows, and switching context continuously. This creates a cognitive overload and a lack of productivity because developers are doing tasks not directly related to the task at hand.

This is the reason I don’t use Gnome any more and I’ve switched to Xmonad a tiling window manager that can be controlled almost exclusively with the keyboard. With this fully configurable window manager, I can move, resize, minimize, arrange, customize all the visible windows, move windows between workspaces, all with my hands not leaving the keyboard.The only thing I have not been able to accomplish is having the UrgentHook working for Skype. The Linux version of Skype fails to set the WM_URGENT X11 event when a new chat opens, and if I’m not in that workspace I don’t get any notification besides the bell. Still thinking about a good workaround, any ideas?

Here is a screenshot of xmonad in action with  some applications in it,

If you plan to install Xmonad in your computer, use version 0.10 or superior because it solves some nasty problems with Java applications. In case it is not yet ready for your favourite distribution, follow these instructions. In order to have this configuration working, just write the following 3 files:

  • .xsession with the programs you need to start when the session begins (network manager, batery manager, etc…). Put it in your $HOME dir.
  • .xmobarrc with the configuration of your handy textual status bar . Put it in your $HOME
  • xmonad.hs with the configuration of the window manager itself. Put it under $HOME/.xmonad

xmonad.hs is a pure Haskell file for configuring the window manager, no XML, not another fancy configuration language. Some comments to the configuration file

  • lines 23-26, send Thunderbird and Skype to their dedicated workspaces
  • line 29, name the workspaces
  • lines 32-50, define new key combinations, for navigating the tiling windows, send windows to background and toggle between the tiled arrangement and focusing the whole screen into one window
  • lines 52-55, define how I want the windows to be arranged. Basically, create a specific configuration for Skype in its dedicated workspace, and for the rest of workspaces, don’t hide the menu, allow navigation with the cursors and minimize unwanted windows.
  • lines 57-77, ensemble the main xmonad window manager with the desired configuration. Spawn the status bar, and append the predefined layouts, keybindings and window hooks. Redefine some keys and fool Java setting the name of the Window Manager to LG3D in order to avoid problems with focus.
import XMonad
import XMonad.Hooks.DynamicLog
import XMonad.Hooks.ICCCMFocus
import XMonad.Hooks.ManageDocks
import XMonad.Util.Run(spawnPipe)
import XMonad.Util.EZConfig(additionalKeys)
import System.IO
import XMonad.Hooks.ManageHelpers
import XMonad.Hooks.UrgencyHook
import XMonad.Hooks.SetWMName
import XMonad.Layout.Minimize
import XMonad.Layout.WindowNavigation
import XMonad.Layout.ToggleLayouts
import XMonad.Layout.IM as IM
import XMonad.Layout.PerWorkspace
import XMonad.Layout.Reflect
import XMonad.Layout.Grid
import Data.Ratio ((%))

import qualified Data.Map as M

-- Send applications to their dedicated Workspace
myManageHook = composeAll
                [ className =? "Skype"         --> doShift "4:skype",
                  className =? "Thunderbird"   --> doShift "2:mail"

-- Name the workspaces
myWorkspaces = ["1:dev","2:mail","3:web","4:skype","5:media", "6:office"] ++ map show [7..9]

-- Add new Keys
newKeys x = M.union (keys defaultConfig x) (M.fromList (myKeys x))

myKeys conf@(XConfig {XMonad.modMask = modm}) =
              -- Minimize a window
                ((modm, xK_z),               withFocused minimizeWindow)
              , ((modm .|. shiftMask, xK_z), sendMessage RestoreNextMinimizedWin  )
              -- Window navigation with cursors
              , ((modm,                 xK_Right), sendMessage $ Go R)
              , ((modm,                 xK_Left ), sendMessage $ Go L)
              , ((modm,                 xK_Up   ), sendMessage $ Go U)
              , ((modm,                 xK_Down ), sendMessage $ Go D)
              , ((modm .|. controlMask, xK_Right), sendMessage $ Swap R)
              , ((modm .|. controlMask, xK_Left ), sendMessage $ Swap L)
              , ((modm .|. controlMask, xK_Up   ), sendMessage $ Swap U)
              , ((modm .|. controlMask, xK_Down ), sendMessage $ Swap D)
              -- Togle Fullscreen
              , ((modm,                 xK_f    ), sendMessage ToggleLayout)

-- Define the default layout
skypeLayout = IM.withIM (1%7) (IM.And (ClassName "Skype")  (Role "MainWindow")) Grid
normalLayout = windowNavigation $ minimize $ avoidStruts $ onWorkspace "4:skype" skypeLayout $ layoutHook defaultConfig
myLayout = toggleLayouts (Full) normalLayout

-- Main executable
main = do
    xmproc <- spawnPipe "xmobar /home/cebrian/.xmobarrc"
    xmonad $ withUrgencyHook NoUrgencyHook $ defaultConfig
        { manageHook = manageDocks <+> myManageHook <+> manageHook defaultConfig
        , keys = newKeys
        , workspaces = myWorkspaces
        , layoutHook = myLayout
        , logHook = takeTopFocus >> dynamicLogWithPP xmobarPP
                        { ppOutput = hPutStrLn xmproc
                        , ppTitle = xmobarColor "green" "" . shorten 50
                        , ppUrgent = xmobarColor "yellow" "red" . xmobarStrip
        , modMask = mod4Mask     -- Rebind Mod to the Windows key
        , terminal = "terminator"
        , startupHook = setWMName "LG3D"
        } `additionalKeys`
        [ ((controlMask .|. shiftMask, xK_l), spawn "xscreensaver-command -lock")
        , ((controlMask, xK_Print), spawn "sleep 0.2; scrot -s")
        , ((0, xK_Print), spawn "scrot")



xrdb -merge .Xresources

# Configure xrandr for multiple monitors
# External output may be "VGA" or "VGA-0" or "DVI-0" or "TMDS-1"
# EXTERNAL_LOCATION may be one of: left, right, above, or below

# In case I want to have both monitors on
               EXTERNAL_LOCATION="--left-of $INTERNAL_OUTPUT"
               EXTERNAL_LOCATION="--right-of $INTERNAL_OUTPUT"
               EXTERNAL_LOCATION="--left-of $INTERNAL_OUTPUT"
xrandr | grep $EXTERNAL_OUTPUT | grep " connected "

if [ $? -eq 0 ]; then
    xrandr --output $INTERNAL_OUTPUT --off --output $EXTERNAL_OUTPUT --auto
    # Alternative command in case of trouble:
    # (sleep 2; xrandr --output $INTERNAL_OUTPUT --auto --output $EXTERNAL_OUTPUT --auto $EXTERNAL_LOCATION) &
    xrandr --output $INTERNAL_OUTPUT --auto --output $EXTERNAL_OUTPUT --off

trayer --edge top --align right --SetDockType true --SetPartialStrut true --expand true --width 15 --height 12 --transparent true --tint 0x000000 &

xscreensaver -no-splash &

# Allow nautilus to take care of plugin USB drives and Dropbox icons
nautilus --no-desktop -n &

if [ -x /usr/bin/nm-applet ] ; then
   nm-applet --sm-disable &

if [ -x /usr/bin/gnome-power-manager ] ; then
   sleep 1
   gnome-power-manager &

/usr/bin/gnome-volume-control-applet &
dropbox start &

exec /home/cebrian/.cabal/bin/xmonad


Config { font = "-*-Fixed-Bold-R-Normal-*-13-*-*-*-*-*-*-*"
       , bgColor = "black"
       , fgColor = "grey"
       , position = TopW L 85
       , commands = [ Run Cpu ["-L","3","-H","50","--normal","green","--high","red"] 10
                    , Run Memory ["-t","Mem: <usedratio>%"] 10
                    , Run Swap [] 10
                    , Run Date "%a %b %_d %Y %H:%M:%S" "date" 10
                    , Run StdinReader
       , sepChar = "%"
       , alignSep = "}{"
       , template = "%StdinReader% }{ %cpu% | %memory% * %swap%    <fc=#ee9a00>%date%</fc>"

And don’t forget to add an entry for your new Xmonad desktop in /usr/share/applications/xmonad.desktop changing the path to your appropiate home

[Desktop Entry]
Exec="YOUR HOME HERE"/.xsession

And to /usr/share/xsessions/xmonad.desktop

[Desktop Entry]
Comment=Lightweight tiling window manager
Exec="YOUR HOME HERE"/.xsession
Categories: Uncategorized Tags: , , ,