## Introduction to the Minhash algorithm

Say that you have a collection of N objects and you want to compare each object against every other. Maybe you are interested in the distance between objects in a Euclidean space, or just in the number of attributes or features they share. If you need to perform each comparison, you'll need to perform $\frac{N(N-1)}{2}$ of them, and that's a problem when your N is in the order of millions or tens of millions, likely the order of your company's user or item database. Is there a way to perform fewer comparisons when we aren't interested in the whole set of comparisons but just in finding the set of closest items under a given measure? Yes, the Minhash algorithm.

In this post we are going to focus on the Jaccard coefficient for determining the closeness between two objects. Each of the N objects will have a set of m features, where m is usually very large. You can think of the objects in N as the members of a social network or as Amazon customers, and of the feature sets that describe them as all the friends they have in the social network in the first case, or all the books they previously purchased in the second. Comparing two users would imply comparing their feature sets using the Jaccard coefficient formula:

$$\mathrm{Jaccard}(A,B)=\frac{|A\cap B|}{|A\cup B|}$$
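For instance, if one customer bought the books $\{b_1,b_2,b_3\}$ and another bought $\{b_2,b_3,b_4\}$, their similarity is:

$$\mathrm{Jaccard}(\{b_1,b_2,b_3\},\{b_2,b_3,b_4\})=\frac{|\{b_2,b_3\}|}{|\{b_1,b_2,b_3,b_4\}|}=\frac{2}{4}=0.5$$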

### Understanding the Minhash Algorithm

All the MinHash functionality relies on this mathematical property:

$$P(\min\pi(A)=\min\pi(B)) = \frac{|A\cap B|}{|A\cup B|}$$

where $\pi$ is a function that creates a permutation of its set argument. This reads as follows: given a random ordering of the elements of a universe U to which A and B belong, the probability that the minimum element of both permuted sets coincides is equal to the Jaccard coefficient, defined as the number of elements in the intersection of the sets over the number of elements in their union.

What does this mean? Let's break down the constituent parts. The function $\pi$ transforms the natural ordering of a set into a new ordering, and there are lots of different functions that create different orderings for a given set. Let's see this with an example.

Say that our set is the set of vowels in the alphabet, $\{a,e,i,o,u\}$. A natural ordering of the elements would be $(a,e,i,o,u)$. A different ordering could be given by a function $\pi_X$ that, applied to the set of vowels V, generates the ordering $(o,e,i,u,a)$. For convenience, and since the ordering is a Total Order, we can map the elements of the set to the natural numbers by taking the position of each element in the ordering. For the natural ordering of the vowels we have that

a → 1, e → 2, i → 3, o → 4, u → 5

and with the ordering $\pi_X$ the vowels map to

a → 5, e → 2, i → 3, o → 1, u → 4

Thus given an ordering, if we apply it to a subset of the universe of objects we get their positions in the new ordering. For instance:
$$\pi_X(\{e,i,o\})=\{2,3,1\}$$

So going back to the original formula: the probability that the minimal values of the transformations of two sets, say $A=\{e,i,o\}$ and $B=\{a,e\}$, according to a new ordering $\pi$ coincide is equal to the Jaccard coefficient. In this case we can compute the Jaccard coefficient exactly, and it is:
$$\frac{|\{e,i,o\}\cap\{a,e\}|}{|\{e,i,o\}\cup\{a,e\}|} =\frac{|\{e\}|}{|\{a,e,i,o\}|} = \frac{1}{4}$$

This number means that whenever we pick a transformation at random from the set of all possible transformations, the probability that the smallest mapped value is the same in both sets is 0.25. Bear in mind that this is a probability, not a rule. If the permutation chosen at random were the previously defined $\pi_X$, we would have $\min\pi_X(\{e,i,o\})=1$ while $\min\pi_X(\{a,e\})=2$, so for that particular permutation the minima do not coincide.
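If instead we average over all possible permutations, the 0.25 shows up exactly. Here is a quick sanity check of the property, a minimal sketch in Python (the code in the rest of the post is Haskell; this snippet is my own, not part of the original program):

from itertools import permutations

A = {'e', 'i', 'o'}
B = {'a', 'e'}

# Enumerate all 120 orderings of the vowels and count how often the
# minimum position of A coincides with the minimum position of B
orderings = list(permutations('aeiou'))
matches = 0
for ordering in orderings:
    pos = {v: i + 1 for i, v in enumerate(ordering)}
    if min(pos[x] for x in A) == min(pos[x] for x in B):
        matches += 1

print(matches / len(orderings))  # 0.25, the Jaccard coefficient of A and B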

### Hashing functions

Defining a random ordering for all the elements of a universe is costly: we would have to keep in memory tables like the ones we saw for the vowels, which is not practical when the universe of objects is large. So instead of maintaining an explicit random permutation, we use the implicit permutation that hashing functions with good properties give us. The good property we need is a low probability of collision over the universe, because a random permutation is a 1-to-1 relationship between elements and positions, and collisions in the hashing function would break this property.

From now on, I'm going to assume that every universe we need to work with is smaller than the range of integers, and thus restrict all the derivations to integers, since every set and ordering can be mapped to the natural integer ordering $(1,2,3,\ldots)$.

Possible hashing functions for integers could be:

• Use the FNV-1 hashing function of the string representation of your integer. This has the advantage that it naturally deals with hashing sets of integers by means of concatenating their string representations.
• Use a pseudo-random scheme: pick a large prime number, say $p=2147483647$, and a set of random coefficients $a,b$, such that our hashing functions are defined as $h_{a,b}(x)=(a\cdot x+b)\bmod p$.

In this post we will use the second kind of function, expressed in this Haskell code:

p :: Int
p=2147483647

coefs :: [Int]
coefs = unfoldr (Just . randomR (0,p)) $ mkStdGen 0

hashFuncs :: [Int->Int]
hashFuncs = go coefs
  where
    go theCoefs = let (a:b:[],rest) = splitAt 2 theCoefs
                  in (\x -> (a*x+b) `mod` p) : go rest

So, given a hashing function, which is the minimum hash over a given set? This code gives you that result:

minHash :: (Int->Int) -> [Int] -> Int
minHash f = foldl' hashAndMin maxBound
  where
    hashAndMin a b = min a (f b)

### Shingles

OK, now you want to compare two sets, A and B, to see how similar they are. The similarity measure here will be the proportion of items they both share, the Jaccard coefficient. The time complexity of the set intersection is $O(|A|+|B|)$ if you first build a Hash Map and then query it with the second set for existence. If both sets are already sorted you can compare them in $O(|A|+|B|)$ with a linear merge. Can you compare them faster? Yes, but only if you are willing to trade accuracy for speed. The probability formula at the beginning of the post states that the min hashes of the two sets are equal with probability equal to the Jaccard coefficient. That means that if we take r different hashing functions (permutations) and we name s the Jaccard coefficient between the two sets, each pair of min hashes matches with probability s. Since this is a collection of Bernoulli trials, the number k of min hashes that are equal in both sets follows a Binomial distribution with probability mass function:

$$f(k;r,s)= {r \choose k}s^k(1-s)^{r-k}$$

It is known that a random variable k distributed according to the binomial distribution has $E[k]=rs$, so a good estimator of s is $\hat{s}=k/r$: divide the number of matches by the number of hashing functions. This gives you the estimated value of the Jaccard coefficient. But before comparing two fingerprints we have to compute them:

shingle :: Int -> [Int] -> [Int]
shingle c nums = [minHash f nums | f <- take c hashFuncs]

The shingle function gives us a set of c fingerprints of our original big set of numbers. So let's see it in action. Imagine that we want to compute the Jaccard coefficient between the bags of letters in the phrases "Humpty Dumpty sat on a wall" and "Humpty Dumpty had a great fall". First let's compute the Jaccard coefficient according to the definition:

import Data.List
import System.Random
import Data.Char

phrase1 = "Humpty Dumpty sat on a wall"
phrase2 = "Humpty Dumpty had a great fall"

mkSet = nub . filter isAlpha . map toLower

-- Set 1 is the collection of letters in the first phrase
-- "humptydsaonwl"
set1 = map ord $ mkSet phrase1
-- Set 2 is the collection of letters in the second phrase
-- "humptydagrefl"
set2 = map ord $ mkSet phrase2

jackard = a/b
  where
    a = fromIntegral $ length (set1 `intersect` set2) :: Double
    b = fromIntegral $ length (set1 `union` set2) :: Double

p :: Int
p = 2147483647

coefs :: [Int]
coefs = unfoldr (Just . randomR (0,p)) $ mkStdGen 0

hashFuncs :: [Int->Int]
hashFuncs = go coefs
  where
    go theCoefs = let (a:b:[],rest) = splitAt 2 theCoefs
                  in (\x -> (a*x+b) `mod` p) : go rest

minHash :: (Int->Int) -> [Int] -> Int
minHash f = foldl' hashAndMin maxBound
  where
    hashAndMin a b = min a (f b)

shingle :: Int -> [Int] -> [Int]
shingle c nums = [minHash f nums | f <- take c hashFuncs]

shingles1 = map (\x -> shingle x set1) [1..]
shingles2 = map (\x -> shingle x set2) [1..]

jackard2 = map approxJackard shingles
  where
    shingles = zip shingles1 shingles2
    approxJackard :: ([Int],[Int]) -> Double
    approxJackard (as,bs) = let pairs   = zip as bs
                                matches = filter (\(a,b) -> a == b) pairs
                                num = fromIntegral $ length matches :: Double
                                den = fromIntegral $ length as :: Double
                            in num/den

If we plot jackard2 we get an approximation that approaches the exact Jaccard coefficient as the number of shingles grows.
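This behaviour is expected: since the estimator $\hat{s}=k/r$ is an average of r Bernoulli trials, its standard error shrinks with the square root of the number of hashing functions:

$$\mathrm{SE}(\hat{s}) = \sqrt{\frac{s(1-s)}{r}}$$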

### Lost?

OK, let me summarize a little bit.

1. You came here because you have the problem of comparing an object against N different objects, and N is very large.
2. The comparison you want to perform is to intersect the attribute sets of the objects and count how many attributes they share.
3. You want a cheap procedure for computing (2).

So the solution is to transform each object into a small set of c values, its fingerprint, and then compute the Jaccard coefficient on the reduced sets of values. For instance, each object could be the set of friends in a social network, and the value c could be just 10. Here we reduce the computation of the intersection from the hundreds of friends to just 10 values: your friends got summarized in a set of 10 values.

Can we do better? Yes. Here we reduced the cost of each comparison from the cardinality of the object's feature set to the small number c, but we still have to perform N comparisons against the N individuals in the population. Wouldn't it be nice to just pick the individuals we think are more likely to have a high Jaccard coefficient? That's possible with the use of bands, but since this post is already very long, I'll leave that for another post.

#### References

1. Min-wise independent permutations. The paper with the math that demonstrates the link between the Jaccard coefficient and the probability.
2. Chapter 3 of the excellent book “Mining Massive Datasets”.


## Training Gradient Boosting Trees with Python

I’ve been doing some data mining lately, especially looking into Gradient Boosting Trees, since it is claimed to be one of the techniques with the best out-of-the-box performance. In order to get a better understanding of the technique, I’ve reproduced the example of section 10.14.1, California Housing, in the book The Elements of Statistical Learning. Each point of this dataset represents the house value of a property, together with some attributes of that house. You can get the data and the description of those attributes from here.

I know that the whole exercise here can easily be done with the R package gbm, but I wanted to do it using Python. Since learning several languages well enough is difficult and time consuming, I prefer to stick to Python for all my data analysis instead of doing it in R, even though R is superior in some cases. Having only one language for doing all your scripting, systems programming and prototyping PLUS your data analysis is a good reason for me: the upfront effort of learning the language, setting up your tools and editors, etc. is done only once instead of twice.

### Data Preparation and Model Fitting

The first thing to do is to load the data into a Pandas dataframe

import numpy as np
import pandas as pd

columnNames = ['HouseVal','MedInc','HouseAge','AveRooms',
               'AveBedrms','Population','AveOccup','Latitude','Longitud']

# The call that actually reads the file was lost from the original
# listing; assuming the whitespace-separated cadata file, something like:
df = pd.read_csv('cadata.txt', sep='\s+', names=columnNames)



Now we have to split the dataset into training and validation sets. The training data will be used to generate the trees that constitute the final averaged model.

import random

X = df[df.columns - ['HouseVal']]
Y = df['HouseVal']
rows = random.sample(df.index, int(len(df)*.80))
x_train, y_train = X.ix[rows],Y.ix[rows]
x_test,y_test  = X.drop(rows),Y.drop(rows)


We then fit a Gradient Tree Boosting model to the data using the scikit-learn package. We will use 500 trees, each having a depth of 6 levels. In order to get results similar to those in the book, we also use the Huber loss function:

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score

params = {'n_estimators': 500, 'max_depth': 6,
          'learn_rate': 0.1, 'loss': 'huber', 'alpha': 0.95}

# The fitting call itself was lost from the original listing;
# the standard scikit-learn usage of those params is:
clf = GradientBoostingRegressor(**params)
clf.fit(x_train, y_train)


For me, the Mean Squared Error wasn’t very informative, so I used the R2 coefficient of determination instead. This measure indicates how well one variable is able to predict the other: values close to 0 mean poor prediction and values close to 1 mean perfect prediction. In the book they claim a 0.84, against a 0.86 reported in the paper that created the dataset using a highly tuned algorithm. I’m getting a good 0.83 without much tuning of the parameters, so it’s a good out-of-the-box technique.
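For reference, the coefficient of determination computed by r2_score below compares the residual sum of squares against the total variance of the target:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$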

mse = mean_squared_error(y_test, clf.predict(x_test))
r2 = r2_score(y_test, clf.predict(x_test))

print("MSE: %.4f" % mse)
print("R2: %.4f" % r2)


### Data Analysis

Let’s plot how the training and testing errors behave:

import matplotlib.pyplot as plt

# compute test set deviance
test_score = np.zeros((params['n_estimators'],), dtype=np.float64)

for i, y_pred in enumerate(clf.staged_decision_function(x_test)):
    test_score[i] = clf.loss_(y_test, y_pred)

plt.figure(figsize=(12, 6))
plt.subplot(1, 1, 1)
plt.title('Deviance')
plt.plot(np.arange(params['n_estimators']) + 1, clf.train_score_, 'b-',
         label='Training Set Deviance')
plt.plot(np.arange(params['n_estimators']) + 1, test_score, 'r-',
         label='Test Set Deviance')
plt.legend(loc='upper right')
plt.xlabel('Boosting Iterations')
plt.ylabel('Deviance')


As you can see in the previous graph, although the training error keeps going down as we add more trees to the model, the test error remains more or less constant and the model doesn’t overfit. This is mainly due to the shrinkage parameter, and it is one of the good features of this algorithm.
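The shrinkage parameter ν (the learn_rate above) scales down the contribution of each new tree in the boosting update, as described in The Elements of Statistical Learning:

$$f_m(x) = f_{m-1}(x) + \nu \sum_{j=1}^{J_m} \gamma_{jm}\, I(x \in R_{jm}), \qquad 0 < \nu \le 1$$

where the $R_{jm}$ are the regions of the m-th tree and the $\gamma_{jm}$ its leaf values.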

When doing data mining, finding a good model is as important as being able to interpret it, because preemptive actions can be taken based on that analysis and interpretation. Although individual trees are easily interpretable, when you add several hundred of them together interpretation becomes more difficult, and you usually have to rely on measures of the predictive power of each feature. Let’s plot the feature importance in predicting the House Value:

feature_importance = clf.feature_importances_
# make importances relative to max importance
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5
plt.figure(figsize=(12, 6))
plt.subplot(1, 1, 1)
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, X.columns[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
plt.show()


Once variable importance has been identified, we can try to investigate how those variables interact. For instance, we can plot the dependence of the target variable on another variable, averaged over the values of the other variables not being taken into consideration. Some variables present a clear monotonic dependence with the target value, while others seem not very related to the target variable even though they ranked high in the previous plot. This could be signaling an interaction between variables that could be further studied.
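Concretely, the partial dependence function plotted below estimates the effect of a feature subset $x_S$ by averaging the model's predictions over the training values of the complementary features $x_C$:

$$\bar{f}_S(x_S) = \frac{1}{N}\sum_{i=1}^{N} f(x_S,\, x_{iC})$$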

from sklearn.ensemble.partial_dependence import plot_partial_dependence

fig, axs = plot_partial_dependence(clf, x_train,
                                   features=[3, 2, 7, 6],
                                   feature_names=x_train.columns,
                                   n_cols=2)

fig.show()


The last step was to explore the capabilities of the Python libraries for plotting data on a map. Here we plot the predicted House Value in California, using Latitude and Longitude as the plotting axes:

from mpl_toolkits.basemap import Basemap
predDf = pd.DataFrame(x_test.copy())
predDf['y_pred'] = clf.predict(x_test)

def california_map(ax=None, lllat=31.5, urlat=42.5,
                   lllon=-124, urlon=-113):
    m = Basemap(ax=ax, projection='stere',
                lon_0=(urlon + lllon) / 2,
                lat_0=(urlat + lllat) / 2,
                llcrnrlat=lllat, urcrnrlat=urlat,
                llcrnrlon=lllon, urcrnrlon=urlon,
                resolution='f')
    m.drawstates()
    m.drawcountries()
    m.drawcoastlines(color='lightblue')
    return m

plt.figure()
m= california_map()
predDf = predDf.sort('y_pred') # Useful for plotting
x,y = m(predDf['Longitud'],predDf['Latitude'])
serieA = (np.array(predDf['y_pred']) - predDf['y_pred'].min())/(predDf['y_pred'].max()-predDf['y_pred'].min())
# z = plt.cm.jet(serieA)
z = np.array(predDf['y_pred'])/1000
m.scatter(x,y,c=z,s=60,alpha=0.5,edgecolors='none')
c = m.colorbar(location='right')
c.set_label("House Value (Thousands of $)") m.plot() plt.show()  ## Addendum This blog post was written using Pylit as the tool for doing Literate Programming. The code can be seen here. I think that literate programming is way superior to other software methodologies like TDD when coding algorithms for data analysis. The main problem I find right now is the lack of proper tooling for really taking advantage of literate programming, but this is a technique that I’m definitely going to research deeper. Categories: Data Mining Tags: ## Thread pool in C++ May 23rd, 2012 No comments I needed to convert user ids spread across a lot of files into a fixed range [0..N] where N was the total number of Ids in the dataset. First I though that since those files came from a Hadoop cluster I should write a MR job to do the task. But since recoding ids needs a “central authority” giving unique ids without collision, MapReduce wasn’t an option because MR thinks about each record as independent of the rest of records, so coordinating the assignment of ids is both difficult and unnatural in MapReduce. The naïve approach is to create a counter, loop through all the ids and whenever an id is not in the dictionary, use the counter as the new translation. See the pseudocode int counter = 0; for id in ids: if id not in dict: dict[id] = counter counter++ print dict[i]  But then you are wasting precious cores that could help you. My machine has eight cores, so for a task that runs in aprox 60 minutes, so after investing time in going beyond the naïve approach, I’m able to lower it to 10 minutes. That means that I can run tests 6 times faster, it will pay off. ### Lookup table So the first thing to do is to create a thread safe Hash Map. Most of the time the access will be for reading a value and less frequently for updating the data structure (in my problem I perform 250 reads for each write) so this scenario is ideal for a Readers-writer lock. I use the Boost Thread library with its boost::shared_mutex for getting the multiple access functionality. The class is something like this using namespace boost; class LookupTable { private: typedef std::unordered_map<int,unsigned int> HashMap; HashMap dict; unsigned int counter; shared_mutex m_mutex; public: LookupTable(){}; unsigned int translate(int number){ boost::shared_lock<boost::shared_mutex> lck(m_mutex); unsigned int result; HashMap::iterator elem = dict.find(key); if( elem == dict.end()){ lck.unlock(); boost::upgrade_lock<boost::shared_mutex> uLck(m_mutex); boost::upgrade_to_unique_lock<boost::shared_mutex> uuLck(uLck); dict[key] = counter; result = counter; counter++; } else { result = elem->second; } return result; } };  ### Thread pool Once we have the thread safe data structure in place, we want to create a thread pool were we can send tasks. The thread pool will be responsible to assign each task to the next free thread. The reason for having a thread pool instead of spawning as many threads as tasks is two fold, first, the amount of work I can do is bounded by the speed at which I’m able to read from disk, so throwing more threads doesn’t seem to help here. Second, since all the threads must query the lookup table, if there are lots of them the synchronization (mutex locking and unlocking) could become heavier than the work per thread becoming a bottleneck. Boost provides a nice thread pool by using the Boost::Asio library. I came to this pattern of usage by reading this and this but it happens that they are wrong in a subtle detail. 
As they are written, they only run one task per thread and then the io_service is stopped. After scratching my head for a couple of hours, I reread the official documentation and found the solution explained at the bottom of the page. The key issue is to explicitly destroy the work variable that we created, so that the io_service does not end too early. To accomplish that, just embed the work in a smart pointer std::auto_ptr and reset it when necessary; the reset will call the work destructor. So the main program would be something like this:

// Thread pool
asio::io_service io_service;
boost::thread_group threads;
auto_ptr<asio::io_service::work> work(new asio::io_service::work(io_service));

// Spawn enough worker threads
int cores_number = boost::thread::hardware_concurrency();
for (std::size_t i = 0; i < cores_number; ++i){
    threads.create_thread(boost::bind(&asio::io_service::run, &io_service));
}

// Post the tasks to the io_service
for(vector<string>::iterator it=filenames.begin(); it!=filenames.end(); ++it){
    io_service.dispatch(std::move(translator(*it,dict)));
}

// Destroying the work object lets run() return once the queue is
// empty; join_all then waits for the workers to finish
work.reset();
threads.join_all();

and the code for the worker (sketched):

struct translator {
    translator(string filename, LookupTable& dict)
        : m_filename(filename), m_dict(dict) { }

    void operator()(){
        // DO YOUR WORKER ACTIVITY HERE
        // ...
        return;
    }

    // Members implied by the initializer list above
    string m_filename;
    LookupTable& m_dict;
};

## Parser combinators in Scala

February 23rd, 2012

Lately, I’ve been writing some matrix factorization code for a video recommender system. I’m a big fan of Test Driven Development, but it is a technique that is difficult for me to apply to the numerical or algorithmic domain. In order to make sure that the code I write is correct, what I’m doing is the Oracle Test Pattern (I’m sure there is a proper name for this but I’m not able to find it right now). The idea is to have a reference implementation that solves the exact problem you are trying to solve. In my case that reference implementation does not exist, but I’m writing a straightforward unoptimized version of the algorithm in Python using the Numpy libraries. That will be my ground truth when comparing results with the highly optimized parallel Scala version of the algorithm.

### The problem

The problem is that I’m working interactively, coding the algorithm in an iPython session, and I’m getting results of this kind:

In [4]: x
Out[4]:
matrix([[ 0.03893565, -0.35836827, -2.06492572,  1.49773613, -1.01988835,
         -0.20590096, -0.19658741],
        [ 0.43055155,  1.07532444,  0.89299596, -1.070371  , -0.24015718,
          0.04521229, -1.39209522],
        [-0.4482701 ,  0.15201451, -1.42824771,  1.13859559,  0.66432642,
          0.51184435,  0.52637519],
        [-0.26518471, -1.14331753, -1.15492029, -0.27501194,  1.73750282,
         -1.4118682 ,  0.14701005],
        [-1.6577536 , -0.0781593 , -0.01558478,  0.67277257, -0.07249647,
          0.70946581, -0.82349608]])

I would like to just copy and paste this string and use it as the expected value in my Scala tests. That would involve removing “matrix”, removing all the “[” and “]”, substituting them for their Array() equivalents, and putting an F at the end of each number to denote that it is a Float. Too much work.

### The solution

If there is an area where functional languages shine, it is in creating DSLs; more specifically, in creating an internal DSL that looks like an external DSL. In Scala you can use the Parser Combinator library that is part of Scala’s core libraries.
Such a parser will be something like:

import scala.util.parsing.combinator.JavaTokenParsers

class PyMatrixParser extends JavaTokenParsers {
  def matrix:Parser[Array[Array[Float]]] =
    "matrix([" ~> repsep(row,",") <~ "])" ^^ (_.toArray)
  def row:Parser[Array[Float]] =
    "[" ~> repsep(floatingPointNumber,",") <~ "]" ^^ (_.map(_.toFloat).toArray)
}

Using this parser is then just a matter of:

val parser = new PyMatrixParser
val scalaMatrix = parser.parseAll(parser.matrix, theStringMatrix).get

Quite a good result for just 2 lines of code defining the parser. This is a quick overview for those not familiar with Scala’s syntax:

• Every String that is found where a Scala Parser was meant to be is automatically transformed into a constant parser that matches that exact string. So, for instance, "matrix([" gets converted into a parser by one of the implicits in the parser combinator library.
• repsep(row, ",") takes two parsers as parameters and means that parser #1 will be repeated 0 or more times, interleaved with parser #2.
• The parser combinators "~>" and "<~" denote that the parser pointed to by the arrow head must be kept as the result of the parsing, while the parser pointed to by the tail must be discarded. This is helpful for combining two parsers where one of them is just ornamental.
• floatingPointNumber is a parser provided by the library that parses floating point number string representations.
• Each parser returns either the parsed string or a more complex Scala structure, like, for instance, a list of strings in the case of repsep(parser1, parser2). Those results are sent to a parser transformator (the ^^ operator) that works on the parser results and generates the true parsing result. In the example, we first create an array of Floats, and then an Array of Arrays of Floats that represents my matrix.

A really cool feature that spared me a lot of grunt work with just two lines of code.

## Scala project template (SBT 0.11.2)

February 15th, 2012

Some months ago I wrote a post about setting up a Scala project with SBT. SBT has evolved and the new version differs from the 0.7.7 instructions. There is a Github repository with the updated version for SBT 0.11.2.

1. Download the sbt jar and create the script as it is explained here.

2. Create the project folder:

mkdir sbt-project-template
cd sbt-project-template

3. Create the project definition with the build.sbt file that must reside in the project’s root directory:

name := "TestProject"

version := "1.0-SNAPSHOT"

organization := "com.example"

scalaVersion := "2.9.1"

libraryDependencies += "org.scalatest" %% "scalatest" % "1.7.1" % "test"

Blank lines in between are important. Don’t remove them.

4. Let’s write some classes and tests in order to check that everything is OK. First create the class under test in src/main/scala/Calc.scala:

package com.example

object Calc {
  def add(x:Int, y:Int) : Int = x + y
}

5. Now create a test in src/test/scala/CalcTest.scala that checks that the class is working as expected:

package com.example.test

import com.example.Calc
import org.scalatest.Suite

class CalcTest extends Suite {
  def testAddition() {
    assert(4 === Calc.add(2,2))
  }
}

6. Now just check that the tests pass by doing

sbt clean test

You should see something like this:

[info] Loading project definition from /home/cebrian/GIT/sbt-project-template/project
[info] Set current project to TestProject (in build file:/home/cebrian/GIT/sbt-project-template/)
[success] Total time: 0 s, completed 14-feb-2012 23:29:30
[info] Updating {file:/home/cebrian/GIT/sbt-project-template/}default-4639fd...
[info] Resolving org.scala-lang#scala-library;2.9.1 ...
[info] Resolving org.scalatest#scalatest_2.9.1;1.7.1 ...
[info] Done updating.
[info] Compiling 1 Scala source to /home/cebrian/GIT/sbt-project-template/target/scala-2.9.1/classes...
[info] Compiling 1 Scala source to /home/cebrian/GIT/sbt-project-template/target/scala-2.9.1/test-classes...
[info] CalcTest:
[info] - testAddition
[info] Passed: : Total 1, Failed 0, Errors 0, Passed 1, Skipped 0
[success] Total time: 7 s, completed 14-feb-2012 23:29:37

Simpler than before.

## CSV parser in C++ with Boost and Template Metaprogramming

December 28th, 2011

One common situation when analyzing big chunks of data is to parse big CSV files with a record structure on each line. That is, all lines conform to a fixed schema where each line represents a record and each record has several columns of different types. Your objective is to parse each line and fill a data structure like this:

struct foo {
    int a;
    std::string b;
    double c;
};

All the information needed for such parsing is in the data structure itself, so I wondered whether I could write a program where you pass in that type information and the parsing is done automatically for you. This is my first attempt at getting a flexible and fast CSV parser in C++. It is hosted on Github.

### Design

When parsing CSV files the common usage pattern is to iterate through each line and perform a given action on each record. There is no need for a big CSV container or anything similar, so the best approach is to write a class that acts as an iterator. When dereferenced, the iterator returns the provided data structure filled with parsed data from the current line. An iterator is usually associated with a container, but in this case it is constructed from a std::istream representing the CSV file. By accepting an istream we are able to parse strings using the std::stringstream class, regular files, or compressed files using the Boost Iostreams library. The iterator must have all the common operators for it to interoperate with the STL algorithms seamlessly. The empty constructor marks the iterator’s end-of-range position, which coincides with the end of file. A typical use case will be some code like this:

#include <algorithm>
#include <fstream>
#include <boost/tuple/tuple.hpp>
#include <csv_iterator.hpp>

using namespace boost::tuples;

typedef tuple<int,std::string,double> record;

void myFunc(const record& value){
    // Your code here
}

int main(int argc, const char *argv[]) {
    // Example of csv_iterator usage
    std::ifstream in("myCsvFile.csv");
    csv::iterator<record> it(in);
    // Use the iterator in your algorithms.
    std::for_each(it, csv::iterator<record>(), myFunc);
    return 0;
}

for parsing CSV files like this:

1,hola,3.14
2,adios,2.71

### Implementation

For obtaining the values on each line, the library uses the Boost Tokenizer library. From a string you create a tokenizer that splits the input string by the character delimiter, which by default is the comma. It also takes care of escaping characters. The different tokens are accessed through the token iterator provided by the tokenizer:

using namespace boost;
typedef tokenizer<escaped_list_separator<char> > Tokens;
Tokens tokens(currentLine);
// Tokens can be accessed using tokens.begin() and tokens.end() iterators

Once we have the strings representing the different values, we have to parse and convert them into the proper types. Bad data format exceptions happen here and can be spotted early.
For parsing with a unified approach, the library uses the Boost Lexical Cast library, which provides a uniform and portable interface for doing text conversions:

int myInt = boost::lexical_cast<int>("5");
double myDouble = boost::lexical_cast<double>("3.14");

For the record data structure the library uses the Boost Tuple library, which provides a plain “old” data structure for storing the record fields. Moreover, this class provides some template infrastructure that helps in the metaprogramming trick that follows.

### Template metaprogramming

For our library, the number of columns and their datatypes are implicit in the record type. Our parsing algorithm is basic: depending on the field type, parse the Nth string accordingly and assign the result to the Nth field, repeating for all the record fields. This parametric polymorphism must be combined with the dynamic access of the string tokenization, and the former can be obtained in C++ using template metaprogramming. The code that makes the trick is this one:

template<class Tuple, int N>
struct helper {
    static void fill(Tuple& tuple, strIt& it, strIt& end){
        using namespace boost::tuples;
        typedef typename element<length<Tuple>::value-N-1,Tuple>::type value_type;
        checkIteratorRange(it,end);
        get<length<Tuple>::value-N-1>(tuple) =
            boost::lexical_cast<value_type>(it->c_str());
        ++it;
        helper<Tuple,N-1>::fill(tuple,it,end);
    }
};

template<class Tuple>
struct helper<Tuple, 0> {
    static void fill(Tuple& tuple, strIt& it, strIt& end){
        using namespace boost::tuples;
        typedef typename boost::tuples::element<length<Tuple>::value-1,Tuple>::type value_type;
        checkIteratorRange(it,end);
        boost::tuples::get<length<Tuple>::value-1>(tuple) =
            boost::lexical_cast<value_type>(it->c_str());
        ++it;
    }
};

template<class Tuple>
struct filler {
    static void fill(Tuple& tuple, strIt&& it, strIt&& end){
        checkIteratorRange(it,end);
        helper<Tuple, boost::tuples::length<Tuple>::value-1>::fill(tuple,it,end);
    }
};

Yes, C++ syntax sucks!! But basically what we are doing here is a common pattern of functional programming: tail recursion. We define structures that contain static functions. The filler structure just initializes the recursive call by instantiating the parameterized helper structure, setting the recursion length to the number of fields in the tuple. The helper structure has to be specialized for the value 0 in order for the recursion to stop. All this functionality is resolved behind the curtain by the compiler (compilation time increases when using template metaprogramming), but the generated code will be something very similar to this pseudocode:

boost::tuples::get<0>(tuple) = (casting_here)*it; ++it;
boost::tuples::get<1>(tuple) = (casting_here)*it; ++it;
boost::tuples::get<2>(tuple) = (casting_here)*it; ++it;

almost identical to the code we would have written by hand. The compiler is the one who knows how many fields our structure has at compile time, and it generates as many efficient instructions as needed.

### Conclusion

I wanted to stretch my programming muscles by coding a generic, flexible and fast CSV parser in C++ using template metaprogramming. The generic and flexible parts of the task have been accomplished, but not the performance objectives. Although the library is fast enough for most tasks, it isn’t in a Big Data scenario parsing big files. The tokenizer iterator incurs a performance penalty each time I dereference it, since it creates and allocates memory for a std::string each time we invoke *it.
This memory allocation is a performance drain doing useless work, because we need the information only for parsing and getting a value, and the string is discarded thereafter. It would be better to perform an in-place parsing using the string already allocated when reading the lines of the file. To that end, it will be enlightening to try this same exercise in the future with more performant string libraries like the C++ String Toolkit and see the differences in performance.

## Implementing LDA

November 28th, 2011

Lately I was playing with Latent Dirichlet Allocation (LDA) for a project at work. If for whatever reason you need to implement such an algorithm, perhaps you will save some time by reading the walkthrough I did.

First you must be sure that LDA is the algorithm you are looking for. From a corpus of documents you will get K lists of words from your documents, with a number assigned to each word denoting the relevance of that word in the list. Each list represents a topic, and that would be your topic description: no fancy labels like “Computers”, “Biology” or “Life meaning”, just a set of words that a human must interpret. You could always assign a single name by picking the most prominent word in the list, or by treating the list as a valued vector and comparing it against a canonical topic description. So take a look at the first examples in this presentation and get inspired.

OK, so you need some code to test how this method behaves with your particular data. The first thing to try is the topicmodels package for the R statistical software. This can give you an idea of the method, and then you can try it in a more serious Java application by means of the Mallet library. But say that you need to create your own implementation, because Java horrifies you, or because you need a parallel version, or whatever the reason. The first thing you have to do is to choose the inference method for your model, between variational methods and Gibbs sampling. This post will give you some ideas for picking the right method for your particular problem. The original papers picked the variational approach, but I went with the Gibbs sampling method because I found this paper where all the mathematical derivations are nailed down. That way I was able to fully understand the method and at the same time be sure that my implementation was right and sound. If you need more guidance, take a look at this simple implementation for getting an idea of the main functions and data structures you’ll have to code.

Once you have your code written you will have to check whether it is correct or not. The example in this paper, using pixel positions and pixel intensities instead of words and word counts, is very illustrative and will show visually the correctness of your implementation.

Once you have your algorithm up and running, perhaps you want to scale it up to more machines, so you could benefit from reading this paper and taking a look at this blog post from Alex Smola and their distributed implementation of LDA on Github.

Happy coding!!!

## Getting started with JAGS for Bayesian Modelling

September 13th, 2011

In the past, if you wanted to do some Bayesian modelling using a Gibbs sampler you had to rely on Winbugs, but it isn’t open source and it only runs on Windows. The JAGS project started as a full-featured alternative for the Linux platform. Here are some instructions for getting started. First install the required dependencies.
In my Ubuntu 11.04, that is something like:

sudo apt-get install gfortran liblapack-dev libltdl-dev r-base-core

For Ubuntu 11.04 you have to install JAGS from sources, but it seems that this version will be packaged in the next Ubuntu release. Download the software from here and install it:

tar xvfz JAGS-3.1.0.tar.gz
cd JAGS-3.1.0
./configure
make
sudo make install

Now fire up R and install the interface package rjags:

$ R
...
> install.packages("rjags",dependencies=TRUE)


Now let's do something interesting (although pretty simple). Let's assume we have a stream of 1s and 0s with an unknown proportion of each one. From R we can generate such a sample with the command

> points <- floor(runif(1000)+.4)


which generates a sample with roughly 40% of 1s and 60% of 0s. So our stream consists of a sequence of 0s and 1s generated from a Bernoulli(phi) distribution where the phi parameter equals 0.4.
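To see why, note that the floored value is 1 exactly when the uniform draw lands in the top 40% of its range:

$$P\left(\lfloor U + 0.4 \rfloor = 1\right) = P(U \ge 0.6) = 0.4, \qquad U \sim \mathrm{Uniform}(0,1)$$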

If we don't know this parameter and we try to learn it, we can assume that the parameter has a uniform prior in the range [0,1], and thus the model that describes this scenario in the Winbugs language is

model
{
phi ~ dunif(0,1);
y ~ dbin(phi,N);
}


In this model N and y are known, so we provide this information in order to estimate our unknown parameter phi. We create the model and query the resulting parameter distribution:

> result <- list(y=sum(points), N=1000)
> result
$y
[1] 393

$N
[1] 1000

> library(rjags)
> myModel <- jags.model("model.dat", result)
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph Size: 5

Initializing model

> update(myModel, 1000)
|**************************************************| 100%
> x <- jags.samples(myModel, c('phi'), 1000)
|**************************************************| 100%
> x
$phi
mcarray:
[1] 0.3932681

Marginalizing over: iteration(1000),chain(1)
>

So the inferred value of phi is 0.3932. One interesting thing about Bayesian statistics is that it does not estimate point values, but probability distributions over the parameters. We can see how the phi parameter was estimated by examining the Monte Carlo chain and the distribution of the values generated during the simulation:

> chain <- as.mcmc.list(x$phi)
> plot(chain)


where we can see that the values for phi in the chain were centered around 0.4, the true parameter value.
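Incidentally, this simple model has a closed-form posterior we can check the chain against: a Binomial likelihood with a uniform (Beta(1,1)) prior yields

$$\phi \mid y \sim \mathrm{Beta}(y+1,\; N-y+1), \qquad E[\phi \mid y] = \frac{y+1}{N+2} = \frac{394}{1002} \approx 0.3932$$

in agreement with the value sampled by JAGS.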


## My working environment with Xmonad

Fire a terminal, fire another terminal and tail some logs, open your browser and point it to the web page you are developing, open your IDE and open three or four tabs with the code you suspect is causing the bug, put some breakpoints, alt-tab to the first terminal, start your system under test, connect your debugging IDE to your system, perform some operations in your browser, catch the breakpoint, switch back and forth between the code tabs, etc… This daily routine, common to most developers, involves grabbing the mouse, arranging some windows, and switching context continuously. This creates cognitive overload and hurts productivity, because developers are doing tasks not directly related to the task at hand.

This is the reason I don’t use Gnome any more and I’ve switched to Xmonad, a tiling window manager that can be controlled almost exclusively with the keyboard. With this fully configurable window manager I can move, resize, minimize, arrange and customize all the visible windows, and move windows between workspaces, all without my hands leaving the keyboard. The only thing I have not been able to accomplish is having the UrgentHook working for Skype. The Linux version of Skype fails to set the WM_URGENT X11 event when a new chat opens, and if I’m not in that workspace I don’t get any notification besides the bell. Still thinking about a good workaround, any ideas?

Here is a screenshot of xmonad in action with some applications in it.

If you plan to install Xmonad on your computer, use version 0.10 or later because it solves some nasty problems with Java applications. In case it is not yet packaged for your favourite distribution, follow these instructions. In order to have this configuration working, just write the following 3 files:

• .xsession with the programs you need to start when the session begins (network manager, battery manager, etc…). Put it in your $HOME dir.
• xmonad.hs with the configuration of the window manager itself. Put it under $HOME/.xmonad xmonad.hs is a pure Haskell file for configuring the window manager, no XML, not another fancy configuration language. Some comments to the configuration file • lines 23-26, send Thunderbird and Skype to their dedicated workspaces • line 29, name the workspaces • lines 32-50, define new key combinations, for navigating the tiling windows, send windows to background and toggle between the tiled arrangement and focusing the whole screen into one window • lines 52-55, define how I want the windows to be arranged. Basically, create a specific configuration for Skype in its dedicated workspace, and for the rest of workspaces, don’t hide the menu, allow navigation with the cursors and minimize unwanted windows. • lines 57-77, ensemble the main xmonad window manager with the desired configuration. Spawn the status bar, and append the predefined layouts, keybindings and window hooks. Redefine some keys and fool Java setting the name of the Window Manager to LG3D in order to avoid problems with focus. import XMonad import XMonad.Hooks.DynamicLog import XMonad.Hooks.ICCCMFocus import XMonad.Hooks.ManageDocks import XMonad.Util.Run(spawnPipe) import XMonad.Util.EZConfig(additionalKeys) import System.IO import XMonad.Hooks.ManageHelpers import XMonad.Hooks.UrgencyHook import XMonad.Hooks.SetWMName import XMonad.Layout.Minimize import XMonad.Layout.WindowNavigation import XMonad.Layout.ToggleLayouts import XMonad.Layout.IM as IM import XMonad.Layout.PerWorkspace import XMonad.Layout.Reflect import XMonad.Layout.Grid import Data.Ratio ((%)) import qualified Data.Map as M -- Send applications to their dedicated Workspace myManageHook = composeAll [ className =? "Skype" --> doShift "4:skype", className =? "Thunderbird" --> doShift "2:mail" ] -- Name the workspaces myWorkspaces = ["1:dev","2:mail","3:web","4:skype","5:media", "6:office"] ++ map show [7..9] -- Add new Keys newKeys x = M.union (keys defaultConfig x) (M.fromList (myKeys x)) myKeys conf@(XConfig {XMonad.modMask = modm}) = [ -- Minimize a window ((modm, xK_z), withFocused minimizeWindow) , ((modm .|. shiftMask, xK_z), sendMessage RestoreNextMinimizedWin ) -- Window navigation with cursors , ((modm, xK_Right), sendMessage$ Go R)
    , ((modm,                 xK_Left ), sendMessage $ Go L)
    , ((modm,                 xK_Up   ), sendMessage $ Go U)
    , ((modm,                 xK_Down ), sendMessage $ Go D)
    , ((modm .|. controlMask, xK_Right), sendMessage $ Swap R)
    , ((modm .|. controlMask, xK_Left ), sendMessage $ Swap L)
    , ((modm .|. controlMask, xK_Up   ), sendMessage $ Swap U)
    , ((modm .|. controlMask, xK_Down ), sendMessage $ Swap D)
      -- Toggle Fullscreen
    , ((modm,                 xK_f    ), sendMessage ToggleLayout)
    ]

-- Define the default layout
skypeLayout = IM.withIM (1%7) (IM.And (ClassName "Skype") (Role "MainWindow")) Grid
normalLayout = windowNavigation $ minimize $ avoidStruts $
               onWorkspace "4:skype" skypeLayout $ layoutHook defaultConfig
myLayout = toggleLayouts (Full) normalLayout

-- Main executable
main = do
    xmproc <- spawnPipe "xmobar /home/cebrian/.xmobarrc"
    xmonad $ withUrgencyHook NoUrgencyHook $ defaultConfig {
          manageHook = manageDocks <+> myManageHook <+> manageHook defaultConfig
        , keys = newKeys
        , workspaces = myWorkspaces
        , layoutHook = myLayout
        , logHook = takeTopFocus >> dynamicLogWithPP xmobarPP
                        { ppOutput = hPutStrLn xmproc
                        , ppTitle = xmobarColor "green" "" . shorten 50
                        , ppUrgent = xmobarColor "yellow" "red" . xmobarStrip
                        }
        , modMask = mod4Mask     -- Rebind Mod to the Windows key
        , terminal = "terminator"
        , startupHook = setWMName "LG3D"
        } `additionalKeys`
        [ ((controlMask .|. shiftMask, xK_l), spawn "xscreensaver-command -lock")
        , ((controlMask, xK_Print), spawn "sleep 0.2; scrot -s")
        , ((0, xK_Print), spawn "scrot")
        ]

.xsession

#!/bin/bash

xrdb -merge .Xresources

# Configure xrandr for multiple monitors
# External output may be "VGA" or "VGA-0" or "DVI-0" or "TMDS-1"
EXTERNAL_OUTPUT="VGA-0"
INTERNAL_OUTPUT="LVDS"
# EXTERNAL_LOCATION may be one of: left, right, above, or below
EXTERNAL_LOCATION="left"

# In case I want to have both monitors on
case "$EXTERNAL_LOCATION" in
    left|LEFT)
        EXTERNAL_LOCATION="--left-of $INTERNAL_OUTPUT"
        ;;
    right|RIGHT)
        EXTERNAL_LOCATION="--right-of $INTERNAL_OUTPUT"
        ;;
    top|TOP|above|ABOVE)
        EXTERNAL_LOCATION="--above $INTERNAL_OUTPUT"
        ;;
    bottom|BOTTOM|below|BELOW)
        EXTERNAL_LOCATION="--below $INTERNAL_OUTPUT"
        ;;
    *)
        EXTERNAL_LOCATION="--left-of $INTERNAL_OUTPUT"
        ;;
esac

xrandr | grep $EXTERNAL_OUTPUT | grep " connected "

if [ $? -eq 0 ]; then
    xrandr --output $INTERNAL_OUTPUT --off --output $EXTERNAL_OUTPUT --auto
    # Alternative command in case of trouble:
    # (sleep 2; xrandr --output $INTERNAL_OUTPUT --auto --output $EXTERNAL_OUTPUT --auto $EXTERNAL_LOCATION) &
else
    xrandr --output $INTERNAL_OUTPUT --auto --output $EXTERNAL_OUTPUT --off
fi

trayer --edge top --align right --SetDockType true --SetPartialStrut true --expand true --width 15 --height 12 --transparent true --tint 0x000000 &

xscreensaver -no-splash &

# Allow nautilus to take care of plugged-in USB drives and Dropbox icons
nautilus --no-desktop -n &

if [ -x /usr/bin/nm-applet ] ; then
    nm-applet --sm-disable &
fi

if [ -x /usr/bin/gnome-power-manager ] ; then
    sleep 1
    gnome-power-manager &
fi

/usr/bin/gnome-volume-control-applet &
dropbox start &



.xmobarrc

Config { font = "-*-Fixed-Bold-R-Normal-*-13-*-*-*-*-*-*-*"
, bgColor = "black"
, fgColor = "grey"
, position = TopW L 85
, commands = [ Run Cpu ["-L","3","-H","50","--normal","green","--high","red"] 10
, Run Memory ["-t","Mem: <usedratio>%"] 10
, Run Swap [] 10
, Run Date "%a %b %_d %Y %H:%M:%S" "date" 10
]
, sepChar = "%"
, alignSep = "}{"
, template = "%StdinReader% }{ %cpu% | %memory% * %swap%    <fc=#ee9a00>%date%</fc>"
}


[Desktop Entry]
Type=Application
Encoding=UTF-8
NoDisplay=true
X-GNOME-Autostart-Phase=WindowManager
X-GNOME-Provides=windowmanager
X-GNOME-Autostart-Notify=true


[Desktop Entry]
Encoding=UTF-8
Comment=Lightweight tiling window manager
Type=XSession



## Maximal clique enumeration using C++0x and the STL

Lately I’ve started to take a look at how parallelism is done in a pure functional language like Haskell and its related technologies, Haskell in the Cloud and Data Parallel Haskell. As a proof of concept, and in order to better learn those techniques, I want to parallelize the Bron-Kerbosch algorithm, which returns the set of maximal cliques in a graph. The Bron-Kerbosch algorithm in pseudocode is something like this:

function bronKerbosch()
    compsub = []
    cand = V
    not = []
    cliqueEnumerate(compsub, cand, not)


and the cliqueEnumerate function in pseudo-code:

function cliqueEnumerate(compsub, cand, not)
    if cand = [] then
        if not = [] then
            Output compsub
    else
        fixp = the vertex in cand that is connected to the greatest
               number of other vertices in cand
        cur_v = fixp
        while cur_v != NULL do
            new_not = all vertices in not that are connected to cur_v
            new_cand = all vertices in cand that are connected to cur_v
            new_compsub = compsub + cur_v
            cliqueEnumerate(new_compsub, new_cand, new_not)
            not = not + cur_v
            cand = cand - cur_v
            if there is a vertex v in cand that is not connected to fixp then
                cur_v = v
            else
                cur_v = NULL


This pseudocode is from the paper, A scalable, parallel algorithm for maximal clique enumeration
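Before diving into the implementations, here is a minimal transliteration of the pseudocode to Python (my own sketch, not code from the paper or from the implementations below), useful as a sanity check on a toy graph:

# graph maps each vertex to the set of its neighbours
def clique_enumerate(graph, compsub, cand, cnot, result):
    if not cand:
        if not cnot:
            result.append(compsub)
        return
    # fixp: vertex in cand connected to the most other vertices in cand
    fixp = max(cand, key=lambda v: len(graph[v] & cand))
    cur_v = fixp
    while cur_v is not None:
        clique_enumerate(graph,
                         compsub + [cur_v],
                         cand & graph[cur_v],
                         cnot & graph[cur_v],
                         result)
        cand = cand - {cur_v}
        cnot = cnot | {cur_v}
        # next candidate not connected to fixp, if any
        cur_v = next((v for v in cand if fixp not in graph[v]), None)

graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
result = []
clique_enumerate(graph, [], set(graph), set(), result)
print(result)  # e.g. [[3, 1, 2], [3, 4]]: the maximal cliques {1,2,3} and {3,4}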

I have written a first version of this algorithm in Haskell, but as a baseline, and because I wanted to test the new features of the new C++0x standard, I’ve also written this algorithm in C++ making extensive use of the STL and lambdas. I had forgotten how verbose C++ is, but the addition of the auto keyword and lambdas alleviates it a little.

void Graph::cliqueEnumerate(const vector<int>& compsub,
                            vector<int> cand,
                            vector<int> cnot,
                            vector<vector<int> >& result) const {

    if(cand.empty()){
        if(cnot.empty()){
            // New clique found
            result.push_back(compsub);
        }
    } else{
        int fixp = findFixP(cand);
        int cur_v = fixp;

        while(cur_v != -1){
            vector<int> new_not;
            vector<int> new_cand;
            vector<int> new_compsub;

            // Auxiliary lambda that answers whether node x is
            // connected to cur_v
            auto isConected = [cur_v,this](int x) {
                const vector<int>& edges = this->getEdges(x);
                return find(edges.begin(), edges.end(), cur_v) != edges.end();
            };

            // Compose the new vectors
            // Avoid performance bottlenecks by reserving memory beforehand
            new_compsub.reserve(compsub.size()+1);
            new_not.reserve(cnot.size());
            new_cand.reserve(cand.size());
            copy_if(cnot.begin(),cnot.end(),back_inserter(new_not),isConected);
            copy_if(cand.begin(),cand.end(),back_inserter(new_cand),isConected);
            copy(compsub.begin(), compsub.end(), back_inserter(new_compsub));
            new_compsub.push_back(cur_v);

            // Recursive call
            cliqueEnumerate(new_compsub, new_cand, new_not, result);

            // Generate new cnot and cand for the loop
            cnot.push_back(cur_v);
            cand.erase(find(cand.begin(),cand.end(),cur_v));

            // Look for a vertex in cand not connected to fixp
            auto v = find_if(cand.begin(),
                             cand.end(),
                             [fixp,this](int x) {
                                 const vector<int>& edges = this->getEdges(x);
                                 return find(edges.begin(), edges.end(), fixp) == edges.end();
                             });

            // Obtain new cur_v value
            if(v != cand.end()) cur_v = *v;
            else break;
        }
    }
}


and the helper function findFixP is:

int Graph::findFixP(vector<int> cand) const {
    vector<int> connections;
    connections.resize(cand.size());

    // This is necessary for the set_intersection function
    sort(cand.begin(),cand.end());

    // Auxiliary lambda: how many vertices in cand are connected to x
    auto intersection = [&cand, this](int x) -> int {
        const vector<int>& x_edges = this->getEdges(x);
        vector<int> intersection;

        set_intersection(x_edges.begin(), x_edges.end(),
                         cand.begin(), cand.end(),
                         back_inserter(intersection));
        return intersection.size();
    };

    // Create an auxiliary vector with the intersection sizes
    transform(cand.begin(),cand.end(),connections.begin(),intersection);

    // Find the maximum size and return the corresponding vertex
    vector<int>::const_iterator it1, it2, itmax;
    int max = -1;
    itmax = cand.end();
    for(it1=cand.begin(),it2=connections.begin(); it1!=cand.end(); ++it1,++it2){
        if(max < *it2){
            max = *it2;
            itmax = it1;
        }
    }
    if(itmax == cand.end()) return -1;
    else return *itmax;
}


For this function my first attempt was to write it using the std::max_element function, but since the function we pass isn’t a transformation of the data but a comparison (less), I was worried that the set_intersection would be computed redundantly on each comparison.

There is, for sure, room for improvement (any C++ guru in the audience?), but I’m pretty satisfied with the resulting implementation. It reads almost like the pseudocode. I think this is because I wrote the Haskell version first and the C++ inherited its functional flavor. Had I written the C++ version first, there would surely be lots of nasty loops and array indexes.
