CSV parser in C++ with Boost and Template Metaprogramming

December 28th, 2011 No comments

One common situation when analyzing big chunks of data is to parse big CSV files with a record structure on each line. That is, all lines conform to a fixed schema where each line represents a record and each record has several columns of different types. Your objective is to parse and fill a data structure like this

struct foo {
    int a;
    std::string b;
    double c;
};

for each record. All the information needed for such parsing is in this data structure so I wondered whether I could write a program when you pass that type information and the parsing is done automatically for you. This is my first attempt at getting a flexible and fast CSV parser in C++. It is hosted in Github.

Design

When parsing CSV files the common usage pattern is to iterate through each line and perform a given action on each record. No need for a big CSV container or something similar, so the best approach is to write a class that acts as an iterator. When derreferencing the iterator it will return the provided data structure filled with parsed data from the current line.

An iterator is associated to a container but in this case, it will be constructed accepting an std::istream representing the CSV file. By accepting this istream we will be able to parse strings using the std::stringstream class, regular files, or compressed files using the Boost Iostream library.

The iterator must have all the common operators for it to interoperate with the STL algorithms seamlessly. The empty constructor will mark the iterator’s end-of-range position that coincides with the end of file. A typical use case will be to have some code like this:

#include <fstream>
#include <boost/tuple/tuple.hpp>
#include <csv_iterator.hpp>

using namespace boost::tuples;
typedef tuple<int,std::string,double> record;

void myFunc(const record& value){
    // Your code here
}

int main(int argc, const char *argv[])
{
    // Example of csv_iterator usage
    std::ifstream in("myCsvFile.csv");
    csv::iterator<record> it(in);

    // Use the iterator in your algorithms.
    std::for_each(it, csv::iterator<record>(), myFunc);

    return 0;
}

for parsing CSV files like this:

1,hola,3.14
2,adios,2.71

Implementation

For obtaining the values on each line, the library uses the Boost Tokenizer library. From a string you create a tokenizer that splits the input string by the character delimiter that by default is the comma. It also takes care of escaping characters. Accessing the different tokens is granted by the tokenizer by providing a token iterator.

using namespace boost;
typedef tokenizer<escaped_list_separator<char> > Tokens;
Tokens tokens(currentLine);
// Tokens can be accessed using tokens.begin() and tokens.end() iterators

Once we have the strings representing different values we have to parse and convert them into a type. Bad data format exceptions happen here and can be spotted earlier. For parsing using an unified approach the library uses the Boost Lexical Cast Library. This library provides a uniform and portable interface for doing text conversions

int myInt = boost::lexical_cast<int>("5");
double myDouble = boost::lexical_cast<double>("3.14");

For the record data structure the library uses the Boost Tuple Library that provides a Plain “old” Data Structure for storing the record fields. Moreover this class provides some template infrastructure that helps in the metaprogramming trick that follows.

Template metaprogamming

For our library the number of columns and the datatype is implicit in the record type. Our algorithm for parsing is basic, depending on the field type, parse the Nth string accordingly, and assign the result to the Nth field. Repeat for all the record fields. This parametric polymorphism must be combined with the dynamic access of the string tokenization. The former can be obtained in C++ using Template metaprogramming. The code that makes the trick is this one

template<class Tuple, int N >
    struct helper {
        static void fill(Tuple& tuple, strIt& it, strIt& end){
            using namespace boost::tuples;
            typedef typename element<length<Tuple>::value-N-1,Tuple>::type value_type;
            checkIteratorRange(it,end);
            get<length<Tuple>::value-N-1>(tuple) = boost::lexical_cast<value_type>(it->c_str());
            ++it;
            helper<Tuple,N-1>::fill(tuple,it,end);
        }
    };

template<class Tuple>
    struct helper<Tuple, 0> {
        static void fill(Tuple& tuple, strIt& it, strIt& end){
            using namespace boost::tuples;
            typedef typename boost::tuples::element<length<Tuple>::value-1,Tuple>::type value_type;
            checkIteratorRange(it,end);
            boost::tuples::get<length<Tuple>::value-1>(tuple) = boost::lexical_cast<value_type>(it->c_str());
            ++it;
        };
    };

template<class Tuple>
    struct filler {
       static void fill(Tuple& tuple, strIt&& it,strIt&& end){
            checkIteratorRange(it,end);
            helper<Tuple, boost::tuples::length<Tuple>::value-1>::fill(tuple,it,end);
        }
    };

Yes, C++ syntax sucks!! But basically what we are doing here is a common pattern of functional programming, tail recursion. We define structures that contain static functions. The filler structure just initializes the recursive call by instantiating an instance of the helper paremeterized structure setting the length recursion to the number of fields in the tuple. This structure has to be specialized for the 0 value in order for the recursion to stop. All this functionality is done behind the curtain by the compiler (compilation time increases when using template metaprogramming) but the generated code will be something very similar to this pseudocode:

boost::tuples::get<0>(tuple) = (casting_here)*it;
++it;
boost::tuples::get<1>(tuple) = (casting_here)*it;
++it;
boost::tuples::get<2>(tuple) = (casting_here)*it;
++it;

almost identical to the code we would have written by hand. The compiler is the one who knows how many fields our structure has at compile time and generates as many efficient instructions as needed.

Conclusion

I wanted to stretch my programming muscles by coding a generic, flexible and fast CSV parser in C++ using template metaprogramming. The generic and flexible parts of the task have been accomplished but not the performance objectives. Although the library is fast enough for most tasks, it isn’t in a scenario of Big Data parsing big files. The tokenizer iterator is incurring a performance penalty each time I try to derreference it, since it creates and allocates memory for a std::string each time we invoke *it. This memory allocation is a performance drain doing useless work because we need this information only for parsing and getting a value, but the string is discarded thereafter. It would be better to perform an in-place parsing using the string allocated when reading the lines of the file. To that end it will be enlightening in the future to try this same exercise with more performant string libraries like the C++ String Toolkit and see the differences in performance.

Categories: Programming Tags: , , ,

Implementing LDA

November 28th, 2011 No comments

Lately I was playing with Latent Dirichlet Allocation (LDA) for a project at work. If for whatever reason you need to implement such algorithm perhaps you will save some time reading the walkthrough I did.

First you must be sure that LDA is the algorithm you are looking for. From a corpus of documents you will get K lists with words from your documents in them with a number assigned to each word denoting the relevance of the given word in the lists. Each list represents a topic, and that would be your topic description, no fancy words like “Computers”, “Biology” or “Life meaning”, just a set of words that a human must interpret. You could always assign a single name by picking the most prominent word in the list or treating the list as a valued vector and comparing it against a canonical topic description. So take a look at the first examples in this presentation and get inspired.

OK so you need some code to test how this method behaves with your particular data. The first thing to try is the topicmodels package from the R   statistical software package. This can give you an idea of the method and try to use it in a more serious Java application by means of the Mallet library.

But say that you need to create your own implementation because Java horrifies you or because you need a parallel version or whatever the reason. The first thing you have to do is to choose the inference method of your model between variational methods or gibbs sampling. This post will give you some ideas for picking the right method for your particular problem. The original papers picked the variational approach but I went through the Gibbs sampling method because I found this paper where all the mathematical derivations are nailed down. That way I was able to fully understand the method and at the same time being sure that my implementation was right and sound.  If you need more guidance, take a look at this simple implementation for getting an idea of the main functions and data structures you’ll have to code.

Once you have your code written you will have to check whether it is correct or not. The example in this paper using pixel positions and pixel intensities instead of words and word counts is very illustrating and will show visually the correctness of your implementation. Once you have your algorithm up and running perhaps you want to scale it up to more machines, so you could benefit from reading this paper  and taking also a look at this blog post from Alex Smola and their distributed implementation of LDA on Github.

Happy coding!!!

Getting started with JAGS for Bayesian Modelling

September 13th, 2011 No comments

In the past if you wanted to do some Bayesian Modelling using a Gibbs sampler you had to rely on Winbugs but it isn’t open source and it only runs in Windows. The JAGS project started as a full feature alternative for the Linux platform. Here are some instructions for getting started

First install the required dependencies. In my Ubuntu 11.04, is something like:

sudo apt-get install gfortran liblapack-dev libltdl-dev r-base-core

For Ubuntu 11.04 you have to install JAGS from sources, but it seems that this version will be packaged in the next Ubuntu release. Download the software from here and install.

tar xvfz JAGS-3.1.0.tar.gz
cd JAGS-3.1.0
./configure
make
sudo make install

Now fire R and install the interface package rjags

$ R
...
> install.package("rjags",dependencies=TRUE)

Now let's do something interesting (although pretty simple). Let's assume we have a stream of 1s and 0s with an unknown proportion of each one. From R we can generate such distribution with the command

> points <- floor(runif(1000)+.4)

that generates a distribution with roughly 40% of 1s and 60% of 0s. So, our stream consists of a sequence of 0s and 1s generated using the uniform(phi) distribution where the phi parameter equals 0.4.

If we don't know this parameter and we try to learn it, we can assume that this parameter has prior uniform in the range [0,1] and thus the model that describes this scenario in the Winbugs language is

model
{
    phi ~ dunif(0,1);
    y ~ dbin(phi,N);
}

In this model N and y are known, so we provide this information in order to estimate our unknown parameter phi. We create the model and query the resulting parameter distribution:

> result <- list(y=sum(points), N=1000)
> result
$y
[1] 393

$N
[1] 1000

> library(rjags)
Loading required package: coda
Loading required package: lattice
linking to JAGS 3.1.0
module basemod loaded
module bugs loaded
> myModel <- jags.model("model.dat", result)
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
   Graph Size: 5

Initializing model

> update(myModel, 1000)
  |**************************************************| 100%
> x <- jags.samples(myModel, c('phi'), 1000)
  |**************************************************| 100%
> x
$phi
mcarray:
[1] 0.3932681

Marginalizing over: iteration(1000),chain(1) 

>

So the inferred value of phi is 0.3932. One interesting thing in Bayesian statistics is that it does not estimate points, but probabilistic distributions over the parameters. We can see how the phi parameter was estimated by examining the Monte Carlo Chain and the distribution of the generated values during the simulation

> chain <- as.mcmc.list(x$phi)
> plot(chain)


Where we can see that the values for phi in the chain were centered around the 0.4, the true parameter value.

Categories: Data Mining Tags: , , ,

My working environment with Xmonad

September 5th, 2011 3 comments

Fire a terminal, fire another terminal and tail some logs, open your browser and point to the web page you are developing, open your IDE and open three or four tabs with the code you suspect is causing the bug, put some breakpoints, alt-tab to the first terminal start your system under test, connect your debugging IDE to your system, perform some operations to your browser, catch the breakpoint, switch back and forth the code tabs, etc… This daily routine common to most developers, involves grabbing the mouse, arranging some windows, and switching context continuously. This creates a cognitive overload and a lack of productivity because developers are doing tasks not directly related to the task at hand.

This is the reason I don’t use Gnome any more and I’ve switched to Xmonad a tiling window manager that can be controlled almost exclusively with the keyboard. With this fully configurable window manager, I can move, resize, minimize, arrange, customize all the visible windows, move windows between workspaces, all with my hands not leaving the keyboard.The only thing I have not been able to accomplish is having the UrgentHook working for Skype. The Linux version of Skype fails to set the WM_URGENT X11 event when a new chat opens, and if I’m not in that workspace I don’t get any notification besides the bell. Still thinking about a good workaround, any ideas?

Here is a screenshot of xmonad in action with  some applications in it,

If you plan to install Xmonad in your computer, use version 0.10 or superior because it solves some nasty problems with Java applications. In case it is not yet ready for your favourite distribution, follow these instructions. In order to have this configuration working, just write the following 3 files:

  • .xsession with the programs you need to start when the session begins (network manager, batery manager, etc…). Put it in your $HOME dir.
  • .xmobarrc with the configuration of your handy textual status bar . Put it in your $HOME
  • xmonad.hs with the configuration of the window manager itself. Put it under $HOME/.xmonad

xmonad.hs is a pure Haskell file for configuring the window manager, no XML, not another fancy configuration language. Some comments to the configuration file

  • lines 23-26, send Thunderbird and Skype to their dedicated workspaces
  • line 29, name the workspaces
  • lines 32-50, define new key combinations, for navigating the tiling windows, send windows to background and toggle between the tiled arrangement and focusing the whole screen into one window
  • lines 52-55, define how I want the windows to be arranged. Basically, create a specific configuration for Skype in its dedicated workspace, and for the rest of workspaces, don’t hide the menu, allow navigation with the cursors and minimize unwanted windows.
  • lines 57-77, ensemble the main xmonad window manager with the desired configuration. Spawn the status bar, and append the predefined layouts, keybindings and window hooks. Redefine some keys and fool Java setting the name of the Window Manager to LG3D in order to avoid problems with focus.
import XMonad
import XMonad.Hooks.DynamicLog
import XMonad.Hooks.ICCCMFocus
import XMonad.Hooks.ManageDocks
import XMonad.Util.Run(spawnPipe)
import XMonad.Util.EZConfig(additionalKeys)
import System.IO
import XMonad.Hooks.ManageHelpers
import XMonad.Hooks.UrgencyHook
import XMonad.Hooks.SetWMName
import XMonad.Layout.Minimize
import XMonad.Layout.WindowNavigation
import XMonad.Layout.ToggleLayouts
import XMonad.Layout.IM as IM
import XMonad.Layout.PerWorkspace
import XMonad.Layout.Reflect
import XMonad.Layout.Grid
import Data.Ratio ((%))

import qualified Data.Map as M

-- Send applications to their dedicated Workspace
myManageHook = composeAll
                [ className =? "Skype"         --> doShift "4:skype",
                  className =? "Thunderbird"   --> doShift "2:mail"
                ]

-- Name the workspaces
myWorkspaces = ["1:dev","2:mail","3:web","4:skype","5:media", "6:office"] ++ map show [7..9]

-- Add new Keys
newKeys x = M.union (keys defaultConfig x) (M.fromList (myKeys x))

myKeys conf@(XConfig {XMonad.modMask = modm}) =
              [
              -- Minimize a window
                ((modm, xK_z),               withFocused minimizeWindow)
              , ((modm .|. shiftMask, xK_z), sendMessage RestoreNextMinimizedWin  )
              -- Window navigation with cursors
              , ((modm,                 xK_Right), sendMessage $ Go R)
              , ((modm,                 xK_Left ), sendMessage $ Go L)
              , ((modm,                 xK_Up   ), sendMessage $ Go U)
              , ((modm,                 xK_Down ), sendMessage $ Go D)
              , ((modm .|. controlMask, xK_Right), sendMessage $ Swap R)
              , ((modm .|. controlMask, xK_Left ), sendMessage $ Swap L)
              , ((modm .|. controlMask, xK_Up   ), sendMessage $ Swap U)
              , ((modm .|. controlMask, xK_Down ), sendMessage $ Swap D)
              -- Togle Fullscreen
              , ((modm,                 xK_f    ), sendMessage ToggleLayout)
              ]

-- Define the default layout
skypeLayout = IM.withIM (1%7) (IM.And (ClassName "Skype")  (Role "MainWindow")) Grid
normalLayout = windowNavigation $ minimize $ avoidStruts $ onWorkspace "4:skype" skypeLayout $ layoutHook defaultConfig
myLayout = toggleLayouts (Full) normalLayout

-- Main executable
main = do
    xmproc <- spawnPipe "xmobar /home/cebrian/.xmobarrc"
    xmonad $ withUrgencyHook NoUrgencyHook $ defaultConfig
        { manageHook = manageDocks <+> myManageHook <+> manageHook defaultConfig
        , keys = newKeys
        , workspaces = myWorkspaces
        , layoutHook = myLayout
        , logHook = takeTopFocus >> dynamicLogWithPP xmobarPP
                        { ppOutput = hPutStrLn xmproc
                        , ppTitle = xmobarColor "green" "" . shorten 50
                        , ppUrgent = xmobarColor "yellow" "red" . xmobarStrip
                        }
        , modMask = mod4Mask     -- Rebind Mod to the Windows key
        , terminal = "terminator"
        , startupHook = setWMName "LG3D"
        } `additionalKeys`
        [ ((controlMask .|. shiftMask, xK_l), spawn "xscreensaver-command -lock")
        , ((controlMask, xK_Print), spawn "sleep 0.2; scrot -s")
        , ((0, xK_Print), spawn "scrot")
        ]

.xsession

#!/bin/bash

xrdb -merge .Xresources

# Configure xrandr for multiple monitors
# External output may be "VGA" or "VGA-0" or "DVI-0" or "TMDS-1"
EXTERNAL_OUTPUT="VGA-0"
INTERNAL_OUTPUT="LVDS"
# EXTERNAL_LOCATION may be one of: left, right, above, or below
EXTERNAL_LOCATION="left"

# In case I want to have both monitors on
case "$EXTERNAL_LOCATION" in
       left|LEFT)
               EXTERNAL_LOCATION="--left-of $INTERNAL_OUTPUT"
               ;;
       right|RIGHT)
               EXTERNAL_LOCATION="--right-of $INTERNAL_OUTPUT"
               ;;
       top|TOP|above|ABOVE)
               EXTERNAL_LOCATION="--above $INTERNAL_OUTPUT"
               ;;
       bottom|BOTTOM|below|BELOW)
               EXTERNAL_LOCATION="--below $INTERNAL_OUTPUT"
               ;;
       *)
               EXTERNAL_LOCATION="--left-of $INTERNAL_OUTPUT"
               ;;
esac
xrandr | grep $EXTERNAL_OUTPUT | grep " connected "

if [ $? -eq 0 ]; then
    xrandr --output $INTERNAL_OUTPUT --off --output $EXTERNAL_OUTPUT --auto
    # Alternative command in case of trouble:
    # (sleep 2; xrandr --output $INTERNAL_OUTPUT --auto --output $EXTERNAL_OUTPUT --auto $EXTERNAL_LOCATION) &
else
    xrandr --output $INTERNAL_OUTPUT --auto --output $EXTERNAL_OUTPUT --off
fi

trayer --edge top --align right --SetDockType true --SetPartialStrut true --expand true --width 15 --height 12 --transparent true --tint 0x000000 &

xscreensaver -no-splash &

# Allow nautilus to take care of plugin USB drives and Dropbox icons
nautilus --no-desktop -n &

if [ -x /usr/bin/nm-applet ] ; then
   nm-applet --sm-disable &
fi

if [ -x /usr/bin/gnome-power-manager ] ; then
   sleep 1
   gnome-power-manager &
fi

/usr/bin/gnome-volume-control-applet &
dropbox start &

exec /home/cebrian/.cabal/bin/xmonad

.xmobarrc

Config { font = "-*-Fixed-Bold-R-Normal-*-13-*-*-*-*-*-*-*"
       , bgColor = "black"
       , fgColor = "grey"
       , position = TopW L 85
       , commands = [ Run Cpu ["-L","3","-H","50","--normal","green","--high","red"] 10
                    , Run Memory ["-t","Mem: <usedratio>%"] 10
                    , Run Swap [] 10
                    , Run Date "%a %b %_d %Y %H:%M:%S" "date" 10
                    , Run StdinReader
                    ]
       , sepChar = "%"
       , alignSep = "}{"
       , template = "%StdinReader% }{ %cpu% | %memory% * %swap%    <fc=#ee9a00>%date%</fc>"
       }

Categories: Uncategorized Tags: , , ,

Maximal clique enumeration using C++0x and the STL

August 29th, 2011 No comments

Lately I’ve started to take a look at how parallelism is being done in a pure functional language like Haskell and their related technologies Haskell in the Cloud and Data Parallel Haskell. As a proof of concept and in order to better learn those techniques, I want to parallelize the Bron-Kerbosch algorithm that returns the set of maximal cliques in a graph.The Bron-Kerbosch algorithm in pseudocode is something like this:

function bronKerbosch()
compsub = []
cand = V
not = []
cliqueEnumerate(compsub, cand, not)

and the cliqueEnumerate function in pseudo-code:

function cliqueEnumerate(compsub, cand, not)
if cand = [] then
    if not = [] then
        Output compsub
else
    fixp = The vertex in cand that is connected to the greatest number of other vertices in cand
    cur_v = fixp
    while cur_v != NULL do
        new_not = All vertices in not that are connected to cur_v
        new_cand = All vertices in cand that are connected to cur_v
        new_compsub = compsub + cur_v
        cliqueEnumerate(new_compsub, new_cand, new_not)
        not = not + cur_v
        cand = cand - cur_v
        if there is a vertex v in cand that is not connected to fixp then
           cur_v = v
        else
           cur_v = NULL

This pseudocode is from the paper, A scalable, parallel algorithm for maximal clique enumeration


I have written a first version of this algorithm in Haskell but, as a baseline and because I wanted to test the new features of the new C++0x standard, I’ve written this algorithm in C++ making extensive use of the STL and the lambdas. I forgot how verbose C++ is but the addition of the keyword auto and the lambdas has somehow alleviated it a little.

void Graph::cliqueEnumerate(const vector<int>& compsub,
                     vector<int> cand,
                     vector<int> cnot,
                     vector<vector<int> >& result) const {

    // Function that answer whether the node is conected
    if(cand.empty()){
        if(cnot.empty()){
            // New clique found
            result.push_back(compsub);
        }
    } else{
        int fixp = findFixP(cand);
        int cur_v = fixp;

        while(cur_v != -1){
            vector<int> new_not;
            vector<int> new_cand;
            vector<int> new_compsub;

            // Auxiliar lambda function
            auto isConected =[cur_v,this](int x) {
                const vector<int>& edges = this->getEdges(x);
                return find(edges.begin(), edges.end(), cur_v) != edges.end();
            }; 

            // Compose new vector
            // Avoid performance bottlenecks by reserving memory before hand
            new_compsub.reserve(compsub.size()+1);
            new_not.reserve(cnot.size());
            new_cand.reserve(cand.size());
            copy_if(cnot.begin(),cnot.end(),back_inserter(new_not),isConected);
            copy_if(cand.begin(),cand.end(),back_inserter(new_cand),isConected);
            copy(compsub.begin(), compsub.end(), back_inserter(new_compsub));
            new_compsub.push_back(cur_v);

            // Recursive call
            cliqueEnumerate(new_compsub, new_cand, new_not, result);

            // Generate new cnot and cand for the loop
            cnot.push_back(cur_v);
            cand.erase(find(cand.begin(),cand.end(),cur_v));

            // Last check
            auto v = find_if(cand.begin(),
                             cand.end(),
                             [fixp,this](int x) {
                                const vector<int>& edges = this->getEdges(x);
                                return find(edges.begin(), edges.end(), fixp) == edges.end();
                             });

            // Obtain new cur_v value
            if(v != cand.end()) cur_v = *v;
            else break;
        }
    }
}

and the helper function findFixP is:

int Graph::findFixP(vector<int> cand) const {
    vector<int> connections;
    connections.resize(cand.size());

    // This is necessary for the set_intersection function
    sort(cand.begin(),cand.end());

    // Auxiliar lambda function
    auto intersection = [&cand, this](int x) -> int {
        const vector<int>& x_edges = this->getEdges(x);
        vector<int> intersection;

        set_intersection(x_edges.begin(), x_edges.end(),
                         cand.begin(), cand.end(),
                         back_inserter(intersection));
        return intersection.size();
    };

    // Create an auxiliar vector with the intersection sizes
    transform(cand.begin(),cand.end(),connections.begin(),intersection);

    // Find the maximum size and return the corresponding edge
    vector<int>::const_iterator it1, it2,itmax;
    int max = -1;
    itmax = cand.end();
    for(it1=cand.begin(),it2=connections.begin(); it1!=cand.end(); ++it1,++it2){
        if(max < *it2){
            max = *it2;
            itmax = it1;
        }
    }
    if(itmax == cand.end()) return -1;
    else return *itmax;
    }
}

For this function my first attempt was to write it using the std::max_element function, but I was worried that since the function we pass isn’t a transformation of the data but a comparison function (less), I was worried that on each comparison the set_intersection would be computed redundantly on each step.

There can be, for sure, room for improvement (any C++ guru in the audience?), but I’m pretty satisfied with the obtained implementation. It reads almost as the pseudocode. I think this is because I first wrote the Haskell version and the C++ has the functional flavor in it. Would I write first the C++ version and there would be for sure lots of nasty loops and array indexes.

Categories: Programming Tags: , ,

Comparison of IDEs for Scala development

May 16th, 2011 1 comment

Although all my coding is done using Vim for editing and console tools for compiling, IDEs are the best alternative for other tasks like debugging or refactoring code. Since I’m exploring the Scala ecosystem I’ve done some research about which IDE best suits my coding requirements. Those are:

  • Unobtrusive SBT integration. As I said, I will use the IDE for debugging from time to time, and thus it should not interfere with the command line + Vim ecosystem.
  • The IDE must mix Java and Scala code in the project without any problem
  • Enable debugging of Scala applications
  • Provide refactoring of Scala code

The IDEs under test will be Eclipse, Netbeans and IntelliJ

Eclipse Helios (3.6.2)

After downloading Eclipse, the first thing to do is to install the Scala plugin and the SBT plugin. The Scala plugin is installed into the IDE itself but the SBT plugin has to be added into the SBT plugins section in project/plugins/YourProjectPlugins.scala. Putting IDE information in my project’s build instructions is something I don’t like at all. The plugin needs that you add the trait Eclipsify to your project definition, so this goes against the “Unobtrusive SBT integration”. In summary, the installation needs that you put

 lazy val sbtEclipsify = "de.element34" % "sbt-eclipsify" % "0.4.0"

in your project/plugins/YourProjectPlugins.scala file and to add the import and mix the trait in your project/build/TestProject.scala file

 import sbt._
import de.element34.sbteclipsify._

class TestProject(info: ProjectInfo) extends DefaultProject(info)
                                     with Eclipsify
{
...
}

In order to create the Eclipse project files, just fire sbt, and perform

sbt
> update
> eclipse

Now we can open the project and it looks like:

Thinks that I miss:

  • No console integration with SBT
  • Test are not recognized as such and thus, if you are debugging your unit tests you are out of luck.
  • Refactoring capabilities are very much limited (I was not able to find the simplest one, rename a class)

Netbeans (7.0)

The first thing is to install the Scala support for Netbeans grabbing the plugin from here. The SBT plugin for Netbeans includes a processor that allows you to create project skeletons within the sbt console. But the general functioning of the plugin is similar to the Eclipse one, you have to modify your project definition for running it. Follow instructions in Running the plugin.

The project looks like this in Netbeans.

Issues with Netbeans:

  • Netbeans uses internally SBT for all the compilation tasks, so this is a plus.
  • Although you can run SBT commands with the console, they are detached from the whole IDE. For instance, you can run the test phase but the IDE won’t stop in the breakpoints.
  • And the other way round does not work either. If you Right-click in a test, Netbeans complains that some classes can not be found.
  • Refactoring is full supported and it can perform usual code modifications. Great!!

IntelliJIDEA (10.0.3 Community Edition)

Again, first install the IDEA Scala plugin (can be downloaded going to File->Settings->Plugins). The SBT plugin that generates an IDEA project from our SBT project is better than the previous ones. Although it can be used as a normal SBT plugin in case all our team were using the IntelliJ IDE, its main advantage is that it can be used as a SBT processor. SBT processors are complementary to SBT plugins, but they are invoked on a per user basis instead of a per project like plugins do. They are thus unobtrusive and leaves no IDE traces in our project definition. Just write


sbt

> *sbtIdeaRepo at http://mpeltonen.github.com/maven/
> *idea is com.github.mpeltonen sbt-idea-processor 0.4.0

> update

> idea

Now we have our project files in the directory. Screenshots of the IntelliJ code editing panels and debugging capabilities.

So in general IntelliJ has all what I was looking for

  • Unobtrusive SBT support by means of an SBT processor
  • Coding and refactoring capabilities for the Scala and Java language
  • Debugging of full Scala projects or just one unit test
  • Full SBT console integration in the IDE.

Conclusion

So, the winner is clearly the IntelliJ IDE and a second place for Netbeans. I’ve never liked Eclipse and this time it was no exception.

Just one caveat. As I said I will use the IDE just a 10 percent of my time and thus this research has not been exhaustive. I fired up the three IDEs, followed basic instructions for Scala and SBT support and when things did not work as expected I’ve done a minimal work trying to fix it. It does not pay for me. So, If you think that I was wrong in one of the IDEs or that just a little tweak could recover one of the evaluations, I will be glad hearing your comments.

Categories: Programming Tags: , , , , ,

Simple Scala project step-by-step

May 2nd, 2011 7 comments

Updated!! To Scala 2.9.0 (August 24th 2011)

Recently I have started a project that will be coded in Scala. Here I leave some instructions for setting up the the scala project that uses sbt for project management and creates unit tests with scalatests. This entry follows the structure of this post but introduces some changes and improvements.

  1. Install sbt
  2. Create a new project
    $ mkdir test-project
    $ cd test-project/
    $ sbt
    Project does not exist, create new project? (y/N/s) y
    Name: TestProject
    Organization: com.example
    Version [1.0]:
    Scala version [2.7.7]: 2.9.0
    sbt version [0.7.4]: 0.7.7
    
  3. First we need to install a sbt plugin that collects test results and creates a xUnit report for using in a Continuous Integration Server like Jenkins. Details of that plugin can be found here. Create the file project/plugins/TestProjectPlugins.scala that instructs sbt to download the junit_xml_listener plugin.
    import sbt._
    
    class TestProjectPlugins(info: ProjectInfo) extends PluginDefinition(info) {
        val repo = "Christoph's Maven Repo" at "http://maven.henkelmann.eu/"
        val junitXml = "eu.henkelmann" % "junit_xml_listener" % "0.2"
    }
    
  4. Once we have the configuration for getting the plugin, create the project dependencies and include the new listener into the available ones by overriding the defaults. So, create a build configuration file in project/build/TestProject.scala like:
    import sbt._
    import eu.henkelmann.sbt.JUnitXmlTestsListener
    
    class TestProject(info: ProjectInfo) extends DefaultProject(info) {
       val scalatest = "org.scalatest" %% "scalatest" % "1.6.1" % "test"
    
       //create a listener that writes to the normal output directory
       def junitXmlListener: TestReportListener = new JUnitXmlTestsListener(outputPath.toString)
       //add the new listener to the already configured ones
       override def testListeners: Seq[TestReportListener] = super.testListeners ++ Seq(junitXmlListener)
    }
    
  5. Now, we have the project setup in place. Let’s write some classes and tests in order to check that everything is OK. First create the class under test in src/main/scala/Calc.scala
    package com.example
    
    object Calc {
        def add(x:Int, y:Int) : Int = x + y
    }
    
  6. Now, create a test in src/test/scala/CaltTest.scala that assures that the class is working as expected
    package com.example.test
    
    import com.example.Calc
    import org.scalatest.Suite
    
    class CalcTest extends Suite {
    
        def testAddition() {
            assert(4 === Calc.add(2,2))
        }
    }
    
  7. Now just test that tests pass by doing
    sbt clean-plugins
    sbt clean
    sbt update
    sbt test
    

    you should see something like this:

    $ sbt test
    [info] Building project test 1.0 against Scala 2.9.0
    [info]    using TestProject with sbt 0.7.7 and Scala 2.7.7
    [info]
    [info] == compile ==
    [info]   Source analysis: 0 new/modified, 0 indirectly invalidated, 0 removed.
    [info] Compiling main sources...
    [info] Nothing to compile.
    [info]   Post-analysis: 3 classes.
    [info] == compile ==
    [info]
    [info] == test-compile ==
    [info]   Source analysis: 0 new/modified, 0 indirectly invalidated, 0 removed.
    [info] Compiling test sources...
    [info] Nothing to compile.
    [info]   Post-analysis: 1 classes.
    [info] == test-compile ==
    [info]
    [info] == copy-test-resources ==
    [info] == copy-test-resources ==
    [info]
    [info] == copy-resources ==
    [info] == copy-resources ==
    [info]
    [info] == test-start ==
    [info] == test-start ==
    [info]
    [info] == com.example.test.CalcTest ==
    [info] CalcTest:
    [info] - testAddition
    [info] - testAdditionWithJavaObject
    [info] == com.example.test.CalcTest ==
    [info]
    [info] == test-complete ==
    [info] == test-complete ==
    [info]
    [info] == Test cleanup 1 ==
    [info] Deleting directory /tmp/sbt_b7e36ab4
    [info] == Test cleanup 1 ==
    [info]
    [info] == test-finish ==
    [info] Passed: : Total 2, Failed 0, Errors 0, Passed 2, Skipped 0
    [info]
    [info] All tests PASSED.
    [info] == test-finish ==
    [info]
    [info] == test-cleanup ==
    [info] == test-cleanup ==
    [info]
    [info] == test ==
    [info] == test ==
    [success] Successful.
    [info]
    [info] Total time: 1 s, completed 24-ago-2011 15:40:26
    [info]
    [info] Total session time: 2 s, completed 24-ago-2011 15:40:26
    [success] Build completed successfully.
    

The test reports are located in target/scala_*/test-reports/*.xml. I have setup a Mercurial project with all this code for fast start up, check it out here.

Categories: Programming Tags: , , ,

People I will closely follow in 2011

November 29th, 2010 No comments

No one works in isolation. Everybody is looking for inspiration, ideas or exemplifying careers in his field. Specially in the technological field, we stand on the shoulders of giants. Those who follow are my giants for the next year, they are programmers, researchers or professors and they are working on one field of interest to me. Here they are:

Daniel Lemire

Daniel Lemire is a Canadian university professor working in the field of Recommender Systems. I first heard about him when researching high performance recommender systems I stumbled upon his simple yet effective Slope One recommender algorithm. However, lately he has been actively discussing in his blog about the inner workings of the scientific community, the peer review process and the validity of the University model as the best way to disseminate scientific knowledge in an age with zero costs for accessing information. I find those posts very stimulating.

David MacKay

David MacKay is a professor at the Inference Group in the University of Cambridge. His book about Information Theory is one of the best written ones I have read. It exposed me to Gaussian Processes in a clear and understandable way for the first time. But not only for his research quality, David is remarkable because is a professor that does not live in his ivory tower. He is able to apply his research to help the impaired with the Dasher project. Data mining and information theory for the greater good. And as an extra bonus his interest on tackling the global warming from a scientific point of view is very inspiring and necessary in an age when corporations dictate agenda. And all his publications freely accessible to everyone.

Michael I. Jordan

Michael I. Jordan is a difficult search term in the Internet :) . He is a professor in Berkeley working on machine learning and one of the first proponents of Bayesian Networks as learning models. Although I don’t know his work very well, a friend, whose opinion I value a lot, recommended him as the Midas of research. Everything he publishes becomes a hit after a couple of years. So a sure bet to ride.

Bradford Cross

Bradford Cross is the only non academic in this list. He is co-founder and head of research of FlightCaster, a small startup that predicts flight delays using Statistical Learning techniques. He presents the right balance between engineering and research that appeals to me the most. He can write posts about using Clojure for Mining the web or using it for managing Hadoop and Cascade instances or write excellent posts about the best way to self-learn Statistical Learning Theory.

Yann LeCun

Yann LeCun and Geoffrey Hinton are two researchers that started in the early 90′s doing work on Neural Networks.  If you have taken any course on Neural Networks you probably have seen mentioned the 10 digit character recognition dataset and how NN are used for learning it. This dataset was compiled by LeCun. In this decade, they have realized the limitations of the Neural Network architectures and they are now working in a new area of Machine Learning called Deep Learning. In deep learning architectures, the learning agents are chained in layers that work with low level data to provide information to upper layers, which in time are using this information as low level input. The ambitious goal is to generate true Artificial Intelligence, so worth checking if they succeed or not.

Jürgen Schmidhuber

Marcus Hutter

Jürgen Schmidhuber and Marcus Hutter. Hutter was working under the supervision of the Schmidhuber in a Theory of Universal Learning Machines that tries to unify Algorithmic Learning Theory with Reinforcement Learning. As the name points, this theory aims at building a universal problem solver and thus true AI. Marcus’ book has been sitting in my reading pile for a long time. It is a difficult read because the book rapidly moves into layers of abstraction when it is supposed to be explaining learning algorithms, but anyway I feel a strange attraction to this book and I will surely read it in the future.

So this is my Hall of Fame for 2011. Lots of reading and learning.

Categories: Uncategorized Tags: ,

Why I am learning Haskell

November 5th, 2010 2 comments

I can not remember exactly when and why I heard of Haskell in the first time. What I can recall vividly is the moment when I decided that I was committed to learn the language. It happened when reading the article “Beating the averages” from Paul Graham. In this article, Paul Graham describes how he was coding in Lisp the software of his startup, an online store called ViaWeb that later was acquired by Yahoo and became Yahoo Store. He says that coding in Lisp was their main advantage against possible competitors. He even jokes about not worrying of competitors that where hiring Oracle or Java specialists, he knew they were doomed from then on. This article resonated inside me when I read it because at that time I was tinkering with the idea of starting a Micro-ISV as a side project. Haskell was going to be for me what Lisp was for Paul Graham, the high level language able to increase the productivity of a solo programmer by several folds.

On the other hand, learning Haskell is an stimulating intellectual task. Pure functional programming is something worth learning by itself. It will make you a better programmer even if you never happen to write a single functional line of code. Your imperative style will also improve. Like quiting smoking, Haskell is very easy to learn, I’ve learned it a lot of times. I started with a Spanish book, Razonando con Haskell, that makes a very good job introducing the language but that fails blatantly from a pedagogical point of view when you get into monads and the design of programs. But then, Real World Haskell came out and IMHO it has helped to push forward Haskell into a real mainstream language. If you are planning to learn the language don’t forget to read also Learn You a Haskell for Great Good! a high quality freely available Haskell book.

More recently, and related to my current work, Haskell seems the perfect tool for writing parallel programs. Haskell is recommended by data scientists for doing real programming. Due to its pure functional nature, bugs are easier to spot in Haskell because the language design difficults introducing side effects that are so common in other languages. Haskell also supports esoteric concurrent constructs you haven’t seen in other programming languages like Software Transactional Memory. But above all, pure code can be easily manipulated and reasoned about by other programs. That means that you, as Joe the programmer, can rely on the compiler for doing automatic parallelization. You can easily parallelize already functioning algorithms by just doing some code annotations, the compiler will do all the spawning and synchronization under the hood.

In this vein, there is an under explored area that I’d like to learn more about, Nested Data Parallelism. Unless you have been living in a cave during the past years, the MapReduce programming paradigm and its reference implementation Hadoop have taken the parallel world by storm. This new paradigm allows to process data volumes that were unthinkable some years ago. The MapReduce paradigm represents a pattern for parallelization called Flat Data Parallelism. This means that in these problems, data is more important than computation in contrast to usual scientific calculations where operations are the dominant factor. The flat word means that the problems that are best suited for this paradigm are those in which the data involved can easily be bucketed on equal sized bins and processed in parallel. MapReduce is the best tool for performing ETL in a data warehouse, indexing the web like Google does, log mining, or image transformations. Huge amounts of independent data, processed on parallel tasks. But what happens when you have irregular data like trees or graphs? You can not partition the data into equally sized bins because you don’t know the size of each bin. For instance, imagine that you want to multiply a big sparse matrix with a vector (an operation that happens a lot in Recommender Systems using Collaborative Filtering). Your first thought will be to split equally the rows in the matrix and send each bin to a thread or computing node, but because of the sparsity of the matrix, some rows are short while others are longer. That means that some bins take longer than others to compute introducing delays when the main program has to wait to gather the results. Data Parallel Haskell is an attempt to implement Nested Data Parallelism in the Haskell language itself. Simont Peyton Jones and other Haskell gurus are working hard to explore this new kind of parallelism using the facilities of the language.

Categories: Programming Tags:

Continuous Integration for C++ using Hudson

October 15th, 2010 4 comments

This post is about how to setup Hudson for continuous integration (CI) of a C++ project. This is mainly for my own (future) reference and I expect that it will be continuously improved (pun intended) in the future.

Preparing the machine that will host the CI

Hudson will be hosted in a virtualized Linux server. The OS will be Ubuntu 10.04 because is the distribution I am most familiar with and because it is a long term support (LTS) version of the distribution. It is expected that this version will be stable enough and that bugs will be addressed and resolved quickly. There will be no physical machine where to install Ubuntu. I will install it on a Virtual Machine created using the VirtualBox virtualization software. Setting up a virtual machine using VirtualBox is beyond the scope of this post, I’ll assume that you have successfully installed Ubuntu and that it is up and running. If you are going to start this task right now just make sure you reserve 1G of memory for the virtual machine since the default 512M seems that is not enough for running the Tomcat server.

At this point you should have a newly installed Ubuntu 10.04 up and running. The next step is to install all the necessary software. It involves:

  • Sun’s Java SDK, to run the Tomcat server
  • Tomcat 6, it hosts Hudson
  • Mercurial, the example project is hosted on Bitbucket.org but you can use whichever
  • CMake, the multiplatform build system used in the example
  • CppUnit, testing library for those beautiful unit tests
  • CCCC, application to calculate cyclomatic complexity of our code
  • lcov, to produce nice code coverage reports of our unit tests
  • valgrind, that checks memory leaks in our code.

First append the repository where the Java SDK can be found

sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
sudo apt-get update

and now, install the packages

sudo apt-get install tomcat6 tomcat6-admin sun-java6-jdk mercurial cmake libcppunit-dev build-essential cccc lcov valgrind

Installing and configuring Tomcat and Hudson

Just some standard Tomcat configuration here. Change tomcat’s user password to any you can remember.

sudo passwd tomcat6

Download Hudson

cd /tmp
sudo -u tomcat6 wget http://hudson-ci.org/latest/hudson.war

Edit the file with tomcat’s users and create an administrator.

sudo vim /etc/tomcat6/tomcat-users.xml
<tomcat-users>
  <role rolename="tomcat"/>
  <role rolename="role1"/>
  <user username="tomcat" password="tomcat" roles="tomcat"/>
  <user username="both" password="tomcat" roles="tomcat,role1"/>
  <user username="role1" password="tomcat" roles="role1"/>
  <role rolename="manager"/>
  <user username="admin" password="admin1" roles="manager"/>
</tomcat-users>

And now, start the tomcat server

sudo chown -R tomcat6:tomcat6 /usr/share/tomcat6
sudo /etc/init.d/tomcat6 start

And now, to deploy Hudson, open a browser and go to this URL http://localhost:8080/manager/html. Use the admin user you previously created and in the dashboard you are seeing now, scroll down to the War file to deploy and select the hudson war that you downloaded in the /tmp directory. Deploy and if everything goes OK, Hudson will be installed and you can access it in this URL http://localhost:8080/hudson
Just for the curious of you, all the intermediate files and directories that Hudson uses go into /usr/share/tomcat6/.hudson

Now, you have Hudson up and ready, go to http://localhost:8080/hudson and navigate to the Hudson plugins page (Manage Hudson -> Manage Plugins -> Available) and select and install the following plugins:

  • Mercurial plugin
  • CppUnit plugin
  • CCCC plugin
  • HTML publisher plugin
  • Python Plugin

Install them and let Hudson to restart itself.

Configure the project

Time to monitor an existing project for continuous integration. Here, we will use a sample project described in a previous post. So, go to http://localhost:8080/hudson
and click on New Job

  1. Select a name for your project (Calc) and chose the option Build a free-style software project
  2. Put whatever description you want. Check also the Discard old builds, CI is about monitoring your actual project health not about having a build historical. I usually put 50 in the Max # of builds to keep
  3. In the Source Code Management section, select Mercurial as your VCS and put http://bitbucket.org/ancechu/cmake_sample_project as the Repository URL in the branch put, default.
  4. In the Build Triggers section, I like almost instant reaction to a build break. Hudson continuously polls the Mercurial server for changes, and when one happens it checkouts and builds the entire project. Select Poll SCM and in the Schedule write * * * * * . It means poll every single minute the repository.
  5. Ok, so here are the specific steps to build your project:
    1. First Add Build Step and select Execute shell script. The commands are:
      mkdir -p Release
      cd Release
      cmake -DCMAKE_BUILD_TYPE=Release ..
      make
      

      This creates the releasable version of our example calculator.

    2. Add another Shell Build Script step that is intended to create the unit test coverage reports. In order to get meaningful reports, the executables are compiled with debugging info included. lcov is used to create meaningful reports. Check that there is one line that removes coverage of uninteresting sources (those under /usr/include)
      # Run and execute tests with coverage info
      mkdir -p Debug
      mkdir -p coverage
      cd Debug
      cmake -DCMAKE_BUILD_TYPE=Debug -DCOVERAGE=True ..
      make
      cd ..
      lcov -d Debug/CMakeFiles/calc.dir/src/main/cpp -d Debug/CMakeFiles/UnitTester.dir/src/test/cpp --zerocounters
      Debug/UnitTester
      lcov -d Debug/CMakeFiles/calc.dir/src/main/cpp -d Debug/CMakeFiles/UnitTester.dir/src/test/cpp --capture --output-file coverage/calc.info
      lcov -r coverage/calc.info "/usr/include/*" -o coverage/calc.info
      genhtml -o coverage coverage/calc.info
    3. Again, another Execute shell step for creating C++ metrics
      cccc `find src -name *.cpp`
    4. This step is a Execute shell step with the goal of creating a report about the possible memory leaks when executing our tests
      valgrind --xml=yes --xml-file=valgrind.xml Release/UnitTester
    5. The last step is the execution of a Python script that converts the XML output from valgrind to XML output that resembles the xUnit output schema. The idea is to treat memory leaks or other memory related problems as failed unit tests. This is a quick fix for integrating Valgrind in Hudson, if you know a better way please let me know. Again, Add Build Step but this time Python script
      import xml.etree.ElementTree as ET
      doc = ET.parse('valgrind.xml')
      errors = doc.findall('//error')
      
      out = open("valgrindTestResults.xml","w")
      out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
      out.write('<testsuite name="memcheck" tests="1" errors="0" failures="'+str(len(errors))+'" skip="0">\n')
      out.write('    <testcase classname="ValgrindMemoryCheck " \n')
      out.write('              name="Memory check" time="0">\n')
      for error in errors:
      	kind = error.find('kind')
      	what = error.find('what')
      	if  what == None:
      		what = error.find('xwhat/text')
      
      	out.write('        <error type="'+kind.text+'">\n')
      	out.write('            '+what.text+'\n')
      	out.write('        </error>\n')
      out.write('    </testcase>\n')
      out.write('</testsuite>\n')
      out.close()

      I have created a Mercurial project where improvements to this script will be checked in. Please contribute!!

      Update: I’ve included in the code base the XSL that is mentioned in the comments. Use whatever approach is more useful to you.

  6. The final step is to run the Post build actions that create the visual reports of the build process. Check the following items
    1. Check the checkbox and introduce in Test Report XMLs this string **/valgrindTestResults.xml . This tells Hudson to look for this specific file and include it as part of the result set.
    2. Next action is to select in order to show in the reports about Code Coverage and Cyclomatic Complexity. The exact values are:
    3. Add also the test results from CppUnit checking Publish testing tools result report and inserting in the text box **/cpptestresults.xml
    4. The last step is to create a widget for reporting a summary of the Cyclomatic complexity checking the Publish CCCC report and setting the main file to .cccc/cccc.xml

Save this options page and you are done!! Keep in mind that some reports are not displayed until some builds are in place. Any comment or improvement about the overall procedure is welcome. Enjoy it ;)

Categories: Programming Tags: , , , , ,