Archive

Author Archive

Why I am learning Haskell

November 5th, 2010 2 comments

I can not remember exactly when and why I heard of Haskell in the first time. What I can recall vividly is the moment when I decided that I was committed to learn the language. It happened when reading the article “Beating the averages” from Paul Graham. In this article, Paul Graham describes how he was coding in Lisp the software of his startup, an online store called ViaWeb that later was acquired by Yahoo and became Yahoo Store. He says that coding in Lisp was their main advantage against possible competitors. He even jokes about not worrying of competitors that where hiring Oracle or Java specialists, he knew they were doomed from then on. This article resonated inside me when I read it because at that time I was tinkering with the idea of starting a Micro-ISV as a side project. Haskell was going to be for me what Lisp was for Paul Graham, the high level language able to increase the productivity of a solo programmer by several folds.

On the other hand, learning Haskell is an stimulating intellectual task. Pure functional programming is something worth learning by itself. It will make you a better programmer even if you never happen to write a single functional line of code. Your imperative style will also improve. Like quiting smoking, Haskell is very easy to learn, I’ve learned it a lot of times. I started with a Spanish book, Razonando con Haskell, that makes a very good job introducing the language but that fails blatantly from a pedagogical point of view when you get into monads and the design of programs. But then, Real World Haskell came out and IMHO it has helped to push forward Haskell into a real mainstream language. If you are planning to learn the language don’t forget to read also Learn You a Haskell for Great Good! a high quality freely available Haskell book.

More recently, and related to my current work, Haskell seems the perfect tool for writing parallel programs. Haskell is recommended by data scientists for doing real programming. Due to its pure functional nature, bugs are easier to spot in Haskell because the language design difficults introducing side effects that are so common in other languages. Haskell also supports esoteric concurrent constructs you haven’t seen in other programming languages like Software Transactional Memory. But above all, pure code can be easily manipulated and reasoned about by other programs. That means that you, as Joe the programmer, can rely on the compiler for doing automatic parallelization. You can easily parallelize already functioning algorithms by just doing some code annotations, the compiler will do all the spawning and synchronization under the hood.

In this vein, there is an under explored area that I’d like to learn more about, Nested Data Parallelism. Unless you have been living in a cave during the past years, the MapReduce programming paradigm and its reference implementation Hadoop have taken the parallel world by storm. This new paradigm allows to process data volumes that were unthinkable some years ago. The MapReduce paradigm represents a pattern for parallelization called Flat Data Parallelism. This means that in these problems, data is more important than computation in contrast to usual scientific calculations where operations are the dominant factor. The flat word means that the problems that are best suited for this paradigm are those in which the data involved can easily be bucketed on equal sized bins and processed in parallel. MapReduce is the best tool for performing ETL in a data warehouse, indexing the web like Google does, log mining, or image transformations. Huge amounts of independent data, processed on parallel tasks. But what happens when you have irregular data like trees or graphs? You can not partition the data into equally sized bins because you don’t know the size of each bin. For instance, imagine that you want to multiply a big sparse matrix with a vector (an operation that happens a lot in Recommender Systems using Collaborative Filtering). Your first thought will be to split equally the rows in the matrix and send each bin to a thread or computing node, but because of the sparsity of the matrix, some rows are short while others are longer. That means that some bins take longer than others to compute introducing delays when the main program has to wait to gather the results. Data Parallel Haskell is an attempt to implement Nested Data Parallelism in the Haskell language itself. Simont Peyton Jones and other Haskell gurus are working hard to explore this new kind of parallelism using the facilities of the language.

Categories: Programming Tags:

Continuous Integration for C++ using Hudson

October 15th, 2010 4 comments

This post is about how to setup Hudson for continuous integration (CI) of a C++ project. This is mainly for my own (future) reference and I expect that it will be continuously improved (pun intended) in the future.

Preparing the machine that will host the CI

Hudson will be hosted in a virtualized Linux server. The OS will be Ubuntu 10.04 because is the distribution I am most familiar with and because it is a long term support (LTS) version of the distribution. It is expected that this version will be stable enough and that bugs will be addressed and resolved quickly. There will be no physical machine where to install Ubuntu. I will install it on a Virtual Machine created using the VirtualBox virtualization software. Setting up a virtual machine using VirtualBox is beyond the scope of this post, I’ll assume that you have successfully installed Ubuntu and that it is up and running. If you are going to start this task right now just make sure you reserve 1G of memory for the virtual machine since the default 512M seems that is not enough for running the Tomcat server.

At this point you should have a newly installed Ubuntu 10.04 up and running. The next step is to install all the necessary software. It involves:

  • Sun’s Java SDK, to run the Tomcat server
  • Tomcat 6, it hosts Hudson
  • Mercurial, the example project is hosted on Bitbucket.org but you can use whichever
  • CMake, the multiplatform build system used in the example
  • CppUnit, testing library for those beautiful unit tests
  • CCCC, application to calculate cyclomatic complexity of our code
  • lcov, to produce nice code coverage reports of our unit tests
  • valgrind, that checks memory leaks in our code.

First append the repository where the Java SDK can be found

sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
sudo apt-get update

and now, install the packages

sudo apt-get install tomcat6 tomcat6-admin sun-java6-jdk mercurial cmake libcppunit-dev build-essential cccc lcov valgrind

Installing and configuring Tomcat and Hudson

Just some standard Tomcat configuration here. Change tomcat’s user password to any you can remember.

sudo passwd tomcat6

Download Hudson

cd /tmp
sudo -u tomcat6 wget http://hudson-ci.org/latest/hudson.war

Edit the file with tomcat’s users and create an administrator.

sudo vim /etc/tomcat6/tomcat-users.xml
<tomcat-users>
  <role rolename="tomcat"/>
  <role rolename="role1"/>
  <user username="tomcat" password="tomcat" roles="tomcat"/>
  <user username="both" password="tomcat" roles="tomcat,role1"/>
  <user username="role1" password="tomcat" roles="role1"/>
  <role rolename="manager"/>
  <user username="admin" password="admin1" roles="manager"/>
</tomcat-users>

And now, start the tomcat server

sudo chown -R tomcat6:tomcat6 /usr/share/tomcat6
sudo /etc/init.d/tomcat6 start

And now, to deploy Hudson, open a browser and go to this URL http://localhost:8080/manager/html. Use the admin user you previously created and in the dashboard you are seeing now, scroll down to the War file to deploy and select the hudson war that you downloaded in the /tmp directory. Deploy and if everything goes OK, Hudson will be installed and you can access it in this URL http://localhost:8080/hudson
Just for the curious of you, all the intermediate files and directories that Hudson uses go into /usr/share/tomcat6/.hudson

Now, you have Hudson up and ready, go to http://localhost:8080/hudson and navigate to the Hudson plugins page (Manage Hudson -> Manage Plugins -> Available) and select and install the following plugins:

  • Mercurial plugin
  • CppUnit plugin
  • CCCC plugin
  • HTML publisher plugin
  • Python Plugin

Install them and let Hudson to restart itself.

Configure the project

Time to monitor an existing project for continuous integration. Here, we will use a sample project described in a previous post. So, go to http://localhost:8080/hudson
and click on New Job

  1. Select a name for your project (Calc) and chose the option Build a free-style software project
  2. Put whatever description you want. Check also the Discard old builds, CI is about monitoring your actual project health not about having a build historical. I usually put 50 in the Max # of builds to keep
  3. In the Source Code Management section, select Mercurial as your VCS and put http://bitbucket.org/ancechu/cmake_sample_project as the Repository URL in the branch put, default.
  4. In the Build Triggers section, I like almost instant reaction to a build break. Hudson continuously polls the Mercurial server for changes, and when one happens it checkouts and builds the entire project. Select Poll SCM and in the Schedule write * * * * * . It means poll every single minute the repository.
  5. Ok, so here are the specific steps to build your project:
    1. First Add Build Step and select Execute shell script. The commands are:
      mkdir -p Release
      cd Release
      cmake -DCMAKE_BUILD_TYPE=Release ..
      make
      

      This creates the releasable version of our example calculator.

    2. Add another Shell Build Script step that is intended to create the unit test coverage reports. In order to get meaningful reports, the executables are compiled with debugging info included. lcov is used to create meaningful reports. Check that there is one line that removes coverage of uninteresting sources (those under /usr/include)
      # Run and execute tests with coverage info
      mkdir -p Debug
      mkdir -p coverage
      cd Debug
      cmake -DCMAKE_BUILD_TYPE=Debug -DCOVERAGE=True ..
      make
      cd ..
      lcov -d Debug/CMakeFiles/calc.dir/src/main/cpp -d Debug/CMakeFiles/UnitTester.dir/src/test/cpp --zerocounters
      Debug/UnitTester
      lcov -d Debug/CMakeFiles/calc.dir/src/main/cpp -d Debug/CMakeFiles/UnitTester.dir/src/test/cpp --capture --output-file coverage/calc.info
      lcov -r coverage/calc.info "/usr/include/*" -o coverage/calc.info
      genhtml -o coverage coverage/calc.info
    3. Again, another Execute shell step for creating C++ metrics
      cccc `find src -name *.cpp`
    4. This step is a Execute shell step with the goal of creating a report about the possible memory leaks when executing our tests
      valgrind --xml=yes --xml-file=valgrind.xml Release/UnitTester
    5. The last step is the execution of a Python script that converts the XML output from valgrind to XML output that resembles the xUnit output schema. The idea is to treat memory leaks or other memory related problems as failed unit tests. This is a quick fix for integrating Valgrind in Hudson, if you know a better way please let me know. Again, Add Build Step but this time Python script
      import xml.etree.ElementTree as ET
      doc = ET.parse('valgrind.xml')
      errors = doc.findall('//error')
      
      out = open("valgrindTestResults.xml","w")
      out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
      out.write('<testsuite name="memcheck" tests="1" errors="0" failures="'+str(len(errors))+'" skip="0">\n')
      out.write('    <testcase classname="ValgrindMemoryCheck " \n')
      out.write('              name="Memory check" time="0">\n')
      for error in errors:
      	kind = error.find('kind')
      	what = error.find('what')
      	if  what == None:
      		what = error.find('xwhat/text')
      
      	out.write('        <error type="'+kind.text+'">\n')
      	out.write('            '+what.text+'\n')
      	out.write('        </error>\n')
      out.write('    </testcase>\n')
      out.write('</testsuite>\n')
      out.close()

      I have created a Mercurial project where improvements to this script will be checked in. Please contribute!!

      Update: I’ve included in the code base the XSL that is mentioned in the comments. Use whatever approach is more useful to you.

  6. The final step is to run the Post build actions that create the visual reports of the build process. Check the following items
    1. Check the checkbox and introduce in Test Report XMLs this string **/valgrindTestResults.xml . This tells Hudson to look for this specific file and include it as part of the result set.
    2. Next action is to select in order to show in the reports about Code Coverage and Cyclomatic Complexity. The exact values are:
    3. Add also the test results from CppUnit checking Publish testing tools result report and inserting in the text box **/cpptestresults.xml
    4. The last step is to create a widget for reporting a summary of the Cyclomatic complexity checking the Publish CCCC report and setting the main file to .cccc/cccc.xml

Save this options page and you are done!! Keep in mind that some reports are not displayed until some builds are in place. Any comment or improvement about the overall procedure is welcome. Enjoy it ;)

Categories: Programming Tags: , , , , ,

Distributed Version Control System with Dropbox and Mercurial

July 13th, 2010 1 comment

When developing software you should use a Version Control System like Subversion, Git or Mercurial. It will spare you some nasty nightmares and it can also add you the discipline of focused work if you use it in combination with TDD or other XP practices. I used Subversion for VCS but my personal working situation made SVN a poor choice.

My requisites for a truly distributed Version Control System can be summarized in the following list:

  • Be able to work I work with several computers (home, laptop, notebook, parents’ computers, …) and I want to be able to access the repository over the Internet. This can be solved using one of the existing public repositories like Github, Bitbucket or Sourceforge.
  • I need a repository backed up in the cloud for preventing data loss.
  • I need to create private repositories.
  • I need to work with the VCS even when I have no Internet connection. This requisite discards classical VCS like CVS or Subversion and only allows newer ones like Mercurial, Git or Darcs.

The solution I have found very satisfying is to create a repository in a folder controlled by Dropbox, treat this as the “main” repository and clone a working copy in a local workspace. After that, I can work off-line doing commits against my working copy and by means of the push and pull commands I synchronize each working copy on each of the different computers.

The only issue I have to take into account is to not push changes when I have no Internet connection and Dropbox can not synchronize to the cloud. If I would do that in two computers without connection I could incur on versioning conflicts that Dropbox could not resolve. This is not an issue because I can work locally as long as I want without any problem. Mercurial is able to track new commits and clone the local working copy just waiting for the moment when net connectivity is working again.

Check this great tutorial of Mercurial written by Joel Spolsky in order to learn how to work with this Distributed Version Control System.

PD. If you have found this article interesting and you don’t have a Dropbox account, please consider opening your account using this referral, we’ll both get extra space in our accounts ;)

Categories: Programming Tags: ,

CMake project skeleton

March 22nd, 2010 1 comment

When creating a new project from scratch, there are a lot of tasks that are repetitive and uninteresting. Creating makefiles, defining the project structure, etc… Usually you rely on your IDE to perform those tasks when you are on your own when creating CMake projects. I’ve created a simple project skeleton to speed up the setup of new C++ projects.

Sample project

You can access the project here. Just clone it and start the hacking.

Directory structure

Recently I have been working with Maven for Java development. I love the idea of Convention over Configuration (CoC). When coding, there are zillions of things you have to put in your head, why should you waste your precious brain cycles remembering in which folder are those configuration files your colleague created last week. You agree on an arbitrary convention and you stick to it. Suddenly some neurons are freed and you can use them to learn the next framework. When organizing my folders I have followed the Maven conventions, one big CMakeLists.txt file in the root directory, a src folder divided into main and test having each of them subfolders on a “per-technology” basis. Object files and executables go to a separated folder. In the figure you can see the directories structure.

Unit testing

In order to provide some functionality, the project creates a dynamic library with the functionality of a simple calculator. To test that this functionality works properly, a unit test has been created using the CppUnit framework. The project builds a test driver called UnitTester. This executable can be run without parameters and then it runs all the tests that were previously registered using the CPPUNIT_TEST_SUITE_REGISTRATION. If you want to run only a given set you specify it in the command line, useful for testing the test you are working on.

I don’t use the CTest functionality in this project and so, the make test command does not work. I did this because I preferred to have an executable in the root binary folder instead of buried in the folder structure. Some situations like running Valgrind to check for memory leaks or controlling the exact location of the executable compensate the change.

Packaging

In the cmake folder there are two scripts for defining the packaging. You can create a RPM or a DEB only by (un)commenting the corresponding lines in the root CMakeLists.txt.

Future development

I plan to continue updating this project skeleton in the future. Some issues that I will surely improve will be:

  • Find a satisfying solution for running the tests using the CTest philosophy.
  • Include the code that finds common libraries like Boost Done! 2010-05-27
  • Given that CMake is a crossplatform build environment, check that this project can run seamlessly in Windows or Mac, currently only tested in Linux.

Categories: Programming Tags: , ,

DFA libraries in Haskell

March 7th, 2010 No comments

Since I’m planning to write all the software for my final Master’s Thesis in Haskell, the first natural step is to look for libraries already implementing Probabilistic Discrete Finite Automata in Haskell. The probabilistic condition was too much constraining and I have been only able to find some implementations of simple DFA. Here are the ones I have found:

  • Halex is a library for manipulating regular languages. It’s a library developed at the University of Minho that also serves as the base lexer for the HaGLR (a generalized LR parser). It has the nice feature of generating Graphviz diagrams from the DFAs. It is distributed under the LPGL
  • In this page you can download the software implementation that is explained in this paper (A Collection of Functional Libraries for Theory of Computation). The software is distributed under the LPGL.
  • When browsing through the Haskell Wiki you can find the regular expressions implementations. But the implementation of reference for the DFAs seems that is embedded in a more complex library, the Compiler Toolkit for Haskell.
  • A number of implementations written by Haskell enthusiasts and documented in their blogs like this one and this set of articles.

Given this quick research of the available possibilities, the final choice will be to use the implementation described in the paper A Collection of Functional Libraries for Theory of Computation for the creation of a Probabilistic DFA library. Here are the reasons:

  • This is a LPGL library so I can use it without restrictions. I will try to get a clean code so I could contribute back.
  • The library seems that is focused on the DFA where other options are more focused on creating a good lexer to be included on a bigger piece of code like a parser or similar.
  • It is well documented and can help me to understand the underlying design decisions.
  • It is very small and I won’t get lost in the subtleties of a bigger implementation. Since I am a novice Haskeller this is a plus.

Categories: Thesis Tags: ,