Archive

Posts Tagged ‘c++’

CSV parser in C++ with Boost and Template Metaprogramming

December 28th, 2011 No comments

One common situation when analyzing big chunks of data is to parse big CSV files with a record structure on each line. That is, all lines conform to a fixed schema where each line represents a record and each record has several columns of different types. Your objective is to parse and fill a data structure like this

struct foo {
    int a;
    std::string b;
    double c;
};

for each record. All the information needed for such parsing is in this data structure so I wondered whether I could write a program when you pass that type information and the parsing is done automatically for you. This is my first attempt at getting a flexible and fast CSV parser in C++. It is hosted in Github.

Design

When parsing CSV files the common usage pattern is to iterate through each line and perform a given action on each record. No need for a big CSV container or something similar, so the best approach is to write a class that acts as an iterator. When derreferencing the iterator it will return the provided data structure filled with parsed data from the current line.

An iterator is associated to a container but in this case, it will be constructed accepting an std::istream representing the CSV file. By accepting this istream we will be able to parse strings using the std::stringstream class, regular files, or compressed files using the Boost Iostream library.

The iterator must have all the common operators for it to interoperate with the STL algorithms seamlessly. The empty constructor will mark the iterator’s end-of-range position that coincides with the end of file. A typical use case will be to have some code like this:

#include <fstream>
#include <boost/tuple/tuple.hpp>
#include <csv_iterator.hpp>

using namespace boost::tuples;
typedef tuple<int,std::string,double> record;

void myFunc(const record& value){
    // Your code here
}

int main(int argc, const char *argv[])
{
    // Example of csv_iterator usage
    std::ifstream in("myCsvFile.csv");
    csv::iterator<record> it(in);

    // Use the iterator in your algorithms.
    std::for_each(it, csv::iterator<record>(), myFunc);

    return 0;
}

for parsing CSV files like this:

1,hola,3.14
2,adios,2.71

Implementation

For obtaining the values on each line, the library uses the Boost Tokenizer library. From a string you create a tokenizer that splits the input string by the character delimiter that by default is the comma. It also takes care of escaping characters. Accessing the different tokens is granted by the tokenizer by providing a token iterator.

using namespace boost;
typedef tokenizer<escaped_list_separator<char> > Tokens;
Tokens tokens(currentLine);
// Tokens can be accessed using tokens.begin() and tokens.end() iterators

Once we have the strings representing different values we have to parse and convert them into a type. Bad data format exceptions happen here and can be spotted earlier. For parsing using an unified approach the library uses the Boost Lexical Cast Library. This library provides a uniform and portable interface for doing text conversions

int myInt = boost::lexical_cast<int>("5");
double myDouble = boost::lexical_cast<double>("3.14");

For the record data structure the library uses the Boost Tuple Library that provides a Plain “old” Data Structure for storing the record fields. Moreover this class provides some template infrastructure that helps in the metaprogramming trick that follows.

Template metaprogamming

For our library the number of columns and the datatype is implicit in the record type. Our algorithm for parsing is basic, depending on the field type, parse the Nth string accordingly, and assign the result to the Nth field. Repeat for all the record fields. This parametric polymorphism must be combined with the dynamic access of the string tokenization. The former can be obtained in C++ using Template metaprogramming. The code that makes the trick is this one

template<class Tuple, int N >
    struct helper {
        static void fill(Tuple& tuple, strIt& it, strIt& end){
            using namespace boost::tuples;
            typedef typename element<length<Tuple>::value-N-1,Tuple>::type value_type;
            checkIteratorRange(it,end);
            get<length<Tuple>::value-N-1>(tuple) = boost::lexical_cast<value_type>(it->c_str());
            ++it;
            helper<Tuple,N-1>::fill(tuple,it,end);
        }
    };

template<class Tuple>
    struct helper<Tuple, 0> {
        static void fill(Tuple& tuple, strIt& it, strIt& end){
            using namespace boost::tuples;
            typedef typename boost::tuples::element<length<Tuple>::value-1,Tuple>::type value_type;
            checkIteratorRange(it,end);
            boost::tuples::get<length<Tuple>::value-1>(tuple) = boost::lexical_cast<value_type>(it->c_str());
            ++it;
        };
    };

template<class Tuple>
    struct filler {
       static void fill(Tuple& tuple, strIt&& it,strIt&& end){
            checkIteratorRange(it,end);
            helper<Tuple, boost::tuples::length<Tuple>::value-1>::fill(tuple,it,end);
        }
    };

Yes, C++ syntax sucks!! But basically what we are doing here is a common pattern of functional programming, tail recursion. We define structures that contain static functions. The filler structure just initializes the recursive call by instantiating an instance of the helper paremeterized structure setting the length recursion to the number of fields in the tuple. This structure has to be specialized for the 0 value in order for the recursion to stop. All this functionality is done behind the curtain by the compiler (compilation time increases when using template metaprogramming) but the generated code will be something very similar to this pseudocode:

boost::tuples::get<0>(tuple) = (casting_here)*it;
++it;
boost::tuples::get<1>(tuple) = (casting_here)*it;
++it;
boost::tuples::get<2>(tuple) = (casting_here)*it;
++it;

almost identical to the code we would have written by hand. The compiler is the one who knows how many fields our structure has at compile time and generates as many efficient instructions as needed.

Conclusion

I wanted to stretch my programming muscles by coding a generic, flexible and fast CSV parser in C++ using template metaprogramming. The generic and flexible parts of the task have been accomplished but not the performance objectives. Although the library is fast enough for most tasks, it isn’t in a scenario of Big Data parsing big files. The tokenizer iterator is incurring a performance penalty each time I try to derreference it, since it creates and allocates memory for a std::string each time we invoke *it. This memory allocation is a performance drain doing useless work because we need this information only for parsing and getting a value, but the string is discarded thereafter. It would be better to perform an in-place parsing using the string allocated when reading the lines of the file. To that end it will be enlightening in the future to try this same exercise with more performant string libraries like the C++ String Toolkit and see the differences in performance.

Categories: Programming Tags: , , ,

Maximal clique enumeration using C++0x and the STL

August 29th, 2011 No comments

Lately I’ve started to take a look at how parallelism is being done in a pure functional language like Haskell and their related technologies Haskell in the Cloud and Data Parallel Haskell. As a proof of concept and in order to better learn those techniques, I want to parallelize the Bron-Kerbosch algorithm that returns the set of maximal cliques in a graph.The Bron-Kerbosch algorithm in pseudocode is something like this:

function bronKerbosch()
compsub = []
cand = V
not = []
cliqueEnumerate(compsub, cand, not)

and the cliqueEnumerate function in pseudo-code:

function cliqueEnumerate(compsub, cand, not)
if cand = [] then
    if not = [] then
        Output compsub
else
    fixp = The vertex in cand that is connected to the greatest number of other vertices in cand
    cur_v = fixp
    while cur_v != NULL do
        new_not = All vertices in not that are connected to cur_v
        new_cand = All vertices in cand that are connected to cur_v
        new_compsub = compsub + cur_v
        cliqueEnumerate(new_compsub, new_cand, new_not)
        not = not + cur_v
        cand = cand - cur_v
        if there is a vertex v in cand that is not connected to fixp then
           cur_v = v
        else
           cur_v = NULL

This pseudocode is from the paper, A scalable, parallel algorithm for maximal clique enumeration


I have written a first version of this algorithm in Haskell but, as a baseline and because I wanted to test the new features of the new C++0x standard, I’ve written this algorithm in C++ making extensive use of the STL and the lambdas. I forgot how verbose C++ is but the addition of the keyword auto and the lambdas has somehow alleviated it a little.

void Graph::cliqueEnumerate(const vector<int>& compsub,
                     vector<int> cand,
                     vector<int> cnot,
                     vector<vector<int> >& result) const {

    // Function that answer whether the node is conected
    if(cand.empty()){
        if(cnot.empty()){
            // New clique found
            result.push_back(compsub);
        }
    } else{
        int fixp = findFixP(cand);
        int cur_v = fixp;

        while(cur_v != -1){
            vector<int> new_not;
            vector<int> new_cand;
            vector<int> new_compsub;

            // Auxiliar lambda function
            auto isConected =[cur_v,this](int x) {
                const vector<int>& edges = this->getEdges(x);
                return find(edges.begin(), edges.end(), cur_v) != edges.end();
            }; 

            // Compose new vector
            // Avoid performance bottlenecks by reserving memory before hand
            new_compsub.reserve(compsub.size()+1);
            new_not.reserve(cnot.size());
            new_cand.reserve(cand.size());
            copy_if(cnot.begin(),cnot.end(),back_inserter(new_not),isConected);
            copy_if(cand.begin(),cand.end(),back_inserter(new_cand),isConected);
            copy(compsub.begin(), compsub.end(), back_inserter(new_compsub));
            new_compsub.push_back(cur_v);

            // Recursive call
            cliqueEnumerate(new_compsub, new_cand, new_not, result);

            // Generate new cnot and cand for the loop
            cnot.push_back(cur_v);
            cand.erase(find(cand.begin(),cand.end(),cur_v));

            // Last check
            auto v = find_if(cand.begin(),
                             cand.end(),
                             [fixp,this](int x) {
                                const vector<int>& edges = this->getEdges(x);
                                return find(edges.begin(), edges.end(), fixp) == edges.end();
                             });

            // Obtain new cur_v value
            if(v != cand.end()) cur_v = *v;
            else break;
        }
    }
}

and the helper function findFixP is:

int Graph::findFixP(vector<int> cand) const {
    vector<int> connections;
    connections.resize(cand.size());

    // This is necessary for the set_intersection function
    sort(cand.begin(),cand.end());

    // Auxiliar lambda function
    auto intersection = [&cand, this](int x) -> int {
        const vector<int>& x_edges = this->getEdges(x);
        vector<int> intersection;

        set_intersection(x_edges.begin(), x_edges.end(),
                         cand.begin(), cand.end(),
                         back_inserter(intersection));
        return intersection.size();
    };

    // Create an auxiliar vector with the intersection sizes
    transform(cand.begin(),cand.end(),connections.begin(),intersection);

    // Find the maximum size and return the corresponding edge
    vector<int>::const_iterator it1, it2,itmax;
    int max = -1;
    itmax = cand.end();
    for(it1=cand.begin(),it2=connections.begin(); it1!=cand.end(); ++it1,++it2){
        if(max < *it2){
            max = *it2;
            itmax = it1;
        }
    }
    if(itmax == cand.end()) return -1;
    else return *itmax;
    }
}

For this function my first attempt was to write it using the std::max_element function, but I was worried that since the function we pass isn’t a transformation of the data but a comparison function (less), I was worried that on each comparison the set_intersection would be computed redundantly on each step.

There can be, for sure, room for improvement (any C++ guru in the audience?), but I’m pretty satisfied with the obtained implementation. It reads almost as the pseudocode. I think this is because I first wrote the Haskell version and the C++ has the functional flavor in it. Would I write first the C++ version and there would be for sure lots of nasty loops and array indexes.

Categories: Programming Tags: , ,

Continuous Integration for C++ using Hudson

October 15th, 2010 4 comments

This post is about how to setup Hudson for continuous integration (CI) of a C++ project. This is mainly for my own (future) reference and I expect that it will be continuously improved (pun intended) in the future.

Preparing the machine that will host the CI

Hudson will be hosted in a virtualized Linux server. The OS will be Ubuntu 10.04 because is the distribution I am most familiar with and because it is a long term support (LTS) version of the distribution. It is expected that this version will be stable enough and that bugs will be addressed and resolved quickly. There will be no physical machine where to install Ubuntu. I will install it on a Virtual Machine created using the VirtualBox virtualization software. Setting up a virtual machine using VirtualBox is beyond the scope of this post, I’ll assume that you have successfully installed Ubuntu and that it is up and running. If you are going to start this task right now just make sure you reserve 1G of memory for the virtual machine since the default 512M seems that is not enough for running the Tomcat server.

At this point you should have a newly installed Ubuntu 10.04 up and running. The next step is to install all the necessary software. It involves:

  • Sun’s Java SDK, to run the Tomcat server
  • Tomcat 6, it hosts Hudson
  • Mercurial, the example project is hosted on Bitbucket.org but you can use whichever
  • CMake, the multiplatform build system used in the example
  • CppUnit, testing library for those beautiful unit tests
  • CCCC, application to calculate cyclomatic complexity of our code
  • lcov, to produce nice code coverage reports of our unit tests
  • valgrind, that checks memory leaks in our code.

First append the repository where the Java SDK can be found

sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
sudo apt-get update

and now, install the packages

sudo apt-get install tomcat6 tomcat6-admin sun-java6-jdk mercurial cmake libcppunit-dev build-essential cccc lcov valgrind

Installing and configuring Tomcat and Hudson

Just some standard Tomcat configuration here. Change tomcat’s user password to any you can remember.

sudo passwd tomcat6

Download Hudson

cd /tmp
sudo -u tomcat6 wget http://hudson-ci.org/latest/hudson.war

Edit the file with tomcat’s users and create an administrator.

sudo vim /etc/tomcat6/tomcat-users.xml
<tomcat-users>
  <role rolename="tomcat"/>
  <role rolename="role1"/>
  <user username="tomcat" password="tomcat" roles="tomcat"/>
  <user username="both" password="tomcat" roles="tomcat,role1"/>
  <user username="role1" password="tomcat" roles="role1"/>
  <role rolename="manager"/>
  <user username="admin" password="admin1" roles="manager"/>
</tomcat-users>

And now, start the tomcat server

sudo chown -R tomcat6:tomcat6 /usr/share/tomcat6
sudo /etc/init.d/tomcat6 start

And now, to deploy Hudson, open a browser and go to this URL http://localhost:8080/manager/html. Use the admin user you previously created and in the dashboard you are seeing now, scroll down to the War file to deploy and select the hudson war that you downloaded in the /tmp directory. Deploy and if everything goes OK, Hudson will be installed and you can access it in this URL http://localhost:8080/hudson
Just for the curious of you, all the intermediate files and directories that Hudson uses go into /usr/share/tomcat6/.hudson

Now, you have Hudson up and ready, go to http://localhost:8080/hudson and navigate to the Hudson plugins page (Manage Hudson -> Manage Plugins -> Available) and select and install the following plugins:

  • Mercurial plugin
  • CppUnit plugin
  • CCCC plugin
  • HTML publisher plugin
  • Python Plugin

Install them and let Hudson to restart itself.

Configure the project

Time to monitor an existing project for continuous integration. Here, we will use a sample project described in a previous post. So, go to http://localhost:8080/hudson
and click on New Job

  1. Select a name for your project (Calc) and chose the option Build a free-style software project
  2. Put whatever description you want. Check also the Discard old builds, CI is about monitoring your actual project health not about having a build historical. I usually put 50 in the Max # of builds to keep
  3. In the Source Code Management section, select Mercurial as your VCS and put http://bitbucket.org/ancechu/cmake_sample_project as the Repository URL in the branch put, default.
  4. In the Build Triggers section, I like almost instant reaction to a build break. Hudson continuously polls the Mercurial server for changes, and when one happens it checkouts and builds the entire project. Select Poll SCM and in the Schedule write * * * * * . It means poll every single minute the repository.
  5. Ok, so here are the specific steps to build your project:
    1. First Add Build Step and select Execute shell script. The commands are:
      mkdir -p Release
      cd Release
      cmake -DCMAKE_BUILD_TYPE=Release ..
      make
      

      This creates the releasable version of our example calculator.

    2. Add another Shell Build Script step that is intended to create the unit test coverage reports. In order to get meaningful reports, the executables are compiled with debugging info included. lcov is used to create meaningful reports. Check that there is one line that removes coverage of uninteresting sources (those under /usr/include)
      # Run and execute tests with coverage info
      mkdir -p Debug
      mkdir -p coverage
      cd Debug
      cmake -DCMAKE_BUILD_TYPE=Debug -DCOVERAGE=True ..
      make
      cd ..
      lcov -d Debug/CMakeFiles/calc.dir/src/main/cpp -d Debug/CMakeFiles/UnitTester.dir/src/test/cpp --zerocounters
      Debug/UnitTester
      lcov -d Debug/CMakeFiles/calc.dir/src/main/cpp -d Debug/CMakeFiles/UnitTester.dir/src/test/cpp --capture --output-file coverage/calc.info
      lcov -r coverage/calc.info "/usr/include/*" -o coverage/calc.info
      genhtml -o coverage coverage/calc.info
    3. Again, another Execute shell step for creating C++ metrics
      cccc `find src -name *.cpp`
    4. This step is a Execute shell step with the goal of creating a report about the possible memory leaks when executing our tests
      valgrind --xml=yes --xml-file=valgrind.xml Release/UnitTester
    5. The last step is the execution of a Python script that converts the XML output from valgrind to XML output that resembles the xUnit output schema. The idea is to treat memory leaks or other memory related problems as failed unit tests. This is a quick fix for integrating Valgrind in Hudson, if you know a better way please let me know. Again, Add Build Step but this time Python script
      import xml.etree.ElementTree as ET
      doc = ET.parse('valgrind.xml')
      errors = doc.findall('//error')
      
      out = open("valgrindTestResults.xml","w")
      out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
      out.write('<testsuite name="memcheck" tests="1" errors="0" failures="'+str(len(errors))+'" skip="0">\n')
      out.write('    <testcase classname="ValgrindMemoryCheck " \n')
      out.write('              name="Memory check" time="0">\n')
      for error in errors:
      	kind = error.find('kind')
      	what = error.find('what')
      	if  what == None:
      		what = error.find('xwhat/text')
      
      	out.write('        <error type="'+kind.text+'">\n')
      	out.write('            '+what.text+'\n')
      	out.write('        </error>\n')
      out.write('    </testcase>\n')
      out.write('</testsuite>\n')
      out.close()

      I have created a Mercurial project where improvements to this script will be checked in. Please contribute!!

      Update: I’ve included in the code base the XSL that is mentioned in the comments. Use whatever approach is more useful to you.

  6. The final step is to run the Post build actions that create the visual reports of the build process. Check the following items
    1. Check the checkbox and introduce in Test Report XMLs this string **/valgrindTestResults.xml . This tells Hudson to look for this specific file and include it as part of the result set.
    2. Next action is to select in order to show in the reports about Code Coverage and Cyclomatic Complexity. The exact values are:
    3. Add also the test results from CppUnit checking Publish testing tools result report and inserting in the text box **/cpptestresults.xml
    4. The last step is to create a widget for reporting a summary of the Cyclomatic complexity checking the Publish CCCC report and setting the main file to .cccc/cccc.xml

Save this options page and you are done!! Keep in mind that some reports are not displayed until some builds are in place. Any comment or improvement about the overall procedure is welcome. Enjoy it ;)

Categories: Programming Tags: , , , , ,

CMake project skeleton

March 22nd, 2010 1 comment

When creating a new project from scratch, there are a lot of tasks that are repetitive and uninteresting. Creating makefiles, defining the project structure, etc… Usually you rely on your IDE to perform those tasks when you are on your own when creating CMake projects. I’ve created a simple project skeleton to speed up the setup of new C++ projects.

Sample project

You can access the project here. Just clone it and start the hacking.

Directory structure

Recently I have been working with Maven for Java development. I love the idea of Convention over Configuration (CoC). When coding, there are zillions of things you have to put in your head, why should you waste your precious brain cycles remembering in which folder are those configuration files your colleague created last week. You agree on an arbitrary convention and you stick to it. Suddenly some neurons are freed and you can use them to learn the next framework. When organizing my folders I have followed the Maven conventions, one big CMakeLists.txt file in the root directory, a src folder divided into main and test having each of them subfolders on a “per-technology” basis. Object files and executables go to a separated folder. In the figure you can see the directories structure.

Unit testing

In order to provide some functionality, the project creates a dynamic library with the functionality of a simple calculator. To test that this functionality works properly, a unit test has been created using the CppUnit framework. The project builds a test driver called UnitTester. This executable can be run without parameters and then it runs all the tests that were previously registered using the CPPUNIT_TEST_SUITE_REGISTRATION. If you want to run only a given set you specify it in the command line, useful for testing the test you are working on.

I don’t use the CTest functionality in this project and so, the make test command does not work. I did this because I preferred to have an executable in the root binary folder instead of buried in the folder structure. Some situations like running Valgrind to check for memory leaks or controlling the exact location of the executable compensate the change.

Packaging

In the cmake folder there are two scripts for defining the packaging. You can create a RPM or a DEB only by (un)commenting the corresponding lines in the root CMakeLists.txt.

Future development

I plan to continue updating this project skeleton in the future. Some issues that I will surely improve will be:

  • Find a satisfying solution for running the tests using the CTest philosophy.
  • Include the code that finds common libraries like Boost Done! 2010-05-27
  • Given that CMake is a crossplatform build environment, check that this project can run seamlessly in Windows or Mac, currently only tested in Linux.

Categories: Programming Tags: , ,