Archive

Posts Tagged ‘algorithm’

Implementing LDA

November 28th, 2011 No comments

Lately I was playing with Latent Dirichlet Allocation (LDA) for a project at work. If for whatever reason you need to implement such algorithm perhaps you will save some time reading the walkthrough I did.

First you must be sure that LDA is the algorithm you are looking for. From a corpus of documents you will get K lists with words from your documents in them with a number assigned to each word denoting the relevance of the given word in the lists. Each list represents a topic, and that would be your topic description, no fancy words like “Computers”, “Biology” or “Life meaning”, just a set of words that a human must interpret. You could always assign a single name by picking the most prominent word in the list or treating the list as a valued vector and comparing it against a canonical topic description. So take a look at the first examples in this presentation and get inspired.

OK so you need some code to test how this method behaves with your particular data. The first thing to try is the topicmodels package from the R   statistical software package. This can give you an idea of the method and try to use it in a more serious Java application by means of the Mallet library.

But say that you need to create your own implementation because Java horrifies you or because you need a parallel version or whatever the reason. The first thing you have to do is to choose the inference method of your model between variational methods or gibbs sampling. This post will give you some ideas for picking the right method for your particular problem. The original papers picked the variational approach but I went through the Gibbs sampling method because I found this paper where all the mathematical derivations are nailed down. That way I was able to fully understand the method and at the same time being sure that my implementation was right and sound.  If you need more guidance, take a look at this simple implementation for getting an idea of the main functions and data structures you’ll have to code.

Once you have your code written you will have to check whether it is correct or not. The example in this paper using pixel positions and pixel intensities instead of words and word counts is very illustrating and will show visually the correctness of your implementation. Once you have your algorithm up and running perhaps you want to scale it up to more machines, so you could benefit from reading this paper  and taking also a look at this blog post from Alex Smola and their distributed implementation of LDA on Github.

Happy coding!!!

Maximal clique enumeration using C++0x and the STL

August 29th, 2011 No comments

Lately I’ve started to take a look at how parallelism is being done in a pure functional language like Haskell and their related technologies Haskell in the Cloud and Data Parallel Haskell. As a proof of concept and in order to better learn those techniques, I want to parallelize the Bron-Kerbosch algorithm that returns the set of maximal cliques in a graph.The Bron-Kerbosch algorithm in pseudocode is something like this:

function bronKerbosch()
compsub = []
cand = V
not = []
cliqueEnumerate(compsub, cand, not)

and the cliqueEnumerate function in pseudo-code:

function cliqueEnumerate(compsub, cand, not)
if cand = [] then
    if not = [] then
        Output compsub
else
    fixp = The vertex in cand that is connected to the greatest number of other vertices in cand
    cur_v = fixp
    while cur_v != NULL do
        new_not = All vertices in not that are connected to cur_v
        new_cand = All vertices in cand that are connected to cur_v
        new_compsub = compsub + cur_v
        cliqueEnumerate(new_compsub, new_cand, new_not)
        not = not + cur_v
        cand = cand - cur_v
        if there is a vertex v in cand that is not connected to fixp then
           cur_v = v
        else
           cur_v = NULL

This pseudocode is from the paper, A scalable, parallel algorithm for maximal clique enumeration


I have written a first version of this algorithm in Haskell but, as a baseline and because I wanted to test the new features of the new C++0x standard, I’ve written this algorithm in C++ making extensive use of the STL and the lambdas. I forgot how verbose C++ is but the addition of the keyword auto and the lambdas has somehow alleviated it a little.

void Graph::cliqueEnumerate(const vector<int>& compsub,
                     vector<int> cand,
                     vector<int> cnot,
                     vector<vector<int> >& result) const {

    // Function that answer whether the node is conected
    if(cand.empty()){
        if(cnot.empty()){
            // New clique found
            result.push_back(compsub);
        }
    } else{
        int fixp = findFixP(cand);
        int cur_v = fixp;

        while(cur_v != -1){
            vector<int> new_not;
            vector<int> new_cand;
            vector<int> new_compsub;

            // Auxiliar lambda function
            auto isConected =[cur_v,this](int x) {
                const vector<int>& edges = this->getEdges(x);
                return find(edges.begin(), edges.end(), cur_v) != edges.end();
            }; 

            // Compose new vector
            // Avoid performance bottlenecks by reserving memory before hand
            new_compsub.reserve(compsub.size()+1);
            new_not.reserve(cnot.size());
            new_cand.reserve(cand.size());
            copy_if(cnot.begin(),cnot.end(),back_inserter(new_not),isConected);
            copy_if(cand.begin(),cand.end(),back_inserter(new_cand),isConected);
            copy(compsub.begin(), compsub.end(), back_inserter(new_compsub));
            new_compsub.push_back(cur_v);

            // Recursive call
            cliqueEnumerate(new_compsub, new_cand, new_not, result);

            // Generate new cnot and cand for the loop
            cnot.push_back(cur_v);
            cand.erase(find(cand.begin(),cand.end(),cur_v));

            // Last check
            auto v = find_if(cand.begin(),
                             cand.end(),
                             [fixp,this](int x) {
                                const vector<int>& edges = this->getEdges(x);
                                return find(edges.begin(), edges.end(), fixp) == edges.end();
                             });

            // Obtain new cur_v value
            if(v != cand.end()) cur_v = *v;
            else break;
        }
    }
}

and the helper function findFixP is:

int Graph::findFixP(vector<int> cand) const {
    vector<int> connections;
    connections.resize(cand.size());

    // This is necessary for the set_intersection function
    sort(cand.begin(),cand.end());

    // Auxiliar lambda function
    auto intersection = [&cand, this](int x) -> int {
        const vector<int>& x_edges = this->getEdges(x);
        vector<int> intersection;

        set_intersection(x_edges.begin(), x_edges.end(),
                         cand.begin(), cand.end(),
                         back_inserter(intersection));
        return intersection.size();
    };

    // Create an auxiliar vector with the intersection sizes
    transform(cand.begin(),cand.end(),connections.begin(),intersection);

    // Find the maximum size and return the corresponding edge
    vector<int>::const_iterator it1, it2,itmax;
    int max = -1;
    itmax = cand.end();
    for(it1=cand.begin(),it2=connections.begin(); it1!=cand.end(); ++it1,++it2){
        if(max < *it2){
            max = *it2;
            itmax = it1;
        }
    }
    if(itmax == cand.end()) return -1;
    else return *itmax;
    }
}

For this function my first attempt was to write it using the std::max_element function, but I was worried that since the function we pass isn’t a transformation of the data but a comparison function (less), I was worried that on each comparison the set_intersection would be computed redundantly on each step.

There can be, for sure, room for improvement (any C++ guru in the audience?), but I’m pretty satisfied with the obtained implementation. It reads almost as the pseudocode. I think this is because I first wrote the Haskell version and the C++ has the functional flavor in it. Would I write first the C++ version and there would be for sure lots of nasty loops and array indexes.

Categories: Programming Tags: , ,