Archive

Archive for the ‘Data Mining’ Category

Implementing LDA

November 28th, 2011 No comments

Lately I was playing with Latent Dirichlet Allocation (LDA) for a project at work. If for whatever reason you need to implement such algorithm perhaps you will save some time reading the walkthrough I did.

First you must be sure that LDA is the algorithm you are looking for. From a corpus of documents you will get K lists with words from your documents in them with a number assigned to each word denoting the relevance of the given word in the lists. Each list represents a topic, and that would be your topic description, no fancy words like “Computers”, “Biology” or “Life meaning”, just a set of words that a human must interpret. You could always assign a single name by picking the most prominent word in the list or treating the list as a valued vector and comparing it against a canonical topic description. So take a look at the first examples in this presentation and get inspired.

OK so you need some code to test how this method behaves with your particular data. The first thing to try is the topicmodels package from the R   statistical software package. This can give you an idea of the method and try to use it in a more serious Java application by means of the Mallet library.

But say that you need to create your own implementation because Java horrifies you or because you need a parallel version or whatever the reason. The first thing you have to do is to choose the inference method of your model between variational methods or gibbs sampling. This post will give you some ideas for picking the right method for your particular problem. The original papers picked the variational approach but I went through the Gibbs sampling method because I found this paper where all the mathematical derivations are nailed down. That way I was able to fully understand the method and at the same time being sure that my implementation was right and sound.  If you need more guidance, take a look at this simple implementation for getting an idea of the main functions and data structures you’ll have to code.

Once you have your code written you will have to check whether it is correct or not. The example in this paper using pixel positions and pixel intensities instead of words and word counts is very illustrating and will show visually the correctness of your implementation. Once you have your algorithm up and running perhaps you want to scale it up to more machines, so you could benefit from reading this paper  and taking also a look at this blog post from Alex Smola and their distributed implementation of LDA on Github.

Happy coding!!!

Getting started with JAGS for Bayesian Modelling

September 13th, 2011 No comments

In the past if you wanted to do some Bayesian Modelling using a Gibbs sampler you had to rely on Winbugs but it isn’t open source and it only runs in Windows. The JAGS project started as a full feature alternative for the Linux platform. Here are some instructions for getting started

First install the required dependencies. In my Ubuntu 11.04, is something like:

sudo apt-get install gfortran liblapack-dev libltdl-dev r-base-core

For Ubuntu 11.04 you have to install JAGS from sources, but it seems that this version will be packaged in the next Ubuntu release. Download the software from here and install.

tar xvfz JAGS-3.1.0.tar.gz
cd JAGS-3.1.0
./configure
make
sudo make install

Now fire R and install the interface package rjags

$ R
...
> install.package("rjags",dependencies=TRUE)

Now let's do something interesting (although pretty simple). Let's assume we have a stream of 1s and 0s with an unknown proportion of each one. From R we can generate such distribution with the command

> points <- floor(runif(1000)+.4)

that generates a distribution with roughly 40% of 1s and 60% of 0s. So, our stream consists of a sequence of 0s and 1s generated using the uniform(phi) distribution where the phi parameter equals 0.4.

If we don't know this parameter and we try to learn it, we can assume that this parameter has prior uniform in the range [0,1] and thus the model that describes this scenario in the Winbugs language is

model
{
    phi ~ dunif(0,1);
    y ~ dbin(phi,N);
}

In this model N and y are known, so we provide this information in order to estimate our unknown parameter phi. We create the model and query the resulting parameter distribution:

> result <- list(y=sum(points), N=1000)
> result
$y
[1] 393

$N
[1] 1000

> library(rjags)
Loading required package: coda
Loading required package: lattice
linking to JAGS 3.1.0
module basemod loaded
module bugs loaded
> myModel <- jags.model("model.dat", result)
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
   Graph Size: 5

Initializing model

> update(myModel, 1000)
  |**************************************************| 100%
> x <- jags.samples(myModel, c('phi'), 1000)
  |**************************************************| 100%
> x
$phi
mcarray:
[1] 0.3932681

Marginalizing over: iteration(1000),chain(1) 

>

So the inferred value of phi is 0.3932. One interesting thing in Bayesian statistics is that it does not estimate points, but probabilistic distributions over the parameters. We can see how the phi parameter was estimated by examining the Monte Carlo Chain and the distribution of the generated values during the simulation

> chain <- as.mcmc.list(x$phi)
> plot(chain)


Where we can see that the values for phi in the chain were centered around the 0.4, the true parameter value.

Categories: Data Mining Tags: , , ,