Implementing LDA

November 28th, 2011

Lately I have been playing with Latent Dirichlet Allocation (LDA) for a project at work. If for whatever reason you need to implement this algorithm, reading the walkthrough I did may save you some time.

First, make sure that LDA is the algorithm you are looking for. From a corpus of documents you will get K lists of words drawn from your documents, with a number assigned to each word denoting its relevance within the list. Each list represents a topic, and that list is your topic description: no fancy labels like “Computers”, “Biology” or “Meaning of life”, just a set of words that a human must interpret. You could always assign a single name by picking the most prominent word in the list, or by treating the list as a weighted vector and comparing it against a canonical topic description. So take a look at the first examples in this presentation and get inspired.
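To make that output concrete, here is a minimal sketch of turning a fitted topic-word matrix into ranked word lists and naive one-word labels. The names `top_words`, `phi` and `vocab`, as well as the numbers, are made up for illustration; they do not come from any particular library.

```python
def top_words(phi, vocab, n=3):
    """For each topic (a row of phi), return its n highest-weighted (word, weight) pairs."""
    topics = []
    for row in phi:
        ranked = sorted(zip(vocab, row), key=lambda pair: pair[1], reverse=True)
        topics.append(ranked[:n])
    return topics

# Hypothetical result of an LDA run with K = 2 topics over a 5-word vocabulary.
vocab = ["gene", "cpu", "cell", "disk", "dna"]
phi = [
    [0.40, 0.02, 0.30, 0.03, 0.25],  # a human would probably read this as "biology"
    [0.05, 0.50, 0.05, 0.35, 0.05],  # ... and this one as "computers"
]

topics = top_words(phi, vocab)

# Naive labelling: name each topic after its most prominent word.
labels = [topic[0][0] for topic in topics]
print(labels)  # ['gene', 'cpu']
```

The algorithm only ever gives you the weighted lists; "biology" and "computers" are interpretations that a person (or a comparison against a reference vector) has to supply.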

OK, so you need some code to test how this method behaves with your particular data. The first thing to try is the topicmodels package for the R statistical software. This will give you an idea of how the method works; you can then try it in a more serious Java application by means of the Mallet library.

But say that you need to create your own implementation, because Java horrifies you, because you need a parallel version, or for whatever other reason. The first thing to do is choose the inference method for your model: variational methods or Gibbs sampling. This post will give you some ideas for picking the right method for your particular problem. The original papers picked the variational approach, but I went with Gibbs sampling because I found this paper where all the mathematical derivations are nailed down. That way I was able to fully understand the method and, at the same time, be sure that my implementation was right and sound. If you need more guidance, take a look at this simple implementation to get an idea of the main functions and data structures you’ll have to code.
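For a feel of how small the core of the Gibbs approach is, here is a rough sketch of a collapsed Gibbs sampler in plain Python. It is not the code from the linked implementation; the function name `lda_gibbs`, the symmetric priors `alpha` and `beta`, and their default values are assumptions chosen for illustration, and documents are assumed to be lists of integer word ids.

```python
import random

def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of lists of word ids."""
    rng = random.Random(seed)
    n_dk = [[0] * n_topics for _ in docs]                # document-topic counts
    n_kw = [[0] * vocab_size for _ in range(n_topics)]   # topic-word counts
    n_k = [0] * n_topics                                 # tokens assigned to each topic
    z = []                                               # topic assignment of every token

    # Start from a random assignment and fill the count tables.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take the current token out of the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Full conditional p(z = t | everything else), up to a constant.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                # Resample the topic and put the token back.
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    # Point estimates taken from the final counts.
    phi = [[(n_kw[k][w] + beta) / (n_k[k] + vocab_size * beta) for w in range(vocab_size)]
           for k in range(n_topics)]
    theta = [[(n_dk[d][k] + alpha) / (len(doc) + n_topics * alpha) for k in range(n_topics)]
             for d, doc in enumerate(docs)]
    return phi, theta
```

The main data structures are just the three count tables plus the per-token assignments. A real implementation would add burn-in, averaging over several samples, and something faster than nested Python lists.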

Once you have your code written, you will have to check whether it is correct. The example in this paper, which uses pixel positions and pixel intensities instead of words and word counts, is very illustrative and will show you visually whether your implementation is correct. Once you have your algorithm up and running, perhaps you will want to scale it up to more machines; in that case you could benefit from reading this paper, and also from taking a look at this blog post from Alex Smola and their distributed implementation of LDA on GitHub.
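A synthetic check in that spirit can be sketched as follows. I am assuming the common "bars" setup here: the vocabulary is the set of pixels of a small square grid, and each true topic is one row or one column of the grid. The grid size, the function names and the defaults are my own choices, not taken from the paper.

```python
import random

def make_bar_topics(side=5):
    """2 * side topics, each uniform over the pixels of one row or one column."""
    topics = []
    for r in range(side):
        topics.append([r * side + c for c in range(side)])  # horizontal bar
    for c in range(side):
        topics.append([r * side + c for r in range(side)])  # vertical bar
    return topics

def sample_bar_corpus(n_docs=200, doc_len=100, side=5, alpha=1.0, seed=0):
    """Each document mixes the bar topics; each 'word' is a pixel index."""
    rng = random.Random(seed)
    topics = make_bar_topics(side)
    docs = []
    for _ in range(n_docs):
        # Unnormalised Dirichlet draw; rng.choices normalises the weights itself.
        mix = [rng.gammavariate(alpha, 1.0) for _ in topics]
        doc = []
        for _ in range(doc_len):
            k = rng.choices(range(len(topics)), weights=mix)[0]
            doc.append(rng.choice(topics[k]))
        docs.append(doc)
    return docs, topics

def render(topic_row, side=5):
    """ASCII picture of one topic-word distribution laid out on the grid."""
    peak = max(topic_row)
    return "\n".join(
        "".join("#" if topic_row[r * side + c] > 0.5 * peak else "." for c in range(side))
        for r in range(side)
    )

docs, true_topics = sample_bar_corpus(n_docs=10, doc_len=20)
```

Feed `docs` to your own sampler with 10 topics and a vocabulary of 25 words, then call `render` on each row of the topic-word matrix it returns. If the implementation is right, each picture should look like a single clean horizontal or vertical bar (in some arbitrary order); smeared or mixed bars usually point to a bug in the count updates.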

Happy coding!!!

People I will closely follow in 2011

November 29th, 2010

No one works in isolation. Everybody looks for inspiration, ideas or exemplary careers in their field. Especially in technology, we stand on the shoulders of giants. The people below are my giants for the next year. They are programmers, researchers or professors, and each of them works in a field of interest to me. Here they are:

Daniel Lemire

Daniel Lemire is a Canadian university professor working in the field of Recommender Systems. I first heard about him when, while researching high-performance recommender systems, I stumbled upon his simple yet effective Slope One recommender algorithm. Lately, however, he has been actively discussing on his blog the inner workings of the scientific community, the peer review process and the validity of the university model as the best way to disseminate scientific knowledge in an age with zero costs for accessing information. I find those posts very stimulating.

David MacKay

David MacKay is a professor at the Inference Group in the University of Cambridge. His book about Information Theory is one of the best-written ones I have read. It exposed me to Gaussian Processes in a clear and understandable way for the first time. But David is remarkable not only for the quality of his research: he is a professor who does not live in an ivory tower. He applies his research to help people with disabilities through the Dasher project. Data mining and information theory for the greater good. As an extra bonus, his interest in tackling global warming from a scientific point of view is very inspiring and necessary in an age when corporations dictate the agenda. And all his publications are freely accessible to everyone.

Michael I. Jordan

Michael I. Jordan is a difficult search term on the Internet :) . He is a professor at Berkeley working on machine learning and one of the first proponents of Bayesian Networks as learning models. Although I don’t know his work very well, a friend whose opinion I value a lot recommended him as the Midas of research: everything he publishes becomes a hit after a couple of years. So a sure bet to ride.

Bradford Cross

Bradford Cross is the only non-academic in this list. He is co-founder and head of research of FlightCaster, a small startup that predicts flight delays using Statistical Learning techniques. He strikes the balance between engineering and research that appeals to me the most. He can write posts about using Clojure for mining the web or for managing Hadoop and Cascade instances, and he also writes excellent posts about the best way to teach yourself Statistical Learning Theory.

Yann LeCun

Yann LeCun and Geoffrey Hinton are two researchers who started in the early 90′s doing work on Neural Networks. If you have taken any course on Neural Networks, you have probably seen mention of the 10-digit character recognition dataset and how NNs are used to learn it. This dataset was compiled by LeCun. In this decade, they have realized the limitations of the Neural Network architectures and they are now working in a new area of Machine Learning called Deep Learning. In deep learning architectures, the learning agents are chained in layers: lower layers work with low-level data and provide information to upper layers, which in turn use this information as their own low-level input. The ambitious goal is to generate true Artificial Intelligence, so it is worth watching whether they succeed or not.

Jürgen Schmidhuber

Marcus Hutter

Jürgen Schmidhuber and Marcus Hutter. Hutter worked under the supervision of Schmidhuber on a Theory of Universal Learning Machines that tries to unify Algorithmic Learning Theory with Reinforcement Learning. As the name suggests, this theory aims at building a universal problem solver and thus true AI. Marcus’ book has been sitting in my reading pile for a long time. It is a difficult read because it rapidly moves into layers of abstraction when it is supposed to be explaining learning algorithms, but I feel a strange attraction to this book anyway and I will surely read it in the future.

So this is my Hall of Fame for 2011. Lots of reading and learning.
