Waffles: command-line tools for machine learning and data mining (sourceforge.net)
105 points by rkda on Dec 26, 2011 | 12 comments


This is nice, and so are libsvm and a few commercial products I've used (JMP), but since I have some fairly non-trivial data sets for work, I've found myself implementing most of these by hand to get the results I want. I'm not in an internet data mining field; most of my data comes from semiconductors. So do these actually work well (as in 85%+ accuracy) on those types of data sets?


This is highly problem dependent. libsvm will give some of the best results off-the-shelf, assuming that your input features are sane. If not, you might need to do something more sophisticated. You can email me (see profile) if you want more information.


At first glance: This could be a complementary tool to WEKA. It has some features that WEKA does not have. To name one: non-linear feature selection (e.g., Manifold Sculpting). It has extensive visualization libraries (2D and 3D). It can store a sparse representation of data, which is a huge memory saver for text mining (NLP). The biggest complaint would be that I don't see how this tool could do feature selection INSIDE the cross-validation loop. It seems that the authors are unaware that feature selection on the whole data set is prone to overfitting.
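To make the cross-validation point concrete, here is a minimal sketch in plain Python (not Waffles or WEKA code). Everything in it is an assumption made for illustration: the data is pure Gaussian noise with alternating labels, the selector ranks features by the gap between class means, and the classifier is a simple nearest-centroid rule. The helper names `top_k_features`, `nearest_centroid_predict`, and `cv_accuracy` are hypothetical.

```python
import random

def top_k_features(X, y, k):
    """Rank features by |mean(class 1) - mean(class 0)| and return the top k indices."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    def gap(j):
        return abs(sum(r[j] for r in pos) / len(pos) - sum(r[j] for r in neg) / len(neg))
    return sorted(range(len(X[0])), key=gap, reverse=True)[:k]

def nearest_centroid_predict(X_train, y_train, feats, x):
    """Predict the class whose centroid (over the chosen features) is closest to x."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        rows = [r for r, t in zip(X_train, y_train) if t == label]
        dist = sum((x[j] - sum(r[j] for r in rows) / len(rows)) ** 2 for j in feats)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def cv_accuracy(X, y, k, folds, select_inside):
    """Cross-validated accuracy; select_inside=False picks features once on ALL data (leaky)."""
    n = len(X)
    global_feats = None if select_inside else top_k_features(X, y, k)
    correct = 0
    for f in range(folds):
        test_idx = set(range(f, n, folds))
        X_tr = [X[i] for i in range(n) if i not in test_idx]
        y_tr = [y[i] for i in range(n) if i not in test_idx]
        # The honest variant repeats feature selection using only the training fold.
        feats = top_k_features(X_tr, y_tr, k) if select_inside else global_feats
        correct += sum(nearest_centroid_predict(X_tr, y_tr, feats, X[i]) == y[i]
                       for i in test_idx)
    return correct / n

# Assumed toy setup: 40 samples, 500 noise features, labels unrelated to the features.
rng = random.Random(0)
n, p = 40, 500
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [i % 2 for i in range(n)]

leaky = cv_accuracy(X, y, k=10, folds=5, select_inside=False)
honest = cv_accuracy(X, y, k=10, folds=5, select_inside=True)
print(f"selection on whole data set: {leaky:.2f}")
print(f"selection inside CV loop:    {honest:.2f}")
```

Since the labels carry no signal here, any accuracy well above chance in the first number comes from the test rows having influenced which features were kept. The second number should hover around 0.5.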

Note to self: It has mean margin trees but no SVM? Thread safe? Portable to R? C++ codebase. Why SHA?


'Do one thing and do it well' doesn't necessarily mean command-line programs. I would much prefer a binary -> library and bindings to various other languages through that library; the command line is a very sloppy integration tool compared to programming languages. Shogun-toolbox currently fits what I want except for the licensing.


From the link:

> Waffles apps are thin wrappers around functionality in a well-documented C++ class library.


Thank you, didn't see that.


This looks really useful. If anyone has used it, can you contrast it with Weka?


This is much nicer than Weka in my opinion. Less bloat (not Java, and doesn't have the full range of algorithms), more emphasis on difficult problems (non-linear) and practicality (random forests, neural nets, and automating everything). The main creator is working on practical computer vision and not machine learning research directly.

Weka is more suited for teaching than practical work.


There is a lot that bothers me about this post. First, that a library contains many different algorithms in a single framework is not a negative, nor bloat. Second, there is nothing about random forests or neural nets that makes them any more practical than maximum entropy or any other learning algorithm.

The main researchers are both doing university research work, not building products. I'm not sure where the practical/teaching arguments come in; Weka is used in many different scenarios, including commercial systems.

A brief look at the documentation suggests that this package is not nearly as extensive as Weka. Startups might care about the licence being LGPL instead of GPL. I can't comment on convenience and performance without using it a bit, but I've found that other command line driven packages are very easy to use for exploratory type research work.


I have used Weka in a commercial setting. Usually it is for data analysis and trying many ML techniques. Once I have my data cleaned up and analyzed, I will go and implement my own algorithm (in C). Thus Weka is immensely valuable as an analysis tool.



Any potential bioinformatics use for this?



