Machine Learning (Theory)

9/28/2006

Programming Languages for Machine Learning Implementations

Tags: Language,Machine Learning jl@ 9:34 pm

Machine learning algorithms have a much better chance of being widely adopted if they are implemented in some easy-to-use code. There are several important concerns associated with machine learning which stress programming languages on the ease-of-use vs. speed frontier.

  1. Speed The rate at which data sources are growing seems to be outstripping the rate at which computational power is growing, so it is important that we be able to eke out every bit of computational power. Garbage collected languages (java, ocaml, perl and python) often have several issues here.
    1. Garbage collection often implies that floating point numbers are “boxed”: every float is represented by a pointer to a float. Boxing can cause an order of magnitude slowdown because an extra nonlocalized memory reference is made, and accesses to main memory can take many CPU cycles.
    2. Garbage collection often implies that considerably more memory is used than is necessary. This has a variable effect. In some circumstances it results in no slowdown while in others it can cause a slowdown of four orders of magnitude. The basic rule here is that you never want to run out of physical memory and use swap.
    3. Some of these languages are interpreted rather than compiled. As a rule of thumb, interpreted languages are an order of magnitude slower than compiled languages.
    4. Even when these languages are compiled, there are often issues with how well they are compiled. Compiling to a modern desktop processor is a tricky business, and only a few compilers do this well.
  2. Programming Ease Ease of use of a language is very subjective because it is always easiest to use the language you are most familiar with. Nevertheless, significant differences exist.
    1. Syntax Syntax is often overlooked, but it can make a huge difference in the ease of both learning to program and using the language. A good syntax allows you to study and improve the algorithm rather than the program implementing it. (Algorithmic improvements often yield the greatest speedups while optimizing.) The syntax needs to be concise (so that you can view the entire algorithm) and simple (so that it can become second nature).
    2. Library Support Languages vary dramatically in terms of library support, and having the right linear algebra/graphics/IO library can make a task dramatically easier. Perl has a huge number of associated libraries which greatly ease use.
    3. Built in Functionality Languages differ in terms of the primitives that are available. Which of these primitives are useful is a subject of significant debate.
      1. Some form of generic programming (templates, polymorphism, etc…) seems important.
      2. Functions as first class objects is often very convenient while programming.
      3. Built-in lists and hash tables are often extremely useful primitives. One caveat here is that when you make a speed optimization pass, you often have to avoid these primitives.
      4. Support for modularity is important. Objects (as in object oriented programming) are one example of this, but there are many others. The essential quantity here seems to be an interface creation mechanism.
    4. Scalability Scalability is where otherwise higher level languages often break down. A simple example of this is a language with file I/O built in that fails to perform correctly when the file has size 2^31 or 2^32. I am particularly familiar with Ocaml which has the following scalability issues:
      1. List operations often end up consuming the stack and segfaulting. The Unison crew were annoyed enough by this that they created their own “safelist” library with all the same interfaces as the list type.
      2. Arrays on a 32 bit machine can have only 2^22-1 elements due to dynamic type information set aside for the garbage collector. As a substitute, there is a “big array” library. However, having big arrays, arrays, and strings, often becomes annoying because they have different interfaces for objects which are semantically the same.
    5. Familiarity This isn’t just your familiarity, but also the familiarity of other people who might use the code. At one extreme, you can invent your own language (as Yann LeCun has done with Lush). At the other extreme, you can use a language which many people are familiar with such as C or Java.
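
A rough way to see the boxing cost described under Speed is to compare a Python list of floats with the standard library `array` type. This is only a sketch, assuming CPython: the byte counts reported by `sys.getsizeof` are implementation details, and the helper name `boxed_vs_unboxed_bytes` is made up for this illustration. It measures memory only, not the extra memory reference paid on each access.

```python
import sys
from array import array


def boxed_vs_unboxed_bytes(n):
    """Return (boxed, unboxed) approximate byte counts for n distinct floats."""
    values = [float(i) + 0.5 for i in range(n)]

    # Boxed: the list stores one pointer per element, and each element is a
    # separately allocated float object with its own header.
    boxed = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

    # Unboxed: array('d') stores the raw C doubles back to back.
    unboxed = sys.getsizeof(array("d", values))

    return boxed, unboxed


if __name__ == "__main__":
    boxed, unboxed = boxed_vs_unboxed_bytes(100_000)
    print(f"boxed list:    {boxed} bytes")
    print(f"unboxed array: {unboxed} bytes")
    print(f"ratio:         {boxed / unboxed:.1f}x")
```

On a typical 64-bit CPython the boxed list comes out a few times larger, since each element costs a pointer plus a separate float object.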

The existing significantly used machine learning code bases seem to favor lower level languages.

  1. Weka is one of the best known general purpose machine learning toolkits. It is implemented in Java and has far more algorithmic coverage than any other system I know of.
  2. LibSVM is implemented in C++ and Java.
  3. SVMlight is implemented in C.
  4. C4.5 is implemented in C.
  5. SNNS is written in C.

Both R and Matlab are also useful languages, although I have not used them.

None of the language choices seem anywhere near ideal. The higher level languages often can’t execute fast and the lower level ones which can are often quite clumsy. The approach I’ve taken is to first implement in a higher level language (my choice was ocaml) to test ideas and then reimplement in a lower level language (C or C++) for speed where the ideas work out. This reimplementation step is quite clumsy and I would love to find a way to avoid it.

48 Comments to “Programming Languages for Machine Learning Implementations”
  1. Mark Reid says:

    Good post! I thought I might add some observations from my experience. I’ve used three different languages to implement machine learning ideas: Java, Haskell and Prolog.

    I’ve found Java (5.0) to be a good language for commercial research as it’s fast, has a portable and full-featured platform, good development tools (i.e., Eclipse) and you can always find Java developers to help work on your codebase, unlike some other languages. As Weka demonstrates, Java seems to allow for solid development in a team environment too.

    On the downside, Java sits awkwardly between high-level and low-level languages. Generics, the collections framework and other features make life tolerable but, by and large, it still feels overly verbose and shows too much of its C heritage.

    Haskell is a language you missed that is worth considering. It’s compiled and relatively fast (but does use a garbage collector I think). I also find Haskell code modular and easy to read. Also, because it’s functional, it is easily parallelised. Probably its biggest downside is that its third-party libraries aren’t as extensive as some of the others you mentioned.

    I mainly used Prolog for my recently submitted thesis on inductive transfer for relational rule learning. It’s a very idiosyncratic language but once I’ve got my Prolog hat on I find I can implement and test new ideas very quickly.

  2. Orange seems like a very nice collection of ML tools, and it works with python. Most of the algorithms are however implemented in C, so speed is less of an issue. I’ve only used bits of orange briefly, but I find python to be a fantastic language for quick prototyping and “explorative programming”.

  3. Shane Legg says:

    I think java is a pretty good compromise. We first coded Weka in C and Tk/Tcl on Sun workstations. It was fast to run, but slow and complex to develop and there was always the potential for problems when running on a different OS, compiler etc. Java isn’t quite as fast, but it’s not all that much slower either. Perhaps half the speed. Overall I think the move to java was a very good decision.

    I think another good option is Python. It has the platform independence of Java, easy to read, good libraries etc. The main problem is speed as it is much much slower than C++ or Java. In my own tests it was about 100 times slower than C++. The solution is to develop and test the code in Python, and then take just the speed critical loops etc. and change them to a variant of Python called Pyrex. Essentially you add type information, change the for loop notation slightly and a couple of other things. Then you run Pyrex on this bit of the code and it outputs C source code that you can then compile to a library. The rest of your Python code can then use these functions like normal, except that now they are much faster. In tests I and other people here have done, the resulting code is only slightly slower than if you’d written the whole system in C to start with. To me this seems to be the best of both worlds; rapid development with an easy to read language, and then a fairly easy process to get it working as fast as plain C.

  4. To add to the comments on Python, Orange, etc, Python like most good interpreted/dynamic languages has an excellent foreign function interface, allowing the speed-critical parts of the code to be written (or rewritten) directly in C/C++. In addition, tools like Weave allow the C code to be inserted directly in the Python program.

    Also, if an algorithm can be rewritten primarily as matrix operations, a good BLAS/LAPACK-based matrix library (e.g. Numeric Python) will let you push the most time-consuming parts of the code (the matrix iterations) down into the fast fortran or C code of the library. This also helps somewhat with the problem of boxed floating-points because the whole matrix or vector is boxed instead of each element. With hardware-optimized BLAS implementations like ATLAS you can also then take advantage of fast SIMD instructions (e.g. MMX, AltiVec, etc).
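
To make the foreign function interface point concrete, here is a minimal sketch using `ctypes` from the Python standard library (not Weave, Pyrex or Numeric, which the comments mention). It tries to bind the C function `sqrt` from the platform math library; which library is found, and whether it is found at all, depends on the system, so the sketch falls back to `math.sqrt` when the lookup fails. The helper name `load_c_sqrt` is made up for this example.

```python
import ctypes
import ctypes.util
import math


def load_c_sqrt():
    """Try to bind sqrt() from the platform's C math library via ctypes.

    Returns a Python callable, or None if no suitable library is found.
    """
    name = ctypes.util.find_library("m") or ctypes.util.find_library("c")
    if name is None:
        return None
    try:
        lib = ctypes.CDLL(name)
        c_sqrt = lib.sqrt
    except (OSError, AttributeError):
        return None
    # Declare the C signature: double sqrt(double). Without restype, ctypes
    # assumes an int return value; without argtypes, it does not know how to
    # convert a Python float into a C double.
    c_sqrt.argtypes = [ctypes.c_double]
    c_sqrt.restype = ctypes.c_double
    return c_sqrt


# Fall back to math.sqrt so the sketch still runs where the lookup fails.
c_sqrt = load_c_sqrt() or math.sqrt

if __name__ == "__main__":
    print(c_sqrt(2.0))
```

The same pattern applies to a speed-critical loop of your own: compile it into a shared library, declare its argument and return types once, and call it from the surrounding Python code.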

  5. Jan Peters says:

    If one can rewrite the code in terms of matrix-vector operations with few loops, I have never seen anything beat MATLAB. It shares most of the advantages of the interpreted languages while having one of the best libraries of Math Code available. Only when you have many nested loops does MATLAB lose its power!

  6. I don’t agree with your take on garbage collection. Paul Wilson at UT Austin led a number of influential studies that showed that GC often improves performance, simply because a good garbage collector is better at keeping memory footprint small than manual allocation and deallocation (not to mention dangling pointer and memory leak bugs). Boxing and unboxing are not much of an issue when numeric data are kept in arrays. A more significant problem is that separately allocated objects, whether in C or Java, incur a significant memory overhead (various kinds of headers and allocation round ups). When we developed the FSM library for speech processing at AT&T Labs, we went to great lengths to store small objects contiguously in memory (in C). That’s one memory advantage of C and C++ over Java that doesn’t have anything to do with GC. Smaller memory footprint and contiguous allocation also improve speed by reducing the pressure on caches. Nevertheless, I’ve found that Java is the best choice around for structured classification problems that combine complex data representations (sequences, trees, graphs, feature maps) and a lot of linear algebra and convex optimization. C/C++ would have been much worse because of the difficulty in managing memory correctly and efficiently. Python or MATLAB would be too slow for the data structures part (yes, I know that Python’s built-in data types are quite efficient, but an interpreted language pays a big price on complex algorithms). Functional languages (OCaml, Haskell) are attractive because so many ML algorithms have natural functional formulations, but they have not been optimized for numerical computation, and some algorithms (for example dynamic programming algorithms) are difficult to implement efficiently in purely functional terms. In the end, Java with a good profiler is the best compromise for us, even though it is excruciatingly verbose. And we can call super-efficient linear algebra codes via JNI if needed.

  7. Definitely an interesting post. Machine learning tools and methods can surely be developed in many languages. I am personally a fan of Matlab: multiplatform, widespread, large and active community, etc. But I only deal with numerical data, and never with strings, trees, or other data types.
    Does anyone have advice on which languages to force undergrad or PhD students to learn in an ML course or for a thesis in ML? Or is it maybe better to leave the choice up to them?
    BTW for those who are interested, I have compiled a list of machine learning toolboxes and software on delicious: http://del.icio.us/machinelearningsoftware. Feel free to suggest additions to the list!

  8. dc says:

    Although I am merely a machine learning spectator, I’d like to point out the Ruby NArray package which allows powerful matrix-vector manipulation with speeds just about equal to C (since Ruby extensions are written in C). The syntax and language features of Ruby make it much more expressive than say Python or (yech) Java, so if you write your code properly you can get compact, easy-to-maintain code with very good numeric speeds. Ruby’s support for functional and procedural programming without Java-style object kludges make it a more intuitive fit for many applications. Factor in the almost Lisp-ish ease of creating DSLs, and I think it’s definitely worth a look (especially once the bytecode-compiled Ruby 2.0 is released).

  9. jl says:

    I have difficulty understanding how GC can improve performance. My experience has always been that well written C code is significantly faster and has a smaller memory footprint than well written code in a GC language. My personal experience is in comparing Ocaml to C/C++, for example for the cover tree code. How can a GC language hope to beat optimized C when optimized C could actually implement garbage collection?

    I can believe that quickly written code in a GC language can outperform quickly written C code. It’s just that in C you can keep optimizing longer.

    I agree about separate allocation of small memory objects. I have often found optimizing this to yield significant benefits.

    One of the personal difficulties I’ve run into with Java is that the state of Java on Linux has historically been poor.

  10. Read the Paul Wilson papers. Counterintuitive, but well documented in his work. Here are a few reasons I recall: 1) carefully written programs are typically conservative about deallocation to avoid dangling pointers, which means a bigger memory footprint; 2) explicit “retail” freeing is more expensive than wholesale reclamation during GC; 3) typical modern multispace GC algorithms compact allocated memory, reducing fragmentation and improving cache performance.

    As for Java on Linux, we have been using 64-bit versions of Java 1.4 and 1.5 quite successfully for several years, with no evidence of major problems. There are many things I don’t like about Java, but I know from experience that Java’s better type system and automatic memory management make me much more productive than in C/C++.

  11. jay says:

    It is actually shocking how little the developments in programming languages help machine learning. Garbage collection, object oriented programming, functional programming, none of it is much help. The only big win is having multidimensional arrays like R, Matlab, Numerical Python, or from the looks of it that Ruby thing mentioned above.

  12. Regarding the garbage collection speed debate, let me recommend “Quantifying the Performance of Garbage Collection vs. Explicit Memory Management” by Hertz and Berger. This paper shows that garbage collection has important speed advantages when the heap size is much smaller than the total memory of the machine, but that explicit memory management (e.g. malloc) wins for applications that want to use all the memory that’s available.

    I notice that C# hasn’t been mentioned yet. C# has true VM-level generics (unlike Java 1.5’s compile-time generics) that allow you to use generic datastructures with primitive types without boxing. Plus, it has C-style value types and true multidimensional arrays. These features make it possible to do the kind of contiguous allocation that Fernando described earlier. Of course, there are downsides: the official implementation only works on Windows, and the unofficial one, while an amazing feat of engineering, is not 100% compatible.

  13. Matt Williams says:

    I have found that using Matlab and optimizing the slowest parts of code has worked very well for me. Matlab has a very easy to use profiler. It shows the time spent in each line of execution. By using the profiler, I was able to write about 100 lines of C code and now 80% of the computer time is spent running C code, and 95% of the code is in Matlab.

    I use Python as well for setting up a cluster of computers to run the Matlab simulations. I have found Python to be excellent and great for prototyping. However, it is slower than Matlab and would require more optimization. I am not sure if I will switch from Matlab to Python, but I am considering it.

  14. KT says:

    Some hints:

    1) If your data is way bigger than RAM, check out the mmap system call. That is what big databases use. If the operating system is doing things right, mmap is much faster than normal swapping. P.S. mmap is really fast in Linux. If you read from a file, adjusting buffer sizes for the read functions in C can speed up your code 5-10 times.

    2) Good modern garbage collectors (generational/copying combinations) don’t necessarily use much more memory than manual memory management in C. If a program runs for a long time, manual C style memory management fragments the memory. It has been proven that if allocated memory chunks are random in size, half of the memory is lost because of fragmentation. You can’t defragment in C. 98% of garbage is short lived. A good GC runs generational collection frequently to get rid of it. When memory is exhausted, copying GC is used. Copying GC copies only live data. Dead data does not consume any time at all. Memory is automatically compacted, and this means that big array allocations don’t fail as easily as in C (if the program runs for hours or days). If you adjust copying GC to use 2X the RAM in your system, the other half of the memory is in swap and is not actually wasting RAM. Only rarely, when copying/compacting occurs, is it touched again. Fast and clean.

    3) Check out the parameters for your GC. Same for malloc in C (you can allocate memory in many interesting ways in glibc). Usually GC parameters are set so that system response times are not too slow. Long batch jobs don’t need good real-time performance.

    4) You can make your own memory management in every language. Pool critical, frequently used data structures without releasing them.

    5) Check out Common Lisp. Commercial implementations from franz.com and lispworks.com are superb. SBCL, CMUCL, Corman Lisp (for windows). All these compile to native machine code.

    6) PEPITo runs on Lisp, Franz.com is selling it. http://www.franz.com/products/pepito/index.lhtml
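
Hint 1) above can be tried from Python as well as C, since the standard library exposes the same system call through the `mmap` module. The following is a minimal sketch under assumptions of my own: the file holds nothing but raw native-endian 8-byte doubles, and the names `write_doubles`, `mapped_sum` and `features.bin` are invented for the example. The operating system pages data in as the slices touch it, so the file does not have to fit in RAM.

```python
import mmap
import os
import tempfile
from array import array


def write_doubles(path, values):
    """Write the values to disk as raw native-endian 8-byte doubles."""
    with open(path, "wb") as f:
        array("d", values).tofile(f)


def mapped_sum(path, chunk_items=4096):
    """Sum all doubles in a non-empty file through a read-only memory map."""
    item_size = array("d").itemsize
    chunk_bytes = chunk_items * item_size
    total = 0.0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Walk the mapping one chunk at a time; each slice copies only
            # that chunk into Python memory.
            for start in range(0, len(mm), chunk_bytes):
                chunk = array("d")
                chunk.frombytes(mm[start:start + chunk_bytes])
                total += sum(chunk)
    return total


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "features.bin")
    write_doubles(path, [0.5, 1.5, 2.0])
    result = mapped_sum(path, chunk_items=2)
print(result)  # prints 4.0
```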

  15. KaraNagai says:

    I am currently considering moving my SVM codebase from Java to C++. The goal of course is improved performance and better control over memory allocation. But I have my doubts too.

    First of all, I use the same codebase on both Linux (in a server environment) and Windows (on the desktop for computational experiments). And the codebase includes working with images and their transformations in the frequency domain. While I implemented it using JAI (Java Advanced Imaging) in Java, I am not sure if I can access some meaningful and easy to use cross-platform libraries for C++.

    Secondly, the same cross-platform requirement applies strict limitations on what I can use under Windows. Switching to C++ I would gladly use .NET 2.0 on Windows to gain more meaningful debugging, application UI and data storage at low cost in terms of development, but that ruins any attempts to run the same code on the Linux server, I think.

    Finally, the performance of the code will only grow by some fixed factor. And if you take into account that most of the algorithms that require better performance are at least O(n^2), it would only allow handling slightly bigger datasets. Most likely not even twice as big. I am quite uncertain here about the effect of memory allocation though.
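
The arithmetic behind that last point can be spelled out. If running time grows like n^p and a rewrite makes the code s times faster, then in the same time budget the dataset can grow by a factor of s^(1/p). A minimal sketch, assuming the constant-factor speedup is the only change and ignoring memory effects (the function name `dataset_growth` is made up here):

```python
def dataset_growth(speedup, exponent):
    """Factor by which n can grow at equal running time.

    Assumes cost proportional to n**exponent and an implementation that
    becomes `speedup` times faster.
    """
    if speedup <= 0 or exponent <= 0:
        raise ValueError("speedup and exponent must be positive")
    return speedup ** (1.0 / exponent)


if __name__ == "__main__":
    for s in (2, 3, 4, 10):
        print(f"speedup {s}x with O(n^2): {dataset_growth(s, 2):.2f}x more data")
```

For a quadratic algorithm, anything short of a 4x speedup gives less than a doubling of the dataset size.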

  16. ingo says:

    One thing that I consider very important is development tool support. I quite like C++, and especially the generic programming style, but miss things such as fast refactoring, test-driven development (yeah, you can do it, but waiting for the linker all the time is no fun), and so on.

    I’ve recently migrated some computer vision code from C++ to Java, just because of that. It’s mostly memory-bound anyway, so speed stayed the same and I figure I can always use gcj for production (though not testing) if performance becomes an issue.

    Interpreted languages such as Python and JavaScript are very nice for rapid prototyping and when run on top of the JVM or the CLI, everything available below is instantly usable, too. I believe the same is true for Matlab.

  17. GB says:

    Of related interest might be this year’s NIPS workshop on Machine Learning Open Source Software,

    http://www2.fml.tuebingen.mpg.de/raetsch/workshops/MLOSS06

  18. mikiobraun says:

    I think from this discussion here it’s pretty obvious that there doesn’t exist the programming language for machine learning applications. If you want to develop new algorithms, you’re perfectly fine with some interpreted high-level language like matlab which allows you to quickly test out some ideas, play around with data, and visualize things very easily.

    As your algorithms mature and you’re heading for some real-world applications, things like performance and scalability become more and more important. Maybe additional requirements arise like being able to talk to some data base back-end etc., or having to interface to some already existing system. Also some support for modern programming language concepts might be fine to ensure that the whole thing stays maintainable (have you ever tried to write something huge in matlab?).

    When it comes to raw performance, I think it’s really hard to beat C, but you should always take into account the time it takes to produce a bug-free program in C compared to the same thing in, e.g., python. Also, there is this folk wisdom that your productivity measured in lines of code per day is roughly the same independently of the programming language used, such that you will want to use a language which is as powerful and expressive as possible.

    Concerning interpreted languages like matlab, performance depends a lot on whether you’re using builtin-types a lot. For example, if your algorithm uses mostly matrix operations, the performance penalty for using matlab instead of C will be very small, as most of the time will be spent within highly optimized linear algebra routines. On the other hand, matlab will be very slow when you actually have to write loops and conditionals in matlab itself (think of graph algorithms, for example).

    This fact is reflected in another folk wisdom, namely that you’ll spend 90% of your time in 10% of the code. So, as others have already pointed out, you can get a nice compromise in terms of ease of use vs. performance by first developing in a language like python and then optimizing the “hot spots” by implementing parts in C.

    So, I’d like to see a language which can be used both for the early development stages, as well as the later stages, when performance and scalability become important. Since I like to play around during development, I clearly prefer interpreted languages like python or matlab over java or C++ for this early stage. For the later stages you then need strong “glueing” capabilities to replace interpreted code with compiled code, or to talk to other languages. I think things could still be a lot more flexible, but there is, for example, the swig tool which lets you speak to C from a large number of script languages and is already a big step in this direction.
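
Finding the “hot spots” before rewriting anything in C is something the Python standard library can already help with. Here is a minimal sketch using `cProfile` and `pstats`; the functions `slow_inner`, `cheap_setup` and `run` are made-up stand-ins for a real learning algorithm, and the exact timings in the report will differ from machine to machine.

```python
import cProfile
import io
import pstats


def slow_inner(n):
    """A deliberately loop-heavy function: the intended hot spot."""
    total = 0.0
    for i in range(n):
        total += i * i
    return total


def cheap_setup():
    """A cheap step that should barely show up in the profile."""
    return list(range(10))


def run():
    cheap_setup()
    return slow_inner(200_000)


def profile_report(func, top=5):
    """Profile func() and return the `top` entries by cumulative time as text."""
    profiler = cProfile.Profile()
    profiler.runcall(func)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(top)
    return buf.getvalue()


if __name__ == "__main__":
    print(profile_report(run))
```

In the report, `slow_inner` should account for nearly all of the time, which marks it as the candidate for a compiled reimplementation.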

  19. Here’s our variant of your 1) for Java. For dealing with large data, the nio package is essential. In parsing and information extraction we use perceptron-like algorithms running over training sets that after preprocessing may easily exceed the 12GB of memory in our typical servers (Opteron-based). We prepare the data, save it on disk, and we read it back k instances at a time for appropriate k. With careful encoding of the instances, nio, and standard buffered read-ahead in Linux, we can keep the processors fully employed.
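
The “read it back k instances at a time” pattern can be sketched in a few lines. The comment describes a Java implementation using nio; the version below is a Python stand-in, not that code, and the record layout (a 4-byte integer label followed by `dim` doubles), the file name and all function names are assumptions made for the illustration.

```python
import os
import struct
import tempfile


def write_instances(path, instances, dim):
    """Write (label, features) pairs as fixed-width binary records."""
    record = struct.Struct(f"<i{dim}d")  # int label, then dim doubles
    with open(path, "wb") as f:
        for label, features in instances:
            f.write(record.pack(label, *features))


def read_batches(path, dim, k):
    """Yield lists of at most k (label, features) pairs, one batch at a time."""
    record = struct.Struct(f"<i{dim}d")
    with open(path, "rb") as f:
        while True:
            block = f.read(record.size * k)
            if not block:
                break
            yield [(row[0], row[1:]) for row in record.iter_unpack(block)]


def perceptron_pass(path, dim, k, weights):
    """One perceptron epoch over the file; labels are -1 or +1.

    Updates `weights` in place and returns the number of mistakes.
    """
    mistakes = 0
    for batch in read_batches(path, dim, k):
        for label, x in batch:
            score = sum(w * xi for w, xi in zip(weights, x))
            if label * score <= 0:
                for j in range(dim):
                    weights[j] += label * x[j]
                mistakes += 1
    return mistakes


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.bin")
    data = [(1, (1.0, 0.0)), (-1, (0.0, 1.0)), (1, (2.0, 0.5))]
    write_instances(path, data, dim=2)
    batch_sizes = [len(b) for b in read_batches(path, dim=2, k=2)]
    weights = [0.0, 0.0]
    mistakes = perceptron_pass(path, dim=2, k=2, weights=weights)
print(batch_sizes, mistakes, weights)
```

Only one batch is held in memory at a time, so the training set can be far larger than RAM as long as a single batch fits.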

  20. ingo says:

    @mikiobraun: While I concur with your statement that there is not the language, I have some comments with regard to the “interpreted early on, glue to low-level lang for performance” comments.

    Firstly, managed environments that support reflection, such as the Java Virtual Machine (JVM) or the .Net Common Language Infrastructure (CLI), make glue unnecessary. This means, for example, that I can access instantly any Java class ever written from within JPython without any intermediate steps such as wrapping. This is clearly a major step forward and makes use of multiple languages much easier.

    Secondly, the combination of high- and low-level language has quite a few implications. High-level languages often afford different coding styles (not syntax but data structure use and so on) that can translate badly to lower-level languages. I have often found re-implementation from scratch to result both in more efficient code and in faster implementation. From the Matlab extensions my colleagues have written, it appears the same is true there.

    What this means, imho, is that sometimes “optimizing hotspots” can result in quite a bit of re-architecting and re-writing as one encounters limitations of the languages. For me, this has reduced the use of interpreted languages for algorithm implementations. I find that I am just as fast implementing directly in (e.g.) Java, when good supporting libraries are available and the development tools are helpful.

    This does not mean that interpreted languages are not useful, but I use them mostly for a) one-off experiments, b) prototypes that are scrapped and re-implemented later on and c) combining algorithms into a running system. In fact, item c) is where I think the largest value of dynamic languages can be found.

  21. mikiobraun says:

    Hello ingo,

    you raised a number of interesting points. Concerning the CLI, you are right that the “glueing” is implicit if every language is already implemented for the CLI, and the number of such languages has certainly increased in recent years. However, I think that the scope and potential of “glueing” is broader than just being able to talk to some other language, because this does not change the problem of differing interfaces and changes in design philosophy.

    For example, assume that you are coding some things in Java and there is some very helpful functionality implemented in Python. You can access the objects, classes and functions via the CLI, but you will have to adapt the code to the interfaces of the Python module, which means that you will have to deal with objects like lists, arrays, or matrices differently from how you would deal with Java implementations of these data types. This means that you can use code written in another language, but you have to put some work into it to really be able to talk to that code. But abstractly, since the underlying concepts are the same (lists, arrays, and matrices), it should in principle be possible to automate the process of adaptation as well, leading to truly seamless glueing. (Of course, I’m not saying that this is easy, or trivial to get right and fast, I’m just talking about how things could be in an ideal world. ;) )

    Concerning the need to re-write things quite extensively once you move from a high-level language to a low-level language, I totally agree that this is anything but trivial and often requires also changing the design significantly because certain programming language concepts are not available. However, I think that this is inevitable, can be acceptable, and can still pay off performance-wise. I think it is inevitable, because different languages support different programming styles best. It’s also acceptable, because the goal is also different when you are actually optimizing for performance, compared to writing high-level code. And finally, I think even given technology like just-in-time compilation, you might still get out a factor of two if you switch to a language like C, which might or might not be necessary.

    Again, I’m not even sure others will have had the same experiences, because everything depends a lot on the type of data, application, and scale you are dealing with. From my personal experience, I’ve found that different languages have different strengths, and I just would like to see more automated support for moving between these different levels, but as always, your mileage may vary.

  22. DrewBagnell says:

    Hi Fernando,

    I find myself stuck in C++ land with a bit of Matlab for visualization/simple stuff. I can’t stand the verboseness you point out about Java, but it does seem to have some significant advantages in terms of GC, portability, great libraries, etc… What do you do for linear algebra code these days? Even in C++ this is pretty weak, but it’s very easy to bail out to C. Do you just use JNI, or is there a linear algebra package you’re fond of?

    How about visualization? When I spent a lot of time in ML, I shelled out to Matlab for graphics (which is rather unfortunate). Is there a nice graph making utility in Java? Including for 3D plots?

    Drew

  23. Paul Tulloch says:

    Just a quick point on this discussion.

    I am currently in the midst of writing a research paper on the economics of innovation and our journey towards a Knoweldge Based Economy in which I have been focusing a lot on the issues of machine learning. I fundementally envision a future in which the issue you are addressing here is one of the fundemental hurdles we are facing with regards to advancing ML along a path towards a fundemental embedding within the production process of a large chunk of our value adding productive capacity. Machine learning and it’s role in our future is currently only held up by our imagination with a special enphasis on design in the big. I believe we are currently in the infancy of these ML technologies and resulting transformations and when we start approaching some convergencies within a select group of technologies we will build up ML in ways that the above discussion will only be a mere signpost for the future to look back upon with the gaze of wonderment. We will need within this convergency to develop a programming language that will far surpass the languages that we have now with specificity on ML design at its core. I believe we will also see the design of that language spring boarding off some newly designed hardware, specifically for ML. This of course is raely spoke of but it will be soon as many of the current scalability problems catch up with up Buzz being creating in cultural spaces outside of the hard core ML literature. This is being fueled by two general dynamics. One is the increasingly successful applications of ML in practical settings, and the associated cost savings of these systems. It is generally, I would say, still not well understood within the practical space of datawarehouses, commercial interests, and IT. However it is getting more air play on a daily basis. This will meet up against the limitations in scalability problems for many of the ML algorithms, espiecally some of those showing the most promise such as SVMs. 
There will be a long look by the design in the “big” creative spaces which will be fundamentally motivated by the cost savings. These roadblocks in scalability are both hardware and software related and there will be a new paradigm developed. We are merely playing with toys currently, if we ever get to the future. A further aspect of the convergence I am speaking of will be in the functioanlity of ML. That is, it will need to become deskilled. Through the notion of what I have called the design of “Smarter Machines” and Human Computer Interactionist schools, we will see the deployment of ML based information systems that the averge knowledge worker of the future will be able to implement and maintain with ease.

    These are just visions that I see with my economist/information specialist eyes. I am hoping to write a fairly lengthy article on the coming convergence, with the aim of outlining the above ideas and expanding on them. I am not sure who reads this listing, but if you can see that vision and would like to help formalize these notions and constructs, I would be happy to discuss and include your comments in my work.

    Email me with your thoughts.

    Paul

  24. mithrond says:

    http://aclweb.org/aclwiki/index.php?title=List_of_NLP/CL_courses
    it’s only approximate:)

    59 ['perl']
    56 ['prolog']
    42 ['java']
    27 ['python']
    25 ['c++']
    21 ['lisp']
    17 ['c']
    6 ['student', 'choice']
    2 ['foxpro', 'oz', 'matlab']
    1 ['mozart', 'tdl', 'languages', 'awk']

  25. Will Dwinnell says:

    MATLAB is my tool of choice. Reasons?

    One reason MATLAB is so convenient for this sort of work is that arrays are primitive data types in MATLAB and linear algebra is naturally implemented. Creating neural networks in MATLAB is exceedingly easy, even without the Neural Network Toolbox. To me, the following code, which fires a feedforward neural network, typifies why MATLAB is such a natural choice for this work:

    HiddenLayer = tanh(InputLayer * HiddenWeights);
    OutputLayer = tanh(HiddenLayer * OutputWeights);

    Goodbye, loops! Note that the above code is for all exemplars, not just one!
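
    For readers who want to see which loops the two MATLAB lines replace, here is a rough sketch of the same forward pass written with explicit loops in plain Python. It is only an illustration, not the commenter's code: the helper names (layer, forward) and the toy weights are made up, and it assumes, as in the MATLAB snippet, that each row of the input matrix is one exemplar and that there are no bias terms.

```python
import math


def layer(inputs, weights):
    """Compute tanh(inputs * weights) with explicit loops.

    inputs:  list of rows, one row per exemplar (n_exemplars x n_in)
    weights: list of rows (n_in x n_out)
    """
    n_in = len(weights)
    n_out = len(weights[0])
    outputs = []
    for row in inputs:                  # loop over exemplars
        out_row = []
        for j in range(n_out):          # loop over units in this layer
            total = 0.0
            for i in range(n_in):       # loop over incoming connections
                total += row[i] * weights[i][j]
            out_row.append(math.tanh(total))
        outputs.append(out_row)
    return outputs


def forward(input_layer, hidden_weights, output_weights):
    """Two-layer feedforward pass, mirroring the two MATLAB lines."""
    hidden_layer = layer(input_layer, hidden_weights)
    return layer(hidden_layer, output_weights)


# Hypothetical toy data: 2 exemplars, 2 inputs, 2 hidden units, 1 output.
input_layer = [[1.0, 0.0], [0.0, 1.0]]
hidden_weights = [[0.5, -0.5], [0.25, 0.75]]
output_weights = [[1.0], [-1.0]]
output_layer = forward(input_layer, hidden_weights, output_weights)
```

    Three nested loops per layer collapse into a single matrix product in the array notation, which is the point of the "Goodbye, loops!" remark.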

    I use pure MATLAB (no “speeded up” parts written in other languages) in my data mining work at a bank. My most recent modeling project involved about 200,000 observations and over 100 candidate predictors.

    Caveats: 1. I tend to load all data to RAM. 2. MATLAB has an unfair reputation for being slow. 3. Still, it’s always nice to have heavy-duty hardware. See the following for more of my thoughts on the matter:

    -Will Dwinnell
    http://will.dwinnell.com

  26. Nathan Neitzke says:

    I am currently working on a Machine Learning suite in .NET. The advantages are that it is very fast (equivalent to C++ in the very performance-critical areas) and provides a multitude of language options. For example, I write code in C#, C++, F#, and IronPython (Python on .NET) that all works together.

    Most interesting to note is F#. It is basically a functional programming language (with some imperative constructs added for good interop) built on top of .NET, and it is very powerful. One of the largest benefits is having access to the massive .NET library, and especially Managed DirectX for visualization. It was developed at Microsoft Research and is freely available. Here are some links –

    F# – http://research.microsoft.com/fsharp/fsharp.aspx
    F# Visualization Demonstration – http://channel9.msdn.com/Showpost.aspx?postid=234889
    MS Research Machine Learning & F# –
    http://channel9.msdn.com/Showpost.aspx?postid=237064

    • Yin Zhu says:

      Numerical algorithms in a managed language simply cannot catch up with C/C++ in speed.

      In my experience, .NET managed code is 2 to 4 times slower than C/C++, although the performance difference between C# and F# is not significant.

  27. lh says:

    Hi mikiobraun. I find that your comments are sensible and based on real experience. I totally agree with your several points:

    (1) In terms of speed, few (if any) programming languages can beat C.
    (2) At the early stage, use MATLAB to test the ideas. When the ideas work well, rewrite the MATLAB code in C to produce an application program.

    What I want to add:

    (1) C is a really great language in terms of speed as well as expressiveness. I can even say C is a beautiful language. An expert in C will find it easy to do memory management efficiently. Remember, most operating systems are written in C. It seems to me I don’t need other “fancier OOP” languages such as C++, Java, or Python. I am not a fan of OOP languages because their reference books are too heavy!

    (2) GSL (GNU Scientific Library) provides a good set of functions for matrix operations, numerical analysis, etc.

  28. [...] Choice of Language. There are many arguments about the choice of language. Sometimes you don’t have a choice when interfacing with other people. Personally, I favor C/C++ when I want to write fast code. This (admittedly) makes me a slower programmer than when using higher level languages. (Sometimes I prototype in Ocaml.) Choosing the wrong language can result in large slowdowns. [...]

  29. Symbolware says:

    I favor C#. It’s very fast, while having many of the strengths of Java. Besides, .NET is becoming a cross-platform technology, e.g. Silverlight by Microsoft.

  30. Jon Harrop says:

    You should value correctness. For example, one comment claims that C++ is faster than OCaml but cites only C++ code, not OCaml code, and that C++ code just segfaults on my machine. Allegedly faster but wrong isn’t very useful…

  31. Abhishek Ghose says:

    Python with numpy seems to nicely combine ease-of-use with speed.

  32. [...] Code Search: http://www.google.com/codesearch http://www.jstatsoft.org/ http://hunch.net/?p=230 [...]

  33. Ted Sandler says:

    This thread combines two of my great loves – ML and PL. I’m somewhat surprised that there hasn’t been more work on designing languages that hit the sweet spot for numerical computing. The PL folks design wonderful languages for their own purposes — theorem proving and compiler writing. It would be great if someone focused on creating an ML-style language designed from the ground up for fast numerical computing. I heard some time back that Guy Steele was working on an engineering-oriented computer language called “Fortress”. I wonder what happened to that project. I also wonder if anyone here has used the new “Go” language and if they have any opinions about whether it fixes some of the warts that make programming in C/C++ unattractive.

  34. vlad says:

    I find the comments above interesting; however, the choices and options would in my view be rated ‘C-’ in terms of being future proof.

    a) The comments about C++ not being cross platform or requiring memory management are based on the C++ of 15 years ago. Modern C++ is a combination of the C++ standard language, the Boost library and the Qt library. The three together allow sophisticated cross-platform, memory-management-free programming models that include various types of (still primitive) threading models, matrix algebra (BLAS is supported within the above), visualization, graphs and other basic tasks.

    b) The direction, however, is to use massively parallel programming models, either on the same host (using CPU/GPU) or, better yet, horizontally scalable across multiple machines.

    The value of pure functional languages (Haskell, Clojure) is that they force ‘stateless programming’, which in turn allows for actor-model or message-based concurrency. So having them define the programming model is definitely appropriate.
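
    As a rough illustration of why statelessness helps here, the sketch below maps a pure function over independent shards of data and combines the partial results. It is written in Python rather than Haskell or Clojure, the function names are made up for the example, and threads are used only to keep it self-contained: under CPython's global interpreter lock, CPU-bound work would need processes or separate hosts to actually run in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum_of_squares(shard):
    """Pure function: the result depends only on the argument; nothing is mutated."""
    return sum(x * x for x in shard)


def parallel_sum_of_squares(data, n_workers=4):
    # Split the data into independent shards; each shard plays the role of a message.
    shards = [data[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Workers share no state, so no locks are needed.
        partials = list(pool.map(partial_sum_of_squares, shards))
    # Combining the partial results is the only step that sees more than one shard.
    return sum(partials)


total = parallel_sum_of_squares(list(range(1000)))
```

    Because the worker function touches nothing outside its argument, the same structure carries over to message passing between processes or machines.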

    Unfortunately the VM-based languages mostly
    a) are not compatible with the GPU programming model (OpenCL)
    b) do not provide any facilities for multi-host programming models (share nothing or shared memory)

    C++ Boost is integrated with MPI, which helps some, but still.
    Erlang in my view would have had a much better future in this space if not for the slow numerics.

    Perhaps at least the multi-host programming model will be solved by operating systems designed for SSI (Single System Image model). DragonFly BSD is going in that direction, but it is at a very early stage.

    So basically for now I would stick with C++ Boost Qt

  35. ErikB says:

    Sorry, but that post is really disappointing and lacks any new information, even for me. And I am a guy who has so little knowledge of ML that I wouldn’t even call myself a beginner.

    I am really disappointed, because the headline is of a very deep interest to me, so I want to name some things that really are lacking here:
    – even I know that Matlab is the language used very often for at least prototyping mathematical models for computational problems. The same is true for R and ML problems. That you don’t know both means that you are probably really not very good at ML. Mind that I don’t say you aren’t. I just say the chance is very, very low that you are, having the information that you don’t know R and Matlab, knowing that P(knows-R|ML-expert) and P(knows-Matlab|ML-expert) are both very high.

    – most problems you state are not specific to ML. In fact, issues like speed are a problem in all domains, which is also the reason that all domains use more and more C when the complexity of a problem grows to the limits of today’s machines. Aside from the direct drawbacks of this issue, it is also an additional sign that you don’t know much about ML and probably also not very much about programming at all.

    – you don’t state a source (or even an argument) for the assumption that the frameworks you named are in fact the most often used frameworks in ML. In combination with the first point I made, this implies a good chance that your statement is wrong at least in some details.

    – your arguments really lack any depth at all. You could have analysed the languages and frameworks according to how well they are able to model a given typical ML problem, e.g. speech recognition. You could also have analysed how often each language and framework is used in papers on topic X (isn’t that an ML task itself, “given all papers on topic X, which of the given frameworks is used most often”?) and given us a real overview, concluding from data, not from opinion (which is another mistake often made by novices, especially in a data-driven domain like ML). These are all just examples of how to put some depth behind an argument or observation.

    – there is no clue of why you thought you need to write down this statement now, who you are and what you did to get to that point

    Do you see, why I don’t get any new information out of this post or why I can’t learn anything from your arguments and the given list of frameworks?
    I really hope you have the time and energy to improve this article. As I said, the topic is of deep interest to me and probably to many other people interested in learning about ML.

    • Steve says:

      This is an awesome comment.

      “””That you don’t know both means that you are probably really not very good at ML. Mind that I don’t say you aren’t”””

      Are you sure you know who the author of this blog is?

    • Anonymous says:

      fyi, you sound like a douchebag. Mind that I don’t say that you are, just that the chances are very, very high.

  36. Hugo Penedones says:

    To me, the ideal is having a library in which the computationally intensive algorithms are implemented in an efficient low-level language, with an interface in a simple scripting language.
    A good example of this is torch5: the API is in lua, but the algorithms are written in C.

    You get the best of both worlds: “Speed” and “Programming Ease”.

  37. rodrigob says:

    I agree that there is a need for a better (open source) scientific computing platform. Python + C/C++ or Lua + C/C++ are OK, but not good enough.

    Flying Frog tried to tackle this problem by creating a new VM specifically tailored.

    http://www.ffconsultancy.com/ocaml/hlvm

    The idea seemed quite interesting; however, the project seems to have gone cold. Since it is open source, I wonder if people would be interested in reviving it.

    The benefit of having a VM is that it allows using multiple syntaxes (languages) for different parts of the project (a la .NET).

  38. There is a relatively new open source project that’s loosely affiliated with MIT, developing a new programming language for scientific and numerical computing (machine learning is very much within the scope), called Julia. The motivation for creating the language is very well summed up in this blog post and the responses to it. Here’s the web page in case anyone wants to check it out: http://julialang.org/.

    To describe Julia very briefly, it is a bit like a Lisp with math-friendly Matlab-like syntax, including the ability to do high-level linear algebra trivially, really great C interoperability for using things like LAPACK or FFTW, and native performance that’s equal to or close to C — no need to vectorize all your code for speed like in Matlab, NumPy or R — you can just write a for loop and it will be fast and memory efficient.

  39. [...] Langford has a nice writeup discussing various aspects important to the choice of language for machine learning. Like [...]

  40. Ben Racine says:

    I predict Julia will become the dominant name in this field some day.

  41. Newbie says:

    How about those programming languages that compile as they run? They offer the syntactic ease of a language like Python, Ruby, or OCaml, and can run at almost the speed of languages like C. Would they be good to use for AI?
