Shogun - A Large Scale Machine Learning Toolbox
This is the official homepage of the SHOGUN machine learning toolbox.
The toolbox focuses on large scale kernel methods and especially on Support Vector Machines (SVM) [1]. It provides a generic SVM object interfacing to several different SVM implementations, among them the state of the art OCAS [21], Liblinear [20], LibSVM [2], SVMLight [3], SVMLin [4] and GPDT [5]. Each of the SVMs can be combined with a variety of kernels. The toolbox not only provides efficient implementations of the most common kernels, like the Linear, Polynomial, Gaussian and Sigmoid Kernel, but also comes with a number of recent string kernels such as the Locality Improved [6], Fisher [7], TOP [8], Spectrum [9] and Weighted Degree Kernel (with shifts) [10] [11] [12]. For the latter the efficient LINADD [12] optimizations are implemented. For linear SVMs the COFFIN framework [22] [23] allows feature spaces to be computed on demand, on the fly, and even allows sparse, dense and other data types to be mixed.
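To make the string kernels mentioned above more concrete, here is a minimal sketch of a spectrum kernel [9] in plain Python. It is not SHOGUN's implementation and does not use the SHOGUN API; the names kmer_counts and spectrum_kernel are hypothetical. It assumes the usual definition, in which the kernel value is the inner product of the k-mer count vectors of the two strings.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every contiguous substring of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def spectrum_kernel(s, t, k=3):
    """Inner product of the k-mer count vectors of s and t (illustrative only)."""
    counts_s = kmer_counts(s, k)
    counts_t = kmer_counts(t, k)
    # Only k-mers that occur in both strings contribute to the sum;
    # a Counter returns 0 for k-mers it has never seen.
    return sum(count * counts_t[kmer] for kmer, count in counts_s.items())


if __name__ == "__main__":
    # The two sequences share the 3-mers ATT, TTA and TAC.
    print(spectrum_kernel("GATTACA", "ATTACCA", k=3))
```

Counting k-mers once per string keeps the cost linear in the string lengths, which is the property that makes this family of kernels attractive for long sequences.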
Furthermore, SHOGUN offers the freedom of working with custom pre-computed kernels. One of its key features is the combined kernel, which can be constructed as a weighted linear combination of a number of sub-kernels, each of which need not work on the same domain. An optimal sub-kernel weighting can be learned using Multiple Kernel Learning [13] [14] [18] [19].
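The combined kernel described above can be sketched in a few lines of plain Python. This is an illustration of the idea only, not SHOGUN's API: the helper names (gaussian_kernel, bigram_kernel, combined_kernel) are hypothetical, the Gaussian parametrization is one common choice, and the weights are fixed by hand rather than learned by Multiple Kernel Learning. Each example is a tuple holding one representation per sub-kernel, so the sub-kernels can work on different domains.

```python
import math
from collections import Counter


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel on real-valued vectors (one common parametrization)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def bigram_kernel(s, t):
    """Toy string kernel: inner product of the 2-mer counts of s and t."""
    counts_s = Counter(s[i:i + 2] for i in range(len(s) - 1))
    counts_t = Counter(t[i:i + 2] for i in range(len(t) - 1))
    return float(sum(c * counts_t[gram] for gram, c in counts_s.items()))


def combined_kernel(sub_kernels, weights):
    """Build k(x, y) = sum_i weights[i] * sub_kernels[i](x[i], y[i])."""
    if len(sub_kernels) != len(weights):
        raise ValueError("need exactly one weight per sub-kernel")
    if any(w < 0 for w in weights):
        # Non-negative weights keep the combination a valid kernel.
        raise ValueError("weights must be non-negative")

    def kernel(x_parts, y_parts):
        # x_parts[i] and y_parts[i] are the inputs for sub-kernel i.
        return sum(w * k(xp, yp)
                   for w, k, xp, yp in zip(weights, sub_kernels, x_parts, y_parts))

    return kernel


if __name__ == "__main__":
    k = combined_kernel([gaussian_kernel, bigram_kernel], [0.5, 0.5])
    x = ([1.0, 0.0], "ACGT")   # (dense vector, string) for one example
    y = ([1.0, 0.0], "ACGA")
    print(k(x, y))
```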
Currently, SVM one-class, two-class and multiclass classification as well as regression problems can be dealt with. SHOGUN also implements a number of linear methods like Linear Discriminant Analysis (LDA), Linear Programming Machine (LPM) and (Kernel) Perceptrons, and features algorithms to train hidden Markov models.
The input feature objects can be dense, sparse or strings, of type int/short/double/char, and can be converted into different feature types. Chains of preprocessors (e.g. subtracting the mean) can be attached to each feature object, allowing for on-the-fly pre-processing.
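A preprocessor chain of this kind can be sketched as follows. The classes below (SubtractMean, NormalizeLength, DenseFeatures) are hypothetical stand-ins written in plain Python and are not SHOGUN classes; the sketch only shows the idea that the raw data stays untouched while the chain is applied each time a vector is requested.

```python
import math


class SubtractMean:
    """Centers each dimension using the mean computed in fit()."""

    def fit(self, vectors):
        n = len(vectors)
        self.mean = [sum(column) / n for column in zip(*vectors)]

    def apply(self, vector):
        return [v - m for v, m in zip(vector, self.mean)]


class NormalizeLength:
    """Scales a vector to unit Euclidean length (zero vectors are kept)."""

    def fit(self, vectors):
        pass  # nothing to estimate

    def apply(self, vector):
        norm = math.sqrt(sum(v * v for v in vector))
        return list(vector) if norm == 0.0 else [v / norm for v in vector]


class DenseFeatures:
    """Holds raw vectors and applies the attached chain on access."""

    def __init__(self, vectors):
        self._raw = [list(v) for v in vectors]
        self._chain = []

    def add_preprocessor(self, preprocessor):
        # Fit on the data as it looks after the preprocessors attached so far.
        preprocessor.fit([self.get_vector(i) for i in range(len(self._raw))])
        self._chain.append(preprocessor)

    def get_vector(self, index):
        vector = list(self._raw[index])
        for preprocessor in self._chain:
            vector = preprocessor.apply(vector)
        return vector


if __name__ == "__main__":
    features = DenseFeatures([[1.0, 2.0], [3.0, 6.0]])
    features.add_preprocessor(SubtractMean())
    features.add_preprocessor(NormalizeLength())
    print(features.get_vector(0))
```

Applying the chain lazily trades some repeated computation for memory: no pre-processed copy of the data set has to be stored.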
SHOGUN is implemented in C++ and interfaces to Matlab(tm), R, Octave and Python. It is proudly released as Machine Learning Open Source Software.
We took part in Google Summer of Code 2011
Thanks to the work of 5 hard-working and talented students, we now have various new features implemented in shogun:
Interfaces to new languages like Java, C#, Ruby and Lua, written by Baozeng
A model selection framework, written by Heiko Strathmann
Many dimension reduction techniques, written by Sergey Lisitsyn
Gaussian Mixture Model estimation, written by Alesis Novik
A full-fledged online learning framework, developed by Shashwat Lal Das
All of this work has been integrated in shogun 1.0.0.
In case you want to know more about shogun, check out the documentation and read our overview paper:
Soeren Sonnenburg, Gunnar Raetsch, Sebastian Henschel, Christian Widmer, Jonas Behr, Alexander Zien, Fabio de Bona, Alexander Binder, Christian Gehl, and Vojtech Franc. The SHOGUN Machine Learning Toolbox. Journal of Machine Learning Research, 11:1799-1802, June 2010.
Screenshots
As everyone likes screenshots, we have produced one for each interface:
SHOGUN with Octave, Matlab, Python and R. Click on the link for higher resolution
images.
Getting Started
Applications
We have successfully used this toolbox to tackle the following sequence analysis problems: Protein Super Family classification, Splice Site Prediction [10] [15] [16], Interpreting the SVM Classifier [13] [14], Splice Form Prediction [10], Alternative Splicing [11] and Promoter Prediction [17]. Some of them come with no less than 10 million training examples, others with 7 billion test examples. A graphical example is handwritten digit recognition.
Licensing Information
Except for SVMLight, which is (C) Torsten Joachims and follows a different licensing scheme (cf. LICENSE.SVMLight in the tar archive), SHOGUN is licensed under the GPL version 3 or any later version (cf. LICENSE).
Cite us
If you use SHOGUN in your research you are kindly asked to cite the following paper:
Soeren Sonnenburg, Gunnar Raetsch, Sebastian Henschel, Christian Widmer, Jonas Behr, Alexander Zien,
Fabio de Bona, Alexander Binder, Christian Gehl, and Vojtech Franc.
The SHOGUN Machine Learning Toolbox. Journal of Machine Learning Research, 11:1799-1802, June 2010.
Download Releases
SHOGUN Version 2.1.0 (lib 13.0, data 0.5, param 1) (updated 17.03.2013)
Source Code (ftp, http)
Source Code md5sum (ftp, http)
Source Code PGP Signature (ftp, http)
Data (ftp, http)
Data md5sum (ftp, http)
Older Versions
This release also contains several enhancements, cleanups and bugfixes:
Features:
Linear Time MMD two-sample test now works on streaming features, which allows tests to be performed on data streams of unbounded length. A block size may be specified for fast processing. The features below were also added. By Heiko Strathmann.
It is now possible to ask streaming features to produce an instance of streamed features that are stored in memory and returned as a CFeatures* object of corresponding type. See CStreamingFeatures::get_streamed_features().
New concept of artificial data generator classes: Based on streaming features. First implemented instances are CMeanShiftDataGenerator and CGaussianBlobsDataGenerator. Use above new concepts to get non-streaming data if desired.
Accelerated projected gradient multiclass logistic regression classifier by Sergey Lisitsyn.
New CCSOSVM based structured output solver by Viktor Gal.
A collection of kernel selection methods for MMD-based kernel two-sample tests, including optimal kernel choice for single and combined kernels for the linear time MMD. This finishes the kernel MMD framework and also comes with new, more illustrative examples and tests. By Heiko Strathmann.
Alpha version of Perl modular interface developed by Christian Montanari.
New framework for unit-tests based on googletest and googlemock by Viktor Gal. A (growing) number of unit-tests from now on ensures basic functionality of our framework. Since the examples do not have to take this role anymore, they should become more illustrative in the future.
Changed the core of dimension reduction algorithms to the Tapkee library.
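The linear time MMD statistic listed among the features above lends itself to streaming because it only ever looks at two consecutive samples from each stream. The sketch below is plain Python and does not use CLinearTimeMMD or any other SHOGUN class; the function names are hypothetical, scalar samples and a Gaussian kernel are assumed, and only the statistic itself is computed (no test threshold, no block processing). It averages h = k(x1, x2) + k(y1, y2) - k(x1, y2) - k(x2, y1) over non-overlapping pairs.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel on scalars (assumed here for simplicity)."""
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))


def linear_time_mmd(stream_x, stream_y, kernel=gaussian_kernel):
    """Linear time estimate of the squared MMD between two sample streams.

    Consumes two samples from each stream per step, so memory use is
    constant regardless of how long the streams are.
    """
    xs, ys = iter(stream_x), iter(stream_y)
    total, steps = 0.0, 0
    while True:
        try:
            x1, x2 = next(xs), next(xs)
            y1, y2 = next(ys), next(ys)
        except StopIteration:
            break  # one of the streams cannot supply another full pair
        total += kernel(x1, x2) + kernel(y1, y2) - kernel(x1, y2) - kernel(x2, y1)
        steps += 1
    if steps == 0:
        raise ValueError("each stream must provide at least two samples")
    return total / steps


if __name__ == "__main__":
    print(linear_time_mmd([0.0, 0.1, 0.2, 0.3], [5.0, 5.1, 5.2, 5.3]))
```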
Bugfixes:
Fix for shallow copy of Gaussian kernel by Matt Aasted.
Fixed a bug when using StringFeatures along with kernel machines in cross-validation which caused an assertion error. Thanks to Eric (yoo)!
Fix for 3-class case training of MulticlassLibSVM reported by Arya Iranmehr that was suggested by Oksana Bayda.
Fix for wrong Spectrum mismatch RBF construction in static interfaces reported by Nona Kermani.
Fix for wrong include in SGMatrix causing build fail on Mac OS X (thanks to @bianjiang).
Fixed a bug that caused kernel machines to return nonsense when using custom kernel matrices with subsets attached to them.
Fix for parameter dictionary creation causing dereferencing of null pointers with Gaussian processes parameter selection.
Fixed a bug in exact GP regression that caused wrong results.
Fixed a bug in exact GP regression that produced memory errors/crashes.
Fix for a bug with static interfaces causing all outputs to be -1/+1 instead of real scores (reported by Kamikawa Masahisa).
Cleanup and API Changes:
SGStringList is now based on SGReferencedData.
"confidences" in context of CLabel and subclasses are now "values".
CLinearTimeMMD constructor changes, only streaming features allowed.
CDataGenerator will soon be removed and replaced by new streaming-based classes.
SGVector, SGMatrix, SGSparseVector, SGSparseMatrix refactoring: they now contain load/save routines and relevant functions from CMath, and implementations went to .cpp files.
Documentation and Examples
We use Doxygen for both user and developer documentation, which may be read online. More than 600 documented examples for the interfaces python_modular, octave_modular, r_modular, static python, static matlab and octave, static r, static command line and C++ libshogun developer interface can be found in the online documentation. In addition, examples are shipped in the examples/(un)documented/[interface] directory in the source code (where interface is one of r, octave, matlab, python, python_modular, r_modular, octave_modular, cmdline, libshogun).
Installation
Screenshots
Tutorial
Examples
Implemented Methods
Interfaces
Frequently Asked Questions
Authors
License
Chinese: Installation, Screenshots, User Guide, Examples, Machine Learning Methods, Interfaces, Frequently Asked Questions, Authors, License
Note that the documentation for python_modular is the most complete, and that Python's help function will show the documentation when working interactively:
$ python
Python 2.4.4 (#2, Jan 3 2008, 13:36:28)
[GCC 4.2.3 20071123 (prerelease) (Debian 4.2.2-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from shogun.Classifier import SVM
>>> help(SVM)

class SVM(CSVM)
| Method resolution order:
| SVM
| CSVM
| CKernelMachine
| Classifier
| SGObject
| __builtin__.object
| Methods defined here:
| __init__(self, kernel, alphas, support_vectors, b)
[...]
Below we provide some of the (by now outdated) examples that were used to carry out experiments for a number of publications. Note that more than 600 examples and updated versions of all of these can also be found in the source code and in the online documentation.
Click on the corresponding link to see classification and regression examples for Matlab(tm), R, Octave or Python:
R Interface Examples
Octave Interface Examples
Python Interface Examples
Matlab Interface Examples
Bioinformatics examples (for Octave and Matlab) as presented at BOSC 2006:
Spectrum Kernel
Weighted Degree Kernel
Weighted Degree Kernel with Shifts
Multiple Kernel Learning examples (JMLR 2006 paper "Large Scale Multiple Kernel Learning"):
MKL for classifying x-mas stars
MKL for regression
MKL for mixture of sine waves
Publications and Presentations
We have presented shogun on numerous occasions and provide additional material below:
useR!2006 - Slides
ISMB / BOSC 2006 - Slides
useR!2010 - Slides
MLOSS 2010 - Video Lecture, Slides
Europython 2010 - Video Lecture, Slides and Demo
Scipy 2010 - Video Lecture, Slides
JMLR 2010 - Supplementary Material
Bug-Reports, Mailinglist, Planet
In case you find bugs or have feature requests please use the github issue tracker. Check the buildbot for current issues.
Alternatively, use the mailinglist (subscription required) if you have comments, problems or questions:
read the archive: the mailinglist is archived using gmane (shogun-list-archive)
subscribe to the mailinglist by sending an empty message to shogun-list-subscribe@shogun-toolbox.org
unsubscribe from the mailinglist by sending an empty message to shogun-list-unsubscribe@shogun-toolbox.org
post to the mailinglist by sending a message to shogun-list@shogun-toolbox.org
We have set up shogun planet for related blogs and blogs of developers.
IRC and Contact
You can chat with us via IRC. Fire up your IRC client and point it to the IRC channel #shogun at irc.freenode.net. You can also connect via webchat to #shogun directly in your browser. Note that we only recently started this channel (March 2011) and make chat logs available for your convenience.
In case you need to directly get in touch with us, feel free to contact
Soeren Sonnenburg:
Gunnar Raetsch:
Developer Information
Want to contribute? We maintain SHOGUN's source code via git and are looking forward to your patches!
Class Design and Source Code
If you are interested in developing C++ applications with libshogun or want to extend shogun, read the developer tutorial.
Check out some basic examples on how to develop with libshogun.
To browse the most up-to-date source code use
To access the up-to-date source code, clone
git clone git://github.com/shogun-toolbox/shogun.git
To access the most up-to-date data sets, clone
git clone git://github.com/shogun-toolbox/shogun-data.git
We keep mirrors for source code and data also at shogun-toolbox.org:
git clone git://shogun-toolbox.org/shogun.git
git clone git://shogun-toolbox.org/shogun-data.git
Related Projects
shogun
weka
kernlab
dlib
nieme
orange
java-ml
pyML
mlpy
pybrain
torch3
created
1999
1997
04-2004
2006
09-2006
06-2004
08-2008
08-2004
02-2008
10-2008
01-2002
03-2010
01-2010
10-2009
03-2010
03-2009
03-2010
08-2009
01-2009
11-2009
11-2009
11-2004
Main Language
C++
java
C++
C++
python
java
C++; python
python
python
C++
Main Focus
Large Scale Kernel Methods; String Features; SVMs
General Purpose ML Package
Kernel Based Classification/Dimensionality Reduction
Portability; Correctness
Linear Regression; Ranking; Classification
Visual Data Analysis
Feature Selection
Kernel Methods
Basic Algorithms
Reinforcement Learning
Kernel-based Classification
Feature matrix
The pdf document with the machine learning toolbox feature comparison that we originally submitted to JMLR can be found here. An up-to-date version of this matrix is located at Google Spreadsheet. Please notify us about possible corrections and changes.
A comparison of shogun with the popular machine learning toolboxes weka, kernlab, dlib, nieme, orange, java-ml, pyML, mlpy, pybrain, torch3 and scikit-learn. A '?' denotes unknown, a '-' denotes that the feature is missing. This table is available as a google spreadsheet.
feature
shogun
weka
kernlab
dlib
nieme
orange
java-ml
pyML
mlpy
pybrain
torch3
scikit-learn
General Features
Graphical User Interface
One Class Classification
Classification
Multiclass classification
Regression
Structured Output Learning
Pre-Processing
Built-in Model Selection Strategies
Visualization
Test Framework
Large Scale Learning
Semi-supervised Learning
Multitask Learning
Domain Adaptation
Serialization
Parallelized Code
Performance Measures (auROC etc)
Image Processing
Supported Operating Systems
Linux
Windows
Mac OSX
Other Unix
Language Bindings
Python
Matlab
Octave
C/C++
Command Line
Java
C#
Lua
Ruby
SVM Solvers
SVMLight
LibSVM
SVM Ocas
LibLinear
BMRM
LaRank
SVMPegasos
SVM SGD
other
Regression
Kernel Ridge Regression
Support Vector Regression
Gaussian Processes
Relevance Vector Machine
Multiple Kernel Learning
MKL
q-norm MKL
Classifiers
Naive Bayes
Bayesian Networks
Multi Layer Perceptron
RBF Networks
Logistic Regression
LASSO
Decision Trees
k-NN
Linear Classifiers
Linear Programming Machine
LDA
Distributions
Markov Chains
Hidden Markov Models
Kernels
Linear
Gaussian
Polynomial
String Kernels
Sigmoid Kernel
Kernel Normalizer
Feature Selection
Forward
Wrapper methods
Recursive Feature Selection
Missing Features
Mean value imputation
EM-based/model based imputation
Clustering
Hierarchical Clustering
k-means
Optimization
BFGS
conjugate gradient
gradient descent
bindings to CPLEX
bindings to Mosek
bindings to other solver
Supported File Formats
Binary
Arff
HDF5
CSV
libSVM/ SVMLight format
Excel
Supported Data Types
Sparse Data Representation
Dense Matrices
Strings
Support for native (e.g. C) types (char, signed and unsigned int8, int16, int32, int64, float, double, long double)
Acknowledgements
The authors gratefully acknowledge the support of DFG grants MU 987/2-1, MU 987/6-1, RA-1894/1-1 and the PASCAL Network of Excellence.
References
[1]
C.Cortes and V.N. Vapnik. Support-vector networks.
Machine Learning, 20(3):273--297, 1995.
[2]
C.-C. Chang and C.-J. Lin, LIBSVM : a library for support vector machines,
2001. Software available at
[3]
T.Joachims. Making large-scale SVM learning practical. In B.Schoelkopf,
C.J.C. Burges, and A.J. Smola, editors, Advances in Kernel Methods -
Support Vector Learning, pages 169--184, Cambridge, MA, 1999. MIT Press.
[4]
V. Sindhwani, S. S. Keerthi. Large Scale Semi-supervised Linear SVMs. SIGIR, 2006.
[5]
L. Zanni, T. Serafini, G. Zanghirati. Parallel Software
for Training Large Scale Support Vector Machines on
Multiprocessor Systems. JMLR 7(Jul), 1467-1492, 2006.
[6]
A.Zien, G.Raetsch, S.Mika, B.Schoelkopf, T.Lengauer, and K.-R.
Mueller. Engineering Support Vector Machine Kernels That Recognize
Translation Initiation Sites. Bioinformatics, 16(9):799-807, September 2000.
[7]
T.S. Jaakkola and D.Haussler. Exploiting generative models in
discriminative classifiers. In M.S. Kearns, S.A. Solla, and D.A. Cohn,
editors, Advances in Neural Information Processing Systems, volume 11,
pages 487-493, 1999.
[8]
K.Tsuda, M.Kawanabe, G.Raetsch, S.Sonnenburg, and K.R. Mueller.
A new discriminative kernel from probabilistic models.
Neural Computation, 14:2397--2414, 2002.
[9]
C.Leslie, E.Eskin, and W.S. Noble. The spectrum kernel: A string kernel
for SVM protein classification. In R.B. Altman, A.K. Dunker, L.Hunter,
K.Lauderdale, and T.E. Klein, editors, Proceedings of the Pacific
Symposium on Biocomputing, pages 564-575, Kaua'i, Hawaii, 2002.
[10]
G.Raetsch and S.Sonnenburg. Accurate Splice Site Prediction for
Caenorhabditis Elegans, pages 277-298. MIT Press series on Computational
Molecular Biology. MIT Press, 2004.
[11]
G.Raetsch, S.Sonnenburg, and B.Schoelkopf. RASE: recognition of
alternatively spliced exons in C. elegans. Bioinformatics,
21:i369--i377, June 2005.
[12]
S.Sonnenburg, G.Raetsch, and B.Schoelkopf. Large scale genomic sequence
SVM classifiers. In Proceedings of the 22nd International Machine Learning
Conference. ACM Press, 2005.
[13]
S.Sonnenburg, G.Raetsch, and C.Schaefer. Learning interpretable SVMs
for biological sequence classification. In RECOMB 2005, LNBI 3500,
pages 389-407. Springer-Verlag Berlin Heidelberg, 2005.
[14]
G.Raetsch, S.Sonnenburg, and C.Schaefer. Learning Interpretable SVMs
for Biological Sequence Classification. BMC Bioinformatics, Special Issue
from NIPS workshop on New Problems and Methods in Computational Biology
Whistler, Canada, 18 December 2004, 7:(Suppl. 1):S9, March 2006.
[15]
S.Sonnenburg. New methods for splice site recognition. Master's thesis,
Humboldt University, 2002. Supervised by K.-R. Mueller, H.-D. Burkhard and
G.Raetsch.
[16]
S.Sonnenburg, G.Raetsch, A.Jagota, and K.-R. Mueller. New methods for
splice-site recognition. In Proceedings of the International Conference on
Artificial Neural Networks, 2002. Copyright by Springer.
[17]
S.Sonnenburg, A.Zien, and G.Raetsch. ARTS: Accurate Recognition of
Transcription Starts in Human. 2006. (accepted).
[18]
S.Sonnenburg, G.Raetsch, C.Schaefer, and B.Schoelkopf, Large Scale
Multiple Kernel Learning, Journal of Machine Learning Research, 2006,
K.Bennett and E.P.-Hernandez Editors
[19]
M.Kloft, U.Brefeld, S.Sonnenburg, A.Zien, P.Laskov, K.-R. Mueller, Efficient and Accurate Lp-Norm Multiple Kernel Learning,
Advances in Neural Information Processing Systems 21, MIT Press, Cambridge, MA, 2009
[20]
R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A Library for Large Linear Classification, Journal of Machine Learning Research 9(2008), 1871-1874. Software available at
[21]
V. Franc, S. Sonnenburg. Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization, Journal of Machine Learning Research 10(2009), 2157--2192, Software available at
[22]
S. Sonnenburg, V. Franc. COFFIN: A Computational Framework for Linear SVMs,
Research Report, Center for Machine Perception, K13133 FEE Czech Technical University, 2009
[23]
S. Sonnenburg, V. Franc. COFFIN: A Computational Framework for Linear SVMs. Proceedings of the 27th International Machine Learning Conference, 2010.