Most neural network applications are built on one of several open-source implementations. Mature deep learning frameworks are available for a range of programming languages.
Theano is a Python library that lets you define, optimize, and evaluate mathematical expressions, especially ones with multi-dimensional arrays (numpy.ndarray). Using Theano it is possible to attain speeds rivaling hand-crafted C implementations for problems involving large amounts of data. It can also surpass C on a CPU by many orders of magnitude by taking advantage of recent GPUs.
Theano combines aspects of a computer algebra system (CAS) with aspects of an optimizing compiler. It can also generate customized C code for many mathematical operations. This combination of CAS with optimizing compilation is particularly useful for tasks in which complicated mathematical expressions are evaluated repeatedly and evaluation speed is critical. For situations where many different expressions are each evaluated only once, Theano can minimize the amount of compilation and analysis overhead while still providing symbolic features such as automatic differentiation.
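The idea of building an expression once, differentiating it symbolically, and then evaluating it many times can be sketched in a few lines of plain Python. This is only an illustration of the concept, not Theano's API: the classes Const, Var, Add and Mul are hypothetical names introduced here, and no optimization or C code generation is attempted.

```python
# Illustrative sketch only: Const, Var, Add and Mul are hypothetical
# classes, not part of Theano.

class Const:
    def __init__(self, value):
        self.value = value

    def eval(self, env):
        return self.value

    def grad(self, wrt):
        # The derivative of a constant is zero.
        return Const(0.0)


class Var:
    def __init__(self, name):
        self.name = name

    def eval(self, env):
        return env[self.name]

    def grad(self, wrt):
        # d(x)/d(x) = 1, d(y)/d(x) = 0
        return Const(1.0 if self is wrt else 0.0)


class Add:
    def __init__(self, a, b):
        self.a, self.b = a, b

    def eval(self, env):
        return self.a.eval(env) + self.b.eval(env)

    def grad(self, wrt):
        # Sum rule.
        return Add(self.a.grad(wrt), self.b.grad(wrt))


class Mul:
    def __init__(self, a, b):
        self.a, self.b = a, b

    def eval(self, env):
        return self.a.eval(env) * self.b.eval(env)

    def grad(self, wrt):
        # Product rule.
        return Add(Mul(self.a.grad(wrt), self.b), Mul(self.a, self.b.grad(wrt)))


x = Var("x")
f = Add(Mul(x, x), Mul(Const(3.0), x))   # f(x) = x*x + 3*x
df = f.grad(x)                           # derivative graph, built once

f_at_2 = f.eval({"x": 2.0})                                # 10.0
df_values = [df.eval({"x": v}) for v in (0.0, 1.0, 2.0)]   # [3.0, 5.0, 7.0]
```

The derivative graph df is constructed a single time and then reused for every input, which is the pattern that makes a compile step worthwhile when an expression is evaluated repeatedly.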
Theano’s compiler applies many optimizations of varying complexity to these symbolic expressions. These optimizations include, but are not limited to:
- use of GPU for computations
- constant folding
- merging of similar subgraphs, to avoid redundant calculation
- arithmetic simplification (e.g., x*y/x -> y, --x -> x)
- inserting efficient BLAS operations (e.g., GEMM) in a variety of contexts
- using memory aliasing to avoid calculation
- using inplace operations wherever it does not interfere with aliasing
- loop fusion for elementwise sub-expressions
- improvements to numerical stability (e.g., log(1+exp(x)) and log(sum_i exp(x[i])))
- for a complete list, see the Optimizations section of the Theano documentation
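Two of the rewrites listed above, constant folding and arithmetic simplification, can be sketched as a small recursive pass over an expression tree. This is a toy model under stated assumptions, not Theano's optimizer: expressions are represented as nested tuples such as ("mul", "x", "y"), and the function name simplify is hypothetical.

```python
# Toy graph rewriter. Expressions are nested tuples: (op, arg, ...),
# variables are strings, constants are numbers. Not Theano's implementation.

def simplify(expr):
    if not isinstance(expr, tuple):
        return expr  # a variable name or a constant

    op = expr[0]
    args = [simplify(a) for a in expr[1:]]  # simplify children first

    # Constant folding: evaluate nodes whose inputs are all constants.
    if all(isinstance(a, (int, float)) for a in args):
        if op == "add":
            return args[0] + args[1]
        if op == "mul":
            return args[0] * args[1]
        if op == "neg":
            return -args[0]

    # --x -> x
    if op == "neg" and isinstance(args[0], tuple) and args[0][0] == "neg":
        return args[0][1]

    # x*y/x -> y (like the rule in the list, this ignores the x == 0 case)
    if op == "div" and isinstance(args[0], tuple) and args[0][0] == "mul":
        _, left, right = args[0]
        if left == args[1]:
            return right
        if right == args[1]:
            return left

    return (op, *args)


cancelled = simplify(("div", ("mul", "x", "y"), "x"))   # "y"
double_neg = simplify(("neg", ("neg", "x")))            # "x"
folded = simplify(("add", ("mul", 2, 3), "x"))          # ("add", 6, "x")
```

A real optimizer would also merge identical subgraphs and apply its rules repeatedly until nothing changes. The single bottom-up pass here is only meant to show the shape of such rewrites.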
Torch is a scientific computing framework with wide support for machine learning algorithms that puts GPUs first. It is easy to use and efficient, thanks to an easy and fast scripting language, LuaJIT, and an underlying C/CUDA implementation.
A summary of core features:
- a powerful N-dimensional array
- lots of routines for indexing, slicing, transposing, …
- amazing interface to C, via LuaJIT
- linear algebra routines
- neural network and energy-based models
- numeric optimization routines
- fast and efficient GPU support
- embeddable, with ports to iOS and Android backends
At the heart of Torch are the popular neural network and optimization libraries which are simple to use, while having maximum flexibility in implementing complex neural network topologies. You can build arbitrary graphs of neural networks, and parallelize them over CPUs and GPUs in an efficient manner.
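The way small modules compose into a larger network can be illustrated with a container that chains forward passes. The sketch below is written in plain Python rather than Lua and is not Torch's implementation: the Linear, ReLU and Sequential classes are simplified stand-ins defined here, operating on Python lists instead of N-dimensional arrays.

```python
# Simplified stand-ins for composable network modules (illustration only).

class Linear:
    def __init__(self, weights, bias):
        self.weights = weights  # one row of weights per output unit
        self.bias = bias

    def forward(self, x):
        return [
            sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(self.weights, self.bias)
        ]


class ReLU:
    def forward(self, x):
        return [max(0.0, v) for v in x]


class Sequential:
    """Feeds the output of each module into the next one."""

    def __init__(self, *modules):
        self.modules = modules

    def forward(self, x):
        for module in self.modules:
            x = module.forward(x)
        return x


net = Sequential(
    Linear([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]),
    ReLU(),
    Linear([[1.0, 2.0]], [0.1]),
)
output = net.forward([2.0, 1.0])  # one output value, approximately 2.1
```

Because a container is itself a module with a forward method, containers can be nested, which is one way to assemble more complex topologies from the same building blocks.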
Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is developed by Berkeley AI Research (BAIR) and by community contributors. Yangqing Jia created the project during his PhD at UC Berkeley. Caffe is released under the BSD 2-Clause license.
Expressive architecture encourages application and innovation. Models and optimization are defined by configuration without hard-coding. Switch between CPU and GPU by setting a single flag to train on a GPU machine then deploy to commodity clusters or mobile devices.
Extensible code fosters active development. In Caffe’s first year, it was forked by over 1,000 developers and had many significant changes contributed back. Thanks to these contributors, the framework tracks the state of the art in both code and models.
Speed makes Caffe perfect for research experiments and industry deployment. Caffe can process over 60M images per day with a single NVIDIA K40 GPU. That is 1 ms/image for inference and 4 ms/image for learning, and more recent library versions and hardware are faster still. The Caffe developers state that it is among the fastest convnet implementations available.
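The relationship between per-image latency and daily throughput is simple arithmetic, and working it through shows how the quoted figures fit together. The snippet below uses only the numbers given above; the helper name images_per_day is introduced here for illustration.

```python
# Convert per-image latency into images per day, using the figures quoted above.

MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 milliseconds


def images_per_day(ms_per_image):
    return MS_PER_DAY / ms_per_image


inference_per_day = images_per_day(1.0)    # 86,400,000 images at 1 ms/image
training_per_day = images_per_day(4.0)     # 21,600,000 images at 4 ms/image
ms_per_image_at_60m = MS_PER_DAY / 60_000_000   # 1.44 ms/image
```

At exactly 1 ms/image a single GPU would handle 86.4M images per day, so "over 60M images per day" corresponds to roughly 1.44 ms/image, consistent with the inference figure once rounding is taken into account. At 4 ms/image, training throughput comes to about 21.6M images per day.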
Community: Caffe already powers academic research projects, startup prototypes, and even large-scale industrial applications in vision, speech, and multimedia. Its community of "brewers" gathers on the caffe-users group and GitHub.
Apache MXNet is a modern open-source deep learning framework used to train and deploy deep neural networks. It is scalable, allowing for fast model training, and supports a flexible programming model and multiple languages.
MXNet is designed to be distributed on dynamic cloud infrastructure using a distributed parameter server, and can achieve almost linear scaling across multiple GPUs or CPUs.
MXNet supports both imperative and symbolic programming, which makes it easier for developers who are used to imperative programming to get started with deep learning. It also makes it easier to track and debug training, save checkpoints, modify hyperparameters such as the learning rate, or perform early stopping.
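What an imperative style buys in practice is that the training loop is ordinary code, so checkpoints, hyperparameter changes and early stopping are plain statements inside it. The sketch below shows this on a one-parameter toy problem in pure Python. It does not use MXNet, and the loss function, the patience and min_delta settings, and the learning-rate schedule are all assumptions chosen for illustration.

```python
# Imperative training loop on a toy problem: minimize (w - 3)^2.
# Pure Python illustration; no MXNet calls are used.

def loss(w):
    return (w - 3.0) ** 2


def grad(w):
    return 2.0 * (w - 3.0)


w = 0.0
lr = 0.1            # learning rate, changed mid-run below
patience = 3        # stop after this many steps without real improvement
min_delta = 1e-6    # smallest loss decrease that counts as improvement
best_loss = float("inf")
bad_steps = 0
checkpoint = None
stopped_at = None

for step in range(1000):
    w -= lr * grad(w)            # one gradient descent update
    current = loss(w)

    if step == 19:
        lr = 0.05                # modify a hyperparameter while training

    if best_loss - current > min_delta:
        best_loss = current
        bad_steps = 0
        checkpoint = {"step": step, "w": w}   # save the best state so far
    else:
        bad_steps += 1
        if bad_steps >= patience:
            stopped_at = step    # early stopping
            break
```

Every intermediate value (w, current, lr) can be printed or inspected with a debugger at any step, which is the debugging convenience the paragraph above refers to.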
MXNet supports efficient deployment of a trained model to low-end devices for inference, such as mobile devices (using Amalgamation), IoT devices (using AWS Greengrass), serverless platforms (using AWS Lambda), or containers. These low-end environments may have only a weak CPU or limited memory (RAM), yet should still be able to use models that were trained in a more powerful environment (a GPU-based cluster, for example).
neon is Nervana's deep learning framework. Its features include:
- Support for commonly used layers: convolution, RNN, LSTM, GRU, BatchNorm, and more.
- Model Zoo contains pre-trained weights and example scripts for state-of-the-art models, including: VGG, Reinforcement learning, Deep Residual Networks, Image Captioning, Sentiment analysis, and more.
- Swappable hardware backends: write code once and then deploy on CPUs, GPUs, or Nervana hardware
For fast iteration and model exploration, neon is reported to have the fastest performance among deep learning libraries (2x the speed of cuDNNv4, according to its benchmarks):
- 2.5s/macrobatch (3072 images) on AlexNet on Titan X (Full run on 1 GPU ~ 26 hrs)
- Training VGG with 16-bit floating point on 1 Titan X takes ~10 days (original paper: 4 GPUs for 2-3 weeks)
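The AlexNet figures above can be cross-checked with a little arithmetic. The macrobatch size, time per macrobatch and total run time come from the text. The training-set size of 1,281,167 images (the ILSVRC-2012 training set) is an assumption added here, since the text does not say which dataset the full run used.

```python
# Cross-check of the quoted AlexNet timings.

IMAGES_PER_MACROBATCH = 3072
SECONDS_PER_MACROBATCH = 2.5
FULL_RUN_HOURS = 26

# Assumption (not stated in the text): ILSVRC-2012 training set size.
ASSUMED_TRAIN_SET_SIZE = 1_281_167

images_per_second = IMAGES_PER_MACROBATCH / SECONDS_PER_MACROBATCH   # 1228.8
macrobatches = FULL_RUN_HOURS * 3600 / SECONDS_PER_MACROBATCH        # 37440.0
total_images = macrobatches * IMAGES_PER_MACROBATCH                  # 115,015,680
approx_epochs = total_images / ASSUMED_TRAIN_SET_SIZE                # about 90
```

Under that assumption, 26 hours at 2.5 s per macrobatch amounts to roughly 90 passes over the training set, so the per-macrobatch and full-run figures are mutually consistent.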
The Microsoft Cognitive Toolkit (CNTK) is Microsoft's open-source deep learning toolkit.
CNTK offers three different build versions. The CPU-only build uses the optimized Intel MKLML; MKLML is released with Intel MKL-DNN as a trimmed version of Intel MKL for MKL-DNN. The GPU implementation uses highly optimized NVIDIA libraries (such as CUB and cuDNN) and supports distributed training across multiple GPUs and multiple machines. The 1bit-SGD version is a special GPU build of CNTK that enables the MSR-developed 1bit-quantized SGD and block-momentum SGD parallel training algorithms, which allow for even faster distributed training in CNTK. Note that the 1bit-SGD package is not necessary for performing parallel training in CNTK; the GPU build will suffice.
- Components can handle multi-dimensional dense or sparse data from Python, C++ or BrainScript
- FFN, CNN, RNN/LSTM, Batch normalization, Sequence-to-Sequence with attention and more
- Reinforcement learning, generative adversarial networks, supervised and unsupervised learning
- Ability to add new user-defined core-components on the GPU from Python
- Automatic hyperparameter tuning
- Built-in readers optimized for massive datasets
- Parallelism with accuracy on multiple GPUs/machines via 1-bit SGD and Block Momentum
- Memory sharing and other built-in methods to fit even the largest models in GPU memory
- Full APIs for defining networks, learners, readers, training and evaluation from Python, C++ and BrainScript
- Evaluate models with Python, C++, C# and BrainScript
- Interoperation with NumPy
- Both high-level and low-level APIs available for ease of use and flexibility
- Automatic shape inference based on your data
- Fully optimized symbolic RNN loops (no unrolling needed)
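The 1-bit SGD technique named above reduces communication between workers by sending one bit per gradient value and carrying the quantization error over to the next step. The sketch below illustrates that idea in pure Python under simplifying assumptions: it uses a single scale (the mean absolute value) for the whole gradient, and the function names quantize_1bit and decode are hypothetical. It is not CNTK's implementation.

```python
# Simplified 1-bit gradient quantization with error feedback.
# Illustration only: one scale for the whole gradient, pure Python lists.

def quantize_1bit(gradient, residual):
    # Add the error left over from the previous step before quantizing.
    corrected = [g + r for g, r in zip(gradient, residual)]
    scale = sum(abs(c) for c in corrected) / len(corrected)
    bits = [c >= 0 for c in corrected]          # one bit per value
    decoded = decode(bits, scale)
    # Whatever the bits could not represent is kept for the next step.
    new_residual = [c - d for c, d in zip(corrected, decoded)]
    return bits, scale, new_residual


def decode(bits, scale):
    # Reconstruct an approximate gradient from the bits and the scale.
    return [scale if b else -scale for b in bits]


gradient = [0.5, -0.1, 0.2, -0.4]
residual = [0.0, 0.0, 0.0, 0.0]

bits, scale, residual = quantize_1bit(gradient, residual)
approx_gradient = decode(bits, scale)   # what a worker would transmit, decoded
```

Only the bits and the scale need to be exchanged between workers. Because the residual is added back before the next quantization, the error is deferred rather than discarded, which is the mechanism intended to preserve accuracy during parallel training.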