Eugeny Shtoltc - Machine learning in practice – from PyTorch model to Kubeflow in the cloud for BigData
When a neuron is trained with a teacher (supervised learning), we feed training signals to it and obtain results at the output. For each pair of input and output signals we get a measure of the prediction error. Once all the training signals have been run, we have a set (vector) of errors that can be represented as an error function. This error function depends on the input parameters (the weights), and we need to find the weights at which it becomes minimal. To find these weights, the gradient descent algorithm is used: its essence is to move step by step toward a local minimum, with the direction of movement determined by the derivative of this function and of the activation function. The activation function is usually the sigmoid for ordinary networks or a truncated ReLU for deep networks. The sigmoid always outputs a value in the range from zero to one. The truncated ReLU still allows very large input values (when this information is important) to pass more than one to the output, where they affect the layers immediately following. For example, the dot above the stroke distinguishes the letter i from the letter l, and the information of a single pixel affects the decision at the output, so it is important not to lose this feature and to carry it through to the last layer. There are not many varieties of activation functions: they are constrained by the requirement of easy training, since a derivative has to be taken. Thus the derivative of the sigmoid f reduces to f(1 − f), which is cheap to compute. With Leaky ReLU (truncated ReLU with leakage) it is even simpler: it takes the value 0 at x < 0, so its derivative on this interval is also 0; for 0 ≤ x ≤ 1 it takes x, whose derivative is 1; and for x > 1 it takes 1 + 0.01·x, which gives 0.01 for the derivative. No real computation is required here at all, so training is much faster.
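As a rough sketch (not code from the book; the standard Leaky ReLU formula is used here rather than the truncated variant described above), the activation functions and their cheap derivatives can be written like this:

import numpy as np

def sigmoid(x):
    # always maps the input into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # the derivative is expressed through the function itself: f * (1 - f)
    f = sigmoid(x)
    return f * (1.0 - f)

def leaky_relu(x, slope=0.01):
    # standard Leaky ReLU: x for x >= 0, slope * x for x < 0
    return np.where(x >= 0, x, slope * x)

def leaky_relu_derivative(x, slope=0.01):
    # piecewise-constant derivative (1 or slope), so no heavy computation is needed
    return np.where(x >= 0, 1.0, slope)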
Since we feed the sum of the products of signals and their weights into the activation function, we may need a threshold level different from 0.5. We can shift it by a constant, adding it to the sum at the input of the activation function with the help of a bias neuron that stores it. The bias neuron has no inputs and always outputs one, and the shift itself is set by the weight of the connection with it. For multilayer networks, however, it is not required, since the weights in the preceding layers are themselves adjusted (made smaller or negative) so that the standard threshold level can be used: this gives standardization but requires a larger number of neurons.
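For illustration only (the function and values below are assumptions, not the book's code), the shift can be seen in how a single neuron computes its output:

import numpy as np

def neuron_output(inputs, weights, bias, activation):
    # the bias plays the role of the bias neuron's connection weight:
    # it shifts the weighted sum before it enters the activation function
    return activation(np.dot(inputs, weights) + bias)

# example: a sigmoid neuron whose threshold is shifted by bias = 0.1
out = neuron_output(np.array([0.2, 0.7]),
                    np.array([0.5, -0.3]),
                    bias=0.1,
                    activation=lambda z: 1.0 / (1.0 + np.exp(-z)))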
When training a neuron, we know the error of the network itself, that is, the error at its output neurons. Based on it, we can calculate the error in the previous layer and so on back to the input, which is called the backpropagation method.
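In PyTorch the backward pass is performed by autograd; a minimal sketch (the tensor shapes, values, and learning rate are arbitrary) might look like this:

import torch

x = torch.tensor([[0.5, -1.0]])            # input signal
w = torch.randn(2, 1, requires_grad=True)  # trainable weights
y_true = torch.tensor([[1.0]])             # expected output

y_pred = torch.sigmoid(x @ w)              # forward pass
loss = ((y_pred - y_true) ** 2).mean()     # error function
loss.backward()                            # backpropagation fills w.grad

with torch.no_grad():
    w -= 0.1 * w.grad                      # one gradient descent step
    w.grad.zero_()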
The learning process itself can be divided into stages: initialization, learning itself, and prediction.
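These stages map naturally onto a PyTorch training loop. A hedged sketch follows, where the model layout, the 28 x 28 input size, and the train_loader yielding (images, labels) batches are all assumptions for illustration:

import torch
import torch.nn as nn

# initialization
model = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

# training (train_loader is assumed to yield (images, labels) batches)
def train_epoch(train_loader):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images.view(images.size(0), -1)), labels)
        loss.backward()
        optimizer.step()

# prediction
def predict(image):
    with torch.no_grad():
        return model(image.view(1, -1)).argmax(dim=1)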
If our figure can appear at different sizes, pooling layers are applied, which scale the image down. What value is written when neighboring cells are merged depends on the algorithm: usually it is the "max" function for max pooling, or the average of the neighboring matrix cells for average pooling.
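In PyTorch both variants are available out of the box; a small sketch (the 4 x 4 feature map is just an example):

import torch
import torch.nn as nn

feature_map = torch.randn(1, 1, 4, 4)    # batch, channels, height, width

max_pool = nn.MaxPool2d(kernel_size=2)   # keeps the maximum of each 2x2 block
avg_pool = nn.AvgPool2d(kernel_size=2)   # keeps the average of each 2x2 block

print(max_pool(feature_map).shape)       # torch.Size([1, 1, 2, 2])
print(avg_pool(feature_map).shape)       # torch.Size([1, 1, 2, 2])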
We already have a few layers. But in neural networks used in practice there can be many more. Networks with more than four layers are commonly called deep neural networks (Deep ML). There can be a great many layers: ResNet has 152 of them, and it is far from the deepest network. As you have already noticed, the number of layers is not chosen on the principle "the more the better" but by prototyping. An excessive number degrades quality because of signal attenuation, unless specific solutions are used against it, such as forwarding data past layers with subsequent summation. Examples of neural network architectures include ResNeXt, SENet, DenseNet, Inception-ResNet-V2, Inception-V4, Xception, NASNet, MobileNetV2, ShuffleNet, and SqueezeNet. Most of these networks are designed for image analysis, and it is images that often contain the greatest amount of detail; the largest number of operations is assigned to these networks, which determines their depth. We will consider one such architecture when creating a digit classification network: LeNet-5, created in 1998.
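As a preview, a possible PyTorch sketch of LeNet-5 (layer sizes follow the classic description with a 32 x 32 single-channel input; the implementation discussed later may differ in details):

import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(), nn.AvgPool2d(2),
            nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(), nn.AvgPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
            nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):                  # x: (batch, 1, 32, 32)
        x = self.features(x)
        return self.classifier(x.flatten(1))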
If we need to recognize not just a digit or a letter but their sequence and the meaning contained in it, we need a connection between them. For this, after analyzing the first letter, the neural network passes the result of analyzing the current letter, together with the next letter, back to its input. This can be compared to dynamic memory, and a network implementing this principle is called recurrent (RNN). Examples of such networks (with feedback): the Kohonen network, the Hopfield network, the ART model. A recurrent network analyzes text, speech and video, translates from one language into another, generates text descriptions for images, generates speech (WaveNet MoL, Tacotron 2), and categorizes texts by content (for example, whether they are spam). The main direction in which researchers work in trying to improve such networks is determining the principle by which the network decides what previous information to take into account, for how long, and to what extent. Networks that adopt specialized tools for storing information are called LSTM (Long Short-Term Memory) networks.
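A minimal PyTorch sketch of such a memory cell (the sizes are purely illustrative):

import torch
import torch.nn as nn

# an LSTM reads a sequence step by step and carries its state between steps
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
sequence = torch.randn(1, 5, 8)           # batch of 1, sequence of 5 steps

output, (hidden, cell) = lstm(sequence)   # hidden and cell hold the "memory"
print(output.shape)                       # torch.Size([1, 5, 16])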
Not all combinations are successful; some only allow solving narrow problems. As the complexity grows, an ever smaller percentage of possible architectures turns out to be successful and receives a name of its own.
In general, there are networks that are fundamentally different in structure and principles:
* feedforward networks
* convolutional neural networks
* recurrent neural networks
* autoencoders (classic, sparse, variational, denoising)
* deep belief networks
* generative adversarial networks – an opposition of two networks: a generator and a discriminator (evaluator)
* neural Turing machines – a neural network with a memory block
* Kohonen neural networks – for unsupervised learning
* various architectures of networks with cyclic connections: the Hopfield network, Markov chains, the Boltzmann machine
Let us consider in more detail the most commonly used, namely, feedforward, convolutional and recurrent networks:
Feedforward networks:
* two inputs and one output – Perceptron (P)
* two inputs, two fully connected neurons with an output, and one output – Feed Forward (FF) or Radial Basis Network (RBN)
* three inputs, two layers of four fully connected neurons, and two outputs – Deep Feed Forward (DFF); see the sketch after this list
* deep neural networks
* extreme learning machines – networks with random connections (echo state networks)
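A sketch of the DFF topology from the list above in PyTorch (the choice of activation is an assumption):

import torch.nn as nn

# three inputs, two hidden layers of four fully connected neurons, two outputs
dff = nn.Sequential(
    nn.Linear(3, 4), nn.Sigmoid(),
    nn.Linear(4, 4), nn.Sigmoid(),
    nn.Linear(4, 2),
)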
Convolutional neural networks:
* traditional convolutional neural networks (CNN) – image classification
* deconvolutional (unfolding) neural networks – image generation from a given type
* deep convolutional inverse graphics networks (DCIGN) – a combination of convolutional and deconvolutional networks for transforming or combining images
Recurrent neural networks:
* recurrent neural networks – networks with memory in the neurons, for analyzing sequences in which the order matters, such as text, sound, and video
* Long Short-Term Memory (LSTM) networks – a development of recurrent neural networks in which neurons can separate data worth keeping in long-term memory from data worth forgetting, and delete that information from their memory
* deep residual networks – networks with connections that skip layers (similar in operation to LSTM)
* gated recurrent units (GRU)
Basics for writing networks.
Until 2015, scikit-learn led by a wide margin, with Caffe catching up, but with the release of TensorFlow the latter immediately became the leader. Over time it only increased its lead, by a factor of two to three by 2020, when it had more than 140 thousand projects on GitHub while the closest competitor had just over 45 thousand. In 2020, in descending order, came Keras, scikit-learn, PyTorch (Facebook), Caffe, MXNet, XGBoost, Fastai, Microsoft CNTK (Cognitive Toolkit), DarkNet, and some other lesser-known libraries. The most popular are the PyTorch and TensorFlow libraries. PyTorch is good for prototyping, learning, and trying out new models. TensorFlow is popular in production environments, and the low-level issue is addressed by Keras.
* Facebook PyTorch is a good option for learning and prototyping thanks to its high level and support for various environments; its dynamic graph can give advantages in training. Used by Twitter and Salesforce.
* Google TensorFlow originally had a static computation graph; a dynamic graph is now supported as well. It is used in Gmail, Google Translate, Uber, Airbnb, and Dropbox. To encourage its use in the Google cloud, the Google TPU (Tensor Processing Unit) hardware processor is offered for it.
* Keras is a high-level add-on providing more abstraction over TensorFlow, Theano, or CNTK. A good option for learning. For example, it allows you not to specify the dimensions of layers, calculating them itself, and lets the developer focus on the architecture of the layers (see the sketch after this list). It is usually used on top of TensorFlow. Support for Keras code in CNTK is maintained by Microsoft.
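A hedged Keras sketch of this behavior (the layer sizes and the 28 x 28 input are assumptions): only the input shape is given, and the dimensions of all intermediate layers are inferred by Keras itself.

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(28 * 28,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")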
There are also more specialized frameworks:
* Apache MXNet (Amazon) and Gluon, a high-level add-on for it. MXNet is a framework with an emphasis on scaling and supports integration with Hadoop and Cassandra. C++, Python, R, Julia, JavaScript, Scala, Go, and Perl are supported.
* Microsoft CNTK has integrations with Python, R, and C#, since most of its code is written in C++. The fact that its core is written in C++ does not mean that CNTK trains a model in C++ while TensorFlow does so in Python (which would be slow): TensorFlow builds graphs, and their execution is already carried out in C++. What distinguishes CNTK from Google TensorFlow is that it was originally designed to run on Azure clusters with multiple GPUs, but by now the situation has leveled out and TensorFlow also supports clusters.
* Caffe2 is a framework for mobile environments.
* Sonnet – DeepMind add-on on top of TensorFlow for training super-deep neural networks.
* DL4J (Deep Learning for Java) is a framework with an emphasis on Java Enterprise Edition. Strong support for BigData in Java: Hadoop and Spark.
The situation with the speed at which new pre-trained models become available is different, and so far PyTorch is leading here. In terms of support for environments, in particular public clouds, the frameworks promoted by the vendors of those clouds fare better, so TensorFlow support is better in Google Cloud, MXNet in AWS, CNTK in Microsoft Azure, DL4J on Android, and Core ML on iOS. By languages, almost all of them have common support for Python; in particular, TensorFlow also supports JavaScript, C++, Java, Go, C#, and Julia.