Rabindra Lamsal
I believe the hunger to learn should never saturate.
https://rlamsal.com.np/
Mon, 12 Aug 2019 03:31:40 +0000
Jekyll v3.8.5
Title of post 1
<h3 id="introduction">Introduction</h3>
<p>Recurrent Neural Networks and their variants are prone to overfitting the training data. Unfolding each cell of the RNN over time produces a very large network, while the number of parameters (which are shared across all time steps) and the amount of training data remain <em>relatively</em> small. As a result, the perplexities obtained on the test data are often considerably higher than expected. Several attempts have been made to reduce this problem using various <strong>regularization</strong> techniques. This paper tackles the issue by proposing a model that combines several such existing methods.</p>
<p><em>Merity et al.</em>’s model is a modification of the standard <strong>LSTM</strong> in which <em>DropConnect</em> is applied to the hidden-to-hidden weights in the <em>recurrent</em> connections of the LSTM for regularization. The dropout mask for each weight is sampled once and the same mask is used across all time steps, which adds negligible computational overhead. Apart from this, several other techniques have been incorporated :</p>
<ul>
<li><strong>Variational dropout</strong> : A dropout mask is sampled once and then reused at every time step within a given forward and backward pass, instead of being resampled at each step. Merity et al. use it for the inputs and outputs of the LSTM, while the recurrent connections are handled by DropConnect. Each example in a mini-batch has a separate dropout mask, which ensures that the regularizing effect isn’t identical across different examples.</li>
<li><strong>Embedding dropout</strong> : Dropout with probability <script type="math/tex">p_e</script> is applied to the embedding matrix at the word level, so the vectors of the dropped words become identically zero. The remaining word vectors are scaled by <script type="math/tex">\frac{1}{1-p_e}</script> as compensation.</li>
<li><strong>AR and TAR</strong> : AR (Activation Regularization) and TAR (Temporal Activation Regularization) are variants of <script type="math/tex">L_2</script> regularization in which the penalty is applied not to the weights but to the dropped <em>output activations</em> and to the dropped <em>change in output activations</em> between consecutive time steps, respectively. Mathematically, the additional terms in the cost function <script type="math/tex">J</script> are (here <script type="math/tex">\alpha</script> and <script type="math/tex">\beta</script> are scaling constants and <script type="math/tex">\textbf{D}</script> is the dropout mask) :</li>
</ul>
<script type="math/tex; mode=display">J_{AR}=\alpha L_2\left(\textbf{D}_l^t\odot h_l^t\right)\\
J_{TAR}=\beta L_2\left(\textbf{D}_l^t\odot\left(h_l^t - h_l^{t-1}\right)\right)</script>
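<p>The two penalty terms can be sketched in a few lines of plain Python. This is only an illustration of the formulas above, not the paper's implementation: the helper names are made up, and summing the per-time-step norms into a single number is an assumption.</p>

```python
import math

def l2_norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(v * v for v in vec))

def ar_tar_penalties(h, mask, alpha=2.0, beta=1.0):
    """Return (J_AR, J_TAR), each summed over time steps (an assumption).

    h    - list of T output vectors h^t (each a list of floats)
    mask - list of T dropout masks D^t with the same shape as h
    """
    j_ar = 0.0
    j_tar = 0.0
    for t in range(len(h)):
        # AR term: L2 norm of the dropped activations at step t
        j_ar += alpha * l2_norm([d * x for d, x in zip(mask[t], h[t])])
        if t > 0:
            # TAR term: L2 norm of the dropped change h^t - h^{t-1}
            diff = [x - y for x, y in zip(h[t], h[t - 1])]
            j_tar += beta * l2_norm([d * x for d, x in zip(mask[t], diff)])
    return j_ar, j_tar

h = [[1.0, 2.0], [1.0, 0.0], [4.0, 4.0]]
mask = [[1.0, 1.0]] * 3  # identity mask: nothing is dropped in this toy case
j_ar, j_tar = ar_tar_penalties(h, mask)
```

<p>With the identity mask, the TAR term only sees the two differences <code class="highlighter-rouge">[0, -2]</code> and <code class="highlighter-rouge">[3, 4]</code>, so it penalizes abrupt changes of the hidden output from one time step to the next.</p>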
<ul>
<li><strong>Weight tying</strong> : In this method, the parameters for word embeddings and the final output layer are shared.</li>
<li><strong>Variable backpropagation steps</strong> : Instead of a fixed number of BPTT steps, a random number of steps is taken, with a mean very close to the original fixed value (<script type="math/tex">s</script>). The BPTT length (<script type="math/tex">x</script>) is drawn from the following distribution (here <script type="math/tex">\mathcal{N}</script> is the Gaussian distribution, <script type="math/tex">p</script> is a probability close to 1, e.g. 0.95, and <script type="math/tex">\sigma^2</script> is the desired variance) :</li>
</ul>
<script type="math/tex; mode=display">x \sim p\cdot \mathcal{N}\left(s,\sigma^2\right) + (1-p)\cdot \mathcal{N}\left(\frac{s}{2},\sigma^2\right)</script>
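<p>Sampling from this mixture takes two draws: one to pick the component and one from the chosen Gaussian. The sketch below assumes the values listed later in this post (<code class="highlighter-rouge">s = 70</code>, <code class="highlighter-rouge">p = 0.95</code>, variance <code class="highlighter-rouge">5</code>). The function name, the rounding and the lower bound of 1 are illustrative choices, not something the paper prescribes.</p>

```python
import math
import random

def sample_bptt_length(s=70, variance=5.0, p=0.95, rng=random):
    """Draw one BPTT length from the two-component Gaussian mixture."""
    # Choose the component: mean s with probability p, mean s/2 otherwise
    mean = s if rng.random() < p else s / 2
    # gauss() expects a standard deviation, hence the square root
    x = rng.gauss(mean, math.sqrt(variance))
    # Rounding and the floor of 1 are assumptions that keep the length usable
    return max(1, int(round(x)))

rng = random.Random(0)
lengths = [sample_bptt_length(rng=rng) for _ in range(10000)]
mean_length = sum(lengths) / len(lengths)
```

<p>The expected length is <code class="highlighter-rouge">0.95 * 70 + 0.05 * 35 = 68.25</code>, so the average stays close to <code class="highlighter-rouge">s</code> while the window boundaries shift from batch to batch.</p>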
<ul>
<li><strong>Independent sizes of word embeddings and hidden layer</strong> : The sizes of the hidden layer and word embeddings are kept independent of each other.</li>
</ul>
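<p>Of the techniques listed above, embedding dropout is the easiest to show in isolation. The following is a minimal sketch in plain Python, assuming the embedding matrix is stored as a list of word vectors; the function name and the toy vocabulary are hypothetical.</p>

```python
import random

def embedding_dropout(embedding, p_e, rng):
    """Zero whole word vectors with probability p_e; scale the rest by 1/(1-p_e)."""
    scale = 1.0 / (1.0 - p_e)
    dropped = []
    for vector in embedding:
        # One draw per word, so a dropped word loses its entire vector
        if rng.random() < p_e:
            dropped.append([0.0] * len(vector))
        else:
            dropped.append([scale * v for v in vector])
    return dropped

rng = random.Random(0)
embedding = [[1.0, 1.0, 1.0, 1.0] for _ in range(1000)]  # toy 1000-word vocabulary
dropped = embedding_dropout(embedding, p_e=0.1, rng=rng)
num_zeroed = sum(1 for vector in dropped if not any(vector))
```

<p>Because the draw is made per word rather than per element, every occurrence of a dropped word is removed for that pass, and roughly a fraction <code class="highlighter-rouge">p_e</code> of the vocabulary is zeroed.</p>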
<p>The paper also introduces a new optimization algorithm, namely <strong>Non-monotonically Triggered Averaged Stochastic Gradient Descent</strong> or NT-ASGD, which can be programmatically described as follows :</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">def NT_ASGD(f, stop, w0, n, L, lr):
    """
    Input parameters :
    f - objective function
    stop - stopping criterion; `stop(w)` returns True once training should end
    w0 - initial parameters
    n - non-monotone interval
    L - logging interval (number of iterations between validation checks)
    lr - learning rate
    Returns :
    average of the iterates from the trigger point `T` onwards
    """
    # `func_grad` and `perplexity` are assumed to be defined elsewhere
    k = 0; T = 0; t = 0; params = [w0]; logs = []
    w = w0
    while not stop(w):
        # `func_grad` computes a stochastic gradient of `f` at `w`
        w = w - lr * func_grad(f, w)
        params.append(w)
        k += 1
        if k % L == 0 and T == 0:
            # Compute the model's validation perplexity for the current parameters
            v = perplexity(w)
            # Trigger averaging once `v` is worse than the best value
            # logged more than `n` checks ago
            if t &gt; n and v &gt; min(logs[:t-n]):
                T = k
            logs.append(v)
            t += 1
    # Return the average of the last `k-T+1` iterates, w_T to w_k
    return sum(params[T:]) / (k - T + 1)</code></pre></figure>
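<p>To see the trigger in action, here is a self-contained toy run on a noisy one-dimensional quadratic. Everything in it is an assumption made for illustration: the objective <code class="highlighter-rouge">(w - 3)^2</code> stands in for the language model, its exact value stands in for validation perplexity, and a fixed step budget replaces the stopping criterion. If the trigger never fires, this sketch simply returns the last iterate.</p>

```python
import random

def nt_asgd(grad, validate, w0, lr, n, L, num_steps):
    """Plain SGD that switches to iterate averaging once validation stops improving."""
    k, t, T = 0, 0, 0
    logs = []
    w = w0
    running_sum = 0.0  # sum of the iterates w_T, ..., w_k once triggered
    while k < num_steps:
        w = w - lr * grad(w)
        k += 1
        if T > 0:
            running_sum += w
        if k % L == 0 and T == 0:
            v = validate(w)
            # Non-monotone trigger: worse than the best value seen
            # more than n checks ago
            if t > n and v > min(logs[:t - n]):
                T = k
                running_sum = w
            logs.append(v)
            t += 1
    if T == 0:
        return w, T  # never triggered: fall back to the last iterate
    return running_sum / (k - T + 1), T

rng = random.Random(0)

def noisy_grad(w):
    # Gradient of (w - 3)^2 plus Gaussian noise, mimicking a stochastic gradient
    return 2.0 * (w - 3.0) + rng.gauss(0.0, 1.0)

def validation_loss(w):
    return (w - 3.0) ** 2

w_avg, T = nt_asgd(noisy_grad, validation_loss, w0=0.0, lr=0.1, n=5, L=1, num_steps=2000)
```

<p>Once the iterates start bouncing around the minimum, the validation value stops improving, <code class="highlighter-rouge">T</code> is set, and the returned average is much less noisy than any single iterate.</p>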
<p>They also combined their <strong>AWD-LSTM</strong> (ASGD Weight-Dropped LSTM) with a neural cache model to reduce perplexities further. A <em>neural cache model</em> stores previous hidden states in memory and predicts the next word from a <em>convex combination</em> of the distribution derived from the stored states and the distribution produced by the AWD-LSTM.</p>
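<p>The convex combination can be sketched as follows. The cache distribution here scores each stored state by its dot product with the current state and credits the word that followed it, which is one common formulation of a continuous cache. The names <code class="highlighter-rouge">theta</code> (sharpness) and <code class="highlighter-rouge">lam</code> (mixing weight) and the toy numbers are assumptions, and the memory is assumed to be non-empty.</p>

```python
import math

def cache_distribution(h_t, memory, vocab_size, theta=1.0):
    """Distribution over the vocabulary built from stored (state, next_word) pairs."""
    scores = [0.0] * vocab_size
    for h_i, next_word in memory:
        # Stored states similar to the current one vote for the word that followed them
        similarity = sum(a * b for a, b in zip(h_t, h_i))
        scores[next_word] += math.exp(theta * similarity)
    total = sum(scores)
    return [s / total for s in scores]

def mix_with_cache(p_model, p_cache, lam=0.1):
    """Convex combination of the model and cache distributions."""
    return [(1.0 - lam) * pm + lam * pc for pm, pc in zip(p_model, p_cache)]

p_model = [0.5, 0.3, 0.2]                    # toy output of the language model
memory = [([1.0, 0.0], 2), ([0.0, 1.0], 0)]  # (hidden state, word that followed)
p_cache = cache_distribution([1.0, 0.0], memory, vocab_size=3)
p_final = mix_with_cache(p_model, p_cache, lam=0.2)
```

<p>Word 2 followed a state identical to the current one, so its probability rises in the mixture, while the result still sums to 1.</p>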
<h3 id="network-description">Network description</h3>
<p><em>Merity et al.</em>’s model is a 3-layer weight-dropped LSTM with a DropConnect probability of <code class="highlighter-rouge">0.5</code> on the recurrent weight matrices, evaluated on the <strong>PTB corpus</strong> and on <strong>WikiText-2</strong> and combined with several of the above regularization techniques. The hyperparameters (as referred to in the discussion above) are as follows : hidden layer size (<script type="math/tex">H</script>) = <code class="highlighter-rouge">1150</code>, embedding size (<script type="math/tex">D</script>) = <code class="highlighter-rouge">400</code>, number of epochs = <code class="highlighter-rouge">750</code>, <script type="math/tex">L</script> = <code class="highlighter-rouge">1</code>, <script type="math/tex">n</script> = <code class="highlighter-rouge">5</code>, learning rate = <code class="highlighter-rouge">30</code>, gradients clipped at <code class="highlighter-rouge">0.25</code>, <script type="math/tex">p</script> = <code class="highlighter-rouge">0.95</code>, <script type="math/tex">s</script> = <code class="highlighter-rouge">70</code>, <script type="math/tex">\sigma^2</script> = <code class="highlighter-rouge">5</code>, <script type="math/tex">\alpha</script> = <code class="highlighter-rouge">2</code>, <script type="math/tex">\beta</script> = <code class="highlighter-rouge">1</code>, and dropout probabilities for the input, hidden outputs, final output and embeddings of <code class="highlighter-rouge">0.4</code>, <code class="highlighter-rouge">0.3</code>, <code class="highlighter-rouge">0.4</code> and <code class="highlighter-rouge">0.1</code> respectively. For WikiText-2, the input dropout is raised to <code class="highlighter-rouge">0.65</code>.</p>
<p>Word embedding weights were initialized from <script type="math/tex">\mathcal{U}\left[-0.1,0.1\right]</script> and all other weights from <script type="math/tex">\mathcal{U}\left[-\frac{1}{\sqrt{1150}},\frac{1}{\sqrt{1150}}\right]</script>, where 1150 is the hidden layer size <script type="math/tex">H</script>. A mini-batch size of <code class="highlighter-rouge">40</code> was used for PTB and <code class="highlighter-rouge">80</code> for WikiText-2.</p>
<h3 id="result-highlights">Result highlights</h3>
<ul>
<li>3-layer AWD-LSTM with weight tying attained 57.3 PPL on PTB</li>
<li>3-layer AWD-LSTM with weight tying and a continuous cache pointer attained 52.8 PPL on PTB</li>
</ul>
Wed, 10 Jan 2018 15:10:00 +0000
https://rlamsal.com.np/blog/2018/01/post-1
rnn, discussion
Title of post 2
<h3 id="introduction">Introduction</h3>
<p>Regularization is an important part of training neural networks. It helps the model generalize by reducing the risk of overfitting the training data. The major techniques are L2, L1, elastic-net (a linear combination of L2 and L1 regularization), dropout and drop-connect. L2, L1 and elastic-net regularization work by preventing the trainable parameters (or <em>weights</em>) from attaining large values, so that slight changes in the input do not cause drastic changes in the output; in other words, they prefer <em>diffused</em> weights to <em>peaked</em> ones. <strong>Dropout</strong> and drop-connect instead average the task over a large, dynamically and randomly generated <em>ensemble</em> of networks. These networks are obtained by randomly disconnecting neurons (<em>dropout</em>) or weights (<em>drop-connect</em>) from the original network, which yields a subnetwork on which training is carried out (although only for <script type="math/tex">\approx1</script> training step each, since the same subnetwork is very unlikely to be generated again).</p>
<p>Applying dropout to feedforward neural networks gives promising results. RNNs, however, can be thought of as individual <em>cells</em> that are <em>unfolded</em> over several time-steps, with the input at each time-step being a token of the sequence. When dropout is used to regularize such a network, the ‘disturbance’ it generates at each time step propagates over a long interval, thereby decreasing the network’s ability to represent long-range dependencies. Thus, applying dropout in the standard manner to RNNs fails to give any improvement. This is where <em>Zaremba et al.</em>’s research comes into the picture.</p>
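<p>The difference between the two masking schemes is easiest to see on a single dense layer. The sketch below is only an illustration in plain Python: the function names are made up, and rescaling the survivors by <code class="highlighter-rouge">1/(1-p)</code> is an assumption (the usual "inverted" convention), not something stated above.</p>

```python
import random

def dense(weights, x):
    """Matrix-vector product for a weight matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def dropout_forward(weights, x, p, rng):
    """Dropout: zero whole input units, then rescale the survivors."""
    kept = [0.0 if rng.random() < p else v / (1.0 - p) for v in x]
    return dense(weights, kept)

def dropconnect_forward(weights, x, p, rng):
    """Drop-connect: zero individual weights, then rescale the survivors."""
    masked = [[0.0 if rng.random() < p else w / (1.0 - p) for w in row]
              for row in weights]
    return dense(masked, x)

W = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]  # toy layer: 3 inputs, 2 outputs
x = [1.0, 1.0, 1.0]
y_dropout = dropout_forward(W, x, 0.5, random.Random(1))
y_dropconnect = dropconnect_forward(W, x, 0.5, random.Random(1))
```

<p>With dropout, a removed input unit disappears from every output at once, so the two identical rows of <code class="highlighter-rouge">W</code> always give identical outputs. With drop-connect, each weight is masked independently, so the two outputs can differ.</p>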
<h3 id="network-description">Network description</h3>
<p>The network architecture that <em>Zaremba et al</em> proposed is quite simple and intuitive. In the case of deep RNNs (i.e. RNNs with several layers (<script type="math/tex">h</script>), where the output of <script type="math/tex">h_{l-1}^{t}</script> is used as the input for <script type="math/tex">h_l^t</script>), all connections between the cells in the unfolded state can be broadly classified into two categories - <em>recurrent</em> and <em>non-recurrent</em>. The connections between cells in the same layer i.e. <script type="math/tex">h_l^t ~-~ h_l^{t+1}~~\forall t</script> are <em>recurrent</em> connections, and those between cells in adjacent layers i.e. <script type="math/tex">h_l^t ~-~ h_{l+1}^t~~\forall l</script> are <em>non-recurrent</em> connections. <em>Zaremba et al</em> suggested that <strong>dropout</strong> should be applied only to non-recurrent connections, so that the information carried along the recurrent connections is never disturbed and the problem described above is avoided.</p>
<p>The modified network for LSTM units can be mathematically represented as follows:</p>
<p>Let <script type="math/tex">T_{m,n}</script> denote an affine transformation from <script type="math/tex">\mathbb{R}^m\rightarrow\mathbb{R}^n</script> (i.e. <script type="math/tex">T_{m,n}(x)=Wx+b</script> where <script type="math/tex">x\in\mathbb{R}^{m\times1}</script>, <script type="math/tex">W\in\mathbb{R}^{n\times m}</script> and <script type="math/tex">b\in\mathbb{R}^{n\times1}</script>, and similarly for multiple inputs) and <script type="math/tex">\otimes</script> denote elementwise multiplication. Then we have :</p>
<script type="math/tex; mode=display">f_l^t=\text{sigmoid}\left(T_{N,D}^1\left(\textbf{D}\left(h_{l-1}^t\right) ; h_l^{t-1}\right)\right) \\
i_l^t=\text{sigmoid}\left(T_{N,D}^2\left(\textbf{D}\left(h_{l-1}^t\right) ; h_l^{t-1}\right)\right) \\
o_l^t=\text{sigmoid}\left(T_{N,D}^3\left(\textbf{D}\left(h_{l-1}^t\right) ; h_l^{t-1}\right)\right) \\
u_l^t=\text{tanh}\left(T_{N,D}^4\left(\textbf{D}\left(h_{l-1}^t\right) ; h_l^{t-1}\right)\right) \\
c_l^t=c_l^{t-1}\otimes f_l^t + u_l^t\otimes i_l^t \\
h_l^t = \text{tanh}\left(c_l^t\right)\otimes o_l^t</script>
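<p>Here, <script type="math/tex">\textbf{D}</script> is the dropout <em>layer</em> or operator which sets a random subset of its input to zero with dropout probability <code class="highlighter-rouge">p</code>. In <script type="math/tex">T_{N,D}</script>, the subscript <script type="math/tex">N</script> is the dimension of the combined input and the subscript <script type="math/tex">D</script> is the dimension of the hidden state (not to be confused with the operator <script type="math/tex">\textbf{D}</script>), while the superscripts 1 to 4 index four separate transformations. Note that <script type="math/tex">\textbf{D}</script> is applied only to <script type="math/tex">h_{l-1}^t</script>, never to <script type="math/tex">h_l^{t-1}</script>. This modification can be adopted for any other RNN architecture.</p>
<p><em>Zaremba et al</em> used the following two architectures for their experiments, each unrolled for 35 steps and trained with a mini-batch size of 20 :</p>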
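<p>A single step of such a cell can be sketched in plain Python. This is a toy version under several assumptions: the parameter layout (<code class="highlighter-rouge">params</code> as a dictionary of weight/bias pairs), the helper names and the tiny dimensions are invented for illustration, and the dropout operator does no rescaling, since the description above does not mention any.</p>

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dropout(vec, p, rng):
    """The operator D: zero each entry with probability p (no rescaling here)."""
    return [0.0 if rng.random() < p else v for v in vec]

def affine(W, b, x):
    """T(x) = Wx + b for W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]

def lstm_step(params, h_below, h_prev, c_prev, p, rng):
    """One step of layer l at time t; dropout touches only the input from layer l-1."""
    # Non-recurrent input is dropped, the recurrent input h_prev is left intact
    z = dropout(h_below, p, rng) + h_prev  # list concatenation
    f = [sigmoid(a) for a in affine(*params["f"], z)]
    i = [sigmoid(a) for a in affine(*params["i"], z)]
    o = [sigmoid(a) for a in affine(*params["o"], z)]
    u = [math.tanh(a) for a in affine(*params["u"], z)]
    c = [cp * fj + uj * ij for cp, fj, uj, ij in zip(c_prev, f, u, i)]
    h = [math.tanh(cj) * oj for cj, oj in zip(c, o)]
    return h, c

hidden = 2
init = random.Random(0)
# Four transformations mapping the 2*hidden combined input to hidden outputs
params = {
    gate: ([[init.uniform(-0.05, 0.05) for _ in range(2 * hidden)]
            for _ in range(hidden)], [0.0] * hidden)
    for gate in ("f", "i", "o", "u")
}
h, c = lstm_step(params, [1.0, 2.0], [0.1, -0.1], [0.0, 0.0], p=0.5,
                 rng=random.Random(1))
```

<p>Because <code class="highlighter-rouge">h_prev</code> and <code class="highlighter-rouge">c_prev</code> bypass the dropout operator, the state carried from one time step to the next is never corrupted, however many steps the cell is unrolled for.</p>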
<p><strong>Medium LSTM</strong> :
Hidden-layer dimension = <code class="highlighter-rouge">650</code>,
Weights initialized uniformly in <code class="highlighter-rouge">[-0.05,0.05]</code>,
Dropout probability = <code class="highlighter-rouge">0.5</code>,
Number of epochs = <code class="highlighter-rouge">39</code>,
Learning rate = <code class="highlighter-rouge">1</code>, which after the first 6 epochs is divided by a factor of <code class="highlighter-rouge">1.2</code> after each epoch,
Gradients clipped at <code class="highlighter-rouge">5</code>.</p>
<p><strong>Large LSTM</strong> :
Hidden-layer dimension = <code class="highlighter-rouge">1500</code>,
Weights initialized uniformly in <code class="highlighter-rouge">[-0.04,0.04]</code>,
Dropout probability = <code class="highlighter-rouge">0.65</code>,
Number of epochs = <code class="highlighter-rouge">55</code>,
Learning rate = <code class="highlighter-rouge">1</code>, which after the first 14 epochs is divided by a factor of <code class="highlighter-rouge">1.15</code> after each epoch,
Gradients clipped at <code class="highlighter-rouge">10</code>.</p>
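<p>The two learning-rate schedules can be written as one small function. The sketch assumes 1-indexed epochs and that the rate is divided by the decay factor once per epoch after the initial constant period; the exact epoch at which the first division happens is an assumption, and the function name is made up.</p>

```python
def learning_rate(epoch, base_lr=1.0, decay=1.2, hold_epochs=6):
    """Learning rate for a 1-indexed epoch: constant first, then divided by `decay` per epoch."""
    extra = max(0, epoch - hold_epochs)
    return base_lr / (decay ** extra)

# Medium LSTM: 39 epochs, decay 1.2 after the first 6 epochs
medium = [learning_rate(e, decay=1.2, hold_epochs=6) for e in range(1, 40)]
# Large LSTM: 55 epochs, decay 1.15 after the first 14 epochs
large = [learning_rate(e, decay=1.15, hold_epochs=14) for e in range(1, 56)]
```

<p>Under these assumptions both models end with a learning rate a few hundred times smaller than the initial value of 1.</p>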
<h3 id="result-highlights">Result highlights</h3>
<ul>
<li>78.4 PPL on Penn TreeBank dataset using a single <strong>Large LSTM</strong></li>
<li>68.7 PPL on Penn TreeBank dataset using an ensemble of 38 <strong>Large LSTM</strong>s</li>
</ul>
Wed, 10 Jan 2018 15:01:00 +0000
https://rlamsal.com.np/blog/2018/01/post-2
rnn, regularization