<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Personal Webpage of Max Horn</title>
    <description>I work as an Applied Scientist at AWS AI, where I explore the intersection of Machine Learning, Deep Learning and Causality. In particular, I am interested in building models that understand notions of objects present in the real world.&lt;br /&gt;
Prior to my position at AWS, I was a PhD student in Machine Learning and Computational Biology at ETH Zürich, where I worked on the development of &lt;b&gt;deep learning methods for real-world medical time series&lt;/b&gt;, &lt;b&gt;Dimensionality Reduction&lt;/b&gt; and &lt;b&gt;Topological Machine Learning&lt;/b&gt;.&lt;br /&gt;
My interests include but are not limited to: &lt;b&gt;Machine Learning for Healthcare&lt;/b&gt;, &lt;b&gt;Probabilistic Modelling&lt;/b&gt;, &lt;b&gt;Time Series Modelling&lt;/b&gt; and &lt;b&gt;Interpretable Machine Learning&lt;/b&gt;.&lt;br /&gt;
Here I write about stuff I care about in the realm of science, programming, technology and crypto.  All opinions expressed are solely my own and do not reflect the views or opinions of my employer.
</description>
    <link>https://ExpectationMax.github.io/</link>
    <atom:link href="https://ExpectationMax.github.io/feed.xml" rel="self" type="application/rss+xml" />
    <pubDate>Thu, 05 May 2022 10:28:36 +0000</pubDate>
    <lastBuildDate>Thu, 05 May 2022 10:28:36 +0000</lastBuildDate>
    <generator>Jekyll v3.9.2</generator>
    
      <item>
        <title>NeoVim’s built-in Language Server Client and why you should use it</title>
        <description>&lt;p&gt;As a fan of the &lt;a href=&quot;https://microsoft.github.io/language-server-protocol/&quot;&gt;Language Server
Protocol&lt;/a&gt; introduced by
Microsoft, I was very excited to hear that &lt;a href=&quot;https://neovim.io/&quot;&gt;NeoVim&lt;/a&gt; (an
aggressive refactor of Vim) would soon be shipping its own language server
client.&lt;/p&gt;

&lt;p&gt;Here I will show why I like the Language Server Protocol and how to configure
the built-in language server client to make it a bit more user-friendly.&lt;/p&gt;

&lt;p&gt;When I first looked into using NeoVim’s internal language server, &lt;a href=&quot;https://jdhao.github.io/2019/11/20/neovim_builtin_lsp_hands_on/&quot;&gt;jdhao’s
blog post&lt;/a&gt;
really helped me out – I invite you to check it out as an additional resource.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Table of Contents:&lt;/strong&gt;&lt;/p&gt;
&lt;ul id=&quot;markdown-toc&quot;&gt;
  &lt;li&gt;&lt;a href=&quot;#the-language-server-protocol--whats-all-the-fuss-about&quot; id=&quot;markdown-toc-the-language-server-protocol--whats-all-the-fuss-about&quot;&gt;The Language Server Protocol – What’s all the fuss about&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#neovims-internal-language-server-client&quot; id=&quot;markdown-toc-neovims-internal-language-server-client&quot;&gt;NeoVim’s internal Language Server Client&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#faster-completion&quot; id=&quot;markdown-toc-faster-completion&quot;&gt;Faster completion&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#getting-more-out-of-the-language-server---diagnostics-and-definitions&quot; id=&quot;markdown-toc-getting-more-out-of-the-language-server---diagnostics-and-definitions&quot;&gt;Getting more out of the language server - Diagnostics and Definitions&lt;/a&gt;    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#diagnostics&quot; id=&quot;markdown-toc-diagnostics&quot;&gt;Diagnostics&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#definitions&quot; id=&quot;markdown-toc-definitions&quot;&gt;Definitions&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#additional-goodies&quot; id=&quot;markdown-toc-additional-goodies&quot;&gt;Additional goodies&lt;/a&gt;        &lt;ul&gt;
          &lt;li&gt;&lt;a href=&quot;#highlighting-references-to-variable-under-the-cursor&quot; id=&quot;markdown-toc-highlighting-references-to-variable-under-the-cursor&quot;&gt;Highlighting references to variable under the cursor&lt;/a&gt;&lt;/li&gt;
          &lt;li&gt;&lt;a href=&quot;#formatting&quot; id=&quot;markdown-toc-formatting&quot;&gt;Formatting&lt;/a&gt;&lt;/li&gt;
          &lt;li&gt;&lt;a href=&quot;#renaming-variables&quot; id=&quot;markdown-toc-renaming-variables&quot;&gt;Renaming variables&lt;/a&gt;&lt;/li&gt;
          &lt;li&gt;&lt;a href=&quot;#keybindings&quot; id=&quot;markdown-toc-keybindings&quot;&gt;Keybindings&lt;/a&gt;&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#summary&quot; id=&quot;markdown-toc-summary&quot;&gt;Summary&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;the-language-server-protocol--whats-all-the-fuss-about&quot;&gt;The Language Server Protocol – What’s all the fuss about&lt;/h2&gt;
&lt;p&gt;In essence, the Language Server Protocol seeks to separate the functionality of
an IDE into two components:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The Language Server, which is programming-language specific. It analyzes the
code in the programming language you are using and implements the typical
set of functions an IDE provides, such as code completion, looking
up definitions, renaming variables, searching for symbols, etc.&lt;/li&gt;
  &lt;li&gt;The Language Server Client, typically a plugin of the editor, which provides
the services of the Language Server to the user.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Language Server Protocol defines a standardized way for these two
components to interact with each other.  An example of such a communication can
be seen below&lt;sup id=&quot;fnref:language-server-img-src&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:language-server-img-src&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://microsoft.github.io/language-server-protocol/overviews/lsp/img/language-server-sequence.png&quot; alt=&quot;Visualization of the communication between Language Server and Language
Server Client&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The benefit of splitting the services of an IDE into two components becomes
quite apparent: if every Language Server and Language Server Client follows
the specification, it is only necessary to implement &lt;strong&gt;a single Language
Server per programming language&lt;/strong&gt; and a &lt;strong&gt;single Language Server Client per
editor&lt;/strong&gt;.  All editors then benefit as soon as a language server
implements new functionality, and keybindings and features become more
consistent across programming languages within a single editor.&lt;/p&gt;

&lt;p&gt;Especially if you use Vim or NeoVim for programming, this flexibility with
regard to the programming language is amazing: it makes your editor setup more
lightweight (only a single plugin needs to be installed), more consistent,
and readily extensible to new programming languages.  Simply set up the
language server in your editor and you are good to go!&lt;/p&gt;

&lt;h2 id=&quot;neovims-internal-language-server-client&quot;&gt;NeoVim’s internal Language Server Client&lt;/h2&gt;

&lt;p&gt;Due to these benefits, there were many language server client implementations
for Vim (such as &lt;a href=&quot;https://github.com/prabirshrestha/vim-lsp&quot;&gt;vim-lsp&lt;/a&gt;,
&lt;a href=&quot;https://github.com/autozimu/LanguageClient-neovim&quot;&gt;LanguageClient-neovim&lt;/a&gt;,
&lt;a href=&quot;https://github.com/natebosch/vim-lsc&quot;&gt;vim-lsc&lt;/a&gt;).  Nevertheless, I often
experienced speed issues when using these plugins.  Thankfully, NeoVim
announced some time ago that it will ship a built-in language server client.
It is not yet in the stable release, so it requires installing the
development version of NeoVim.  If you use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;brew&lt;/code&gt;, this can for example be done
with the command &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;brew install neovim --HEAD&lt;/code&gt;, which installs the most
recent version of NeoVim directly from the git repository.&lt;/p&gt;

&lt;p&gt;Additionally, there is a Lua-based plugin,
&lt;a href=&quot;https://github.com/neovim/nvim-lspconfig&quot;&gt;nvim-lspconfig&lt;/a&gt;, which provides
routines to integrate a diverse set of language servers.  Its installation
depends on your plugin manager; for
&lt;a href=&quot;https://github.com/junegunn/vim-plug&quot;&gt;vim-plug&lt;/a&gt; it is installed by adding
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Plug 'neovim/nvim-lspconfig'&lt;/code&gt; to your NeoVim config.&lt;/p&gt;
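
&lt;p&gt;With vim-plug, the full snippet could look as follows (the plugin directory &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;~/.vim/plugged&lt;/code&gt; is just an assumption; adjust it to your setup):&lt;/p&gt;
&lt;div class=&quot;language-vimscript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&quot; Install nvim-lspconfig with vim-plug
call plug#begin('~/.vim/plugged')
Plug 'neovim/nvim-lspconfig'
call plug#end()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;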

&lt;p&gt;Afterwards, you can configure your language server using one of the
preconfigured profiles. For example, if you program in Python, you first need to
install the Python language server&lt;/p&gt;
&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pip &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'python-language-server[all]'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;and then add the configuration&lt;/p&gt;
&lt;div class=&quot;language-vimscript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;lua&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; EOF
require&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'nvim_lsp'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;pyls&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;setup&lt;span class=&quot;p&quot;&gt;({})&lt;/span&gt;
EOF
autocmd Filetype &lt;span class=&quot;k&quot;&gt;python&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;setlocal&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;omnifunc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;lua&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;vim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;lsp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;omnifunc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;which executes the lines between the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EOF&lt;/code&gt; markers as Lua code. This is
required because we are configuring a plugin written in Lua.  We also need
to configure Vim to actually use the Language Server Client for providing
completions. This is done in the last line of the snippet by setting
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;omnifunc&lt;/code&gt;, which allows us to trigger completion using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;C-x&amp;gt;&amp;lt;C-o&amp;gt;&lt;/code&gt;.  While
this works, it is far from optimal, as the completion calls take place
synchronously; at the end of the article I will give some pointers on how to
improve the experience by using completion managers.&lt;/p&gt;
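
&lt;p&gt;If &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;C-x&amp;gt;&amp;lt;C-o&amp;gt;&lt;/code&gt; feels awkward to type, you can map it to a shorter key; the mapping below is only a suggestion (some terminals do not pass &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;C-Space&amp;gt;&lt;/code&gt; through correctly, so pick a key that works for you):&lt;/p&gt;
&lt;div class=&quot;language-vimscript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&quot; Trigger omni completion with Ctrl-Space in insert mode
inoremap &amp;lt;C-Space&amp;gt; &amp;lt;C-x&amp;gt;&amp;lt;C-o&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;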

&lt;h2 id=&quot;faster-completion&quot;&gt;Faster completion&lt;/h2&gt;
&lt;p&gt;To my knowledge, there are currently two completion managers that support
NeoVim’s built-in Language Server Client: &lt;a href=&quot;https://github.com/ncm2/ncm2&quot;&gt;NCM2&lt;/a&gt;
and &lt;a href=&quot;https://github.com/nvim-lua/completion-nvim&quot;&gt;completion-nvim&lt;/a&gt;, where the
first is written largely in Python and the second in Lua.  When
I first tested &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;completion-nvim&lt;/code&gt; I ran into some issues, probably due
to the freshness of the project; this may be a different story now.
Nevertheless, I will show here how to set up &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NCM2&lt;/code&gt;, as this is what I went for
in the end.&lt;/p&gt;

&lt;p&gt;First, you of course need to install and enable &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NCM2&lt;/code&gt;. For Plug, this can be done
using the following commands&lt;/p&gt;

&lt;div class=&quot;language-vimscript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;call&lt;/span&gt; plug#begin&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'~/.vim/plugged'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;&quot; Installing ncm2 with Plug&lt;/span&gt;
Plug &lt;span class=&quot;s1&quot;&gt;'ncm2/ncm2'&lt;/span&gt;
Plug &lt;span class=&quot;s1&quot;&gt;'roxma/nvim-yarp'&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;call&lt;/span&gt; plug#end&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

autocmd &lt;span class=&quot;nb&quot;&gt;BufEnter&lt;/span&gt; * &lt;span class=&quot;k&quot;&gt;call&lt;/span&gt; ncm2#enable_for_buffer&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;where the last line activates the completion manager for every buffer you
enter.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NCM2&lt;/code&gt; has support for the NeoVim built-in Language Server Client (see this
&lt;a href=&quot;https://github.com/ncm2/ncm2/pull/178&quot;&gt;pull request&lt;/a&gt;), but it does
not yet seem to be well documented.  The support can be activated by adding
a callback to the language server setup routine we called previously&lt;/p&gt;

&lt;div class=&quot;language-vimscript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;&quot; Setup language server client&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;lua&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; EOF
require&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'nvim_lsp'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;pyls&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;setup&lt;span class=&quot;p&quot;&gt;({&lt;/span&gt;
    on_init &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; require&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'ncm2'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;register_lsp_source&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;})&lt;/span&gt;;
EOF
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;getting-more-out-of-the-language-server---diagnostics-and-definitions&quot;&gt;Getting more out of the language server - Diagnostics and Definitions&lt;/h2&gt;
&lt;p&gt;While this gives us some basic functionality, the Language Server is capable of
doing much more! In the next section I will be concentrating on looking up
definitions of functions/variables and displaying diagnostics. In the end
I will give indications on some further useful functions. For a full list of
features and details on how to use them in NeoVim please consider the &lt;a href=&quot;https://neovim.io/doc/user/lsp.html&quot;&gt;NeoVim
Language Server Client documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;diagnostics&quot;&gt;Diagnostics&lt;/h3&gt;
&lt;p&gt;In particular, I found the default setting of publishing diagnostic information
as virtual text at the end of the line rather annoying.  When I program,
I prefer to see only the code, not additional text indicating that this
function is missing a docstring, etc. The behaviour of NeoVim’s internal
Language Server Client can be changed by modifying the callbacks it triggers when
it receives information from the Language Server.  The default callbacks are defined
&lt;a href=&quot;https://github.com/neovim/neovim/blob/master/runtime/lua/vim/lsp/callbacks.lua&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The callback of interest is
&lt;a href=&quot;https://github.com/neovim/neovim/blob/ca7449db46062098cc9b0c84401655cba7d3a53f/runtime/lua/vim/lsp/callbacks.lua#L76&quot;&gt;“textDocument/publishDiagnostics”&lt;/a&gt;.
Here we simply want to remove the line
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;util.buf_diagnostics_virtual_text(bufnr, result.diagnostics)&lt;/code&gt;, which adds the
virtual text annotations.  We can patch this by adding the following Lua code
to our config (i.e. if you add this in your Vim config you should surround it
with a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lua &amp;lt;&amp;lt; EOF&lt;/code&gt; / &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EOF&lt;/code&gt; block; I left this out for the sake of correct
syntax highlighting)&lt;/p&gt;

&lt;div class=&quot;language-lua highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;--- Evtl. add `lua &amp;lt;&amp;lt; EOF` here&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;--- Define our own callbacks&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;local&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;util&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;require&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'vim.lsp.util'&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;local&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vim&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vim&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;local&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;api&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;api&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;local&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;buf&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;require&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'vim.lsp.buf'&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;vim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lsp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;callbacks&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'textDocument/publishDiagnostics'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;not&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;then&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
  &lt;span class=&quot;kd&quot;&gt;local&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;uri&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;uri&lt;/span&gt;
  &lt;span class=&quot;kd&quot;&gt;local&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bufnr&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;uri_to_bufnr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;uri&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;not&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bufnr&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;then&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;err_message&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;LSP.publishDiagnostics: Couldn't find buffer for &quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;uri&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;-- https://microsoft.github.io/language-server-protocol/specifications/specification-current/#diagnostic&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;-- The diagnostic's severity. Can be omitted. If omitted it is up to the&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;-- client to interpret diagnostics as error, warning, info or hint.&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;-- TODO: Replace this with server-specific heuristics to infer severity.&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;diagnostic&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;ipairs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;diagnostics&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;do&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;diagnostic&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;severity&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;nil&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;then&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;diagnostic&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;severity&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;protocol&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DiagnosticSeverity&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Error&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;

  &lt;span class=&quot;n&quot;&gt;util&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;buf_clear_diagnostics&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bufnr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;-- Always save the diagnostics, even if the buf is not loaded.&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;-- Language servers may report compile or build errors via diagnostics&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;-- Users should be able to find these, even if they're in files which&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;-- are not loaded.&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;util&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;buf_diagnostics_save_positions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bufnr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;diagnostics&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;-- Unloaded buffers should not handle diagnostics.&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;--    When the buffer is loaded, we'll call on_attach, which sends textDocument/didOpen.&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;--    This should trigger another publish of the diagnostics.&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;--&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;-- In particular, this stops a ton of spam when first starting a server for current&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;-- unloaded buffers.&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;not&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;api&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nvim_buf_is_loaded&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bufnr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;then&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;util&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;buf_diagnostics_underline&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bufnr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;diagnostics&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;-- util.buf_diagnostics_virtual_text(bufnr, result.diagnostics)&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;util&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;buf_diagnostics_signs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bufnr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;diagnostics&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;vim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;api&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nvim_command&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;doautocmd User LspDiagnosticsChanged&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;--- Evtl. add `EOF` here&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This does exactly the same as the original callback, but comments out the line
(&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--&lt;/code&gt; starts a comment in Lua) that we mentioned before.&lt;/p&gt;

&lt;p&gt;To see diagnostic information when placing the cursor on a line
with an error, we can call the function &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vim.lsp.util.show_line_diagnostics()&lt;/code&gt;
after some hover time using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;autocmd&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&quot;language-vimscript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;autocmd &lt;span class=&quot;nb&quot;&gt;CursorHold&lt;/span&gt;  &lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;buffer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;lua&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;vim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;lsp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;util&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;show_line_diagnostics&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
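
&lt;p&gt;The hover time before &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CursorHold&lt;/code&gt; fires is controlled by the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;updatetime&lt;/code&gt; option (4000 ms by default). If the diagnostics show up too slowly for your taste, you can lower it; the 300 ms below is just a suggestion:&lt;/p&gt;
&lt;div class=&quot;language-vimscript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&quot; Show line diagnostics after 300 ms of cursor inactivity
set updatetime=300
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;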

&lt;p&gt;Additionally, we probably want to set the colors used to indicate an error or
warning to fit the color scheme we are using.  This can be done by defining the
appropriate highlight groups&lt;/p&gt;

&lt;div class=&quot;language-vimscript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;&quot; Highlighting applied to floating window&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;highlight&lt;/span&gt; LspDiagnosticsErrorFloating guifg&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;#fb4934&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;gui&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;NONE&lt;/span&gt; ctermfg&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;NONE&lt;/span&gt; ctermbg&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;NONE&lt;/span&gt; cterm&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;NONE&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;highlight&lt;/span&gt; LspDiagnosticsWarningFloating guifg&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;#fabd2f&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;gui&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;NONE&lt;/span&gt; ctermfg&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;NONE&lt;/span&gt; ctermbg&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;NONE&lt;/span&gt; cterm&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;NONE&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;highlight&lt;/span&gt; LspDiagnosticsInfoFloating guifg&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;#83a598&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;gui&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;NONE&lt;/span&gt; ctermfg&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;NONE&lt;/span&gt; ctermbg&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;NONE&lt;/span&gt; cterm&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;NONE&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;&quot; Highlighting applied to code&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;highlight&lt;/span&gt; LspDiagnosticsUnderlineError guifg&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;NONE&lt;/span&gt; guibg&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;NONE&lt;/span&gt; guisp&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;#fb4934&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;gui&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;undercurl ctermfg&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;NONE&lt;/span&gt; ctermbg&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;NONE&lt;/span&gt; cterm&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;undercurl
&lt;span class=&quot;nb&quot;&gt;highlight&lt;/span&gt; LspDiagnosticsUnderlineWarning guifg&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;NONE&lt;/span&gt; guibg&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;NONE&lt;/span&gt; guisp&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;#fabd2f&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;gui&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;undercurl ctermfg&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;NONE&lt;/span&gt; ctermbg&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;NONE&lt;/span&gt; cterm&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;undercurl
&lt;span class=&quot;nb&quot;&gt;highlight&lt;/span&gt; LspDiagnosticsUnderlineInfo guifg&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;NONE&lt;/span&gt; guibg&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;NONE&lt;/span&gt; guisp&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;#83a598&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;gui&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;undercurl ctermfg&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;NONE&lt;/span&gt; ctermbg&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;NONE&lt;/span&gt; cterm&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;undercurl
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The values above are set to match the &lt;a href=&quot;https://github.com/lifepillar/vim-gruvbox8&quot;&gt;gruvbox
8 color scheme&lt;/a&gt; and a terminal
supporting true color. They can be adapted to your personal
taste&lt;sup id=&quot;fnref:vim-highlighting&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:vim-highlighting&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;.  If you want them to change depending on your
color scheme, you should check out Vim’s predefined highlight groups (see
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;:help highlight-groups&lt;/code&gt;) and map accordingly using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;highlight link&lt;/code&gt;.  Keep in mind
to set these links after loading your color scheme, as color schemes typically
clear previously defined highlights.&lt;/p&gt;
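&lt;p&gt;A minimal sketch of this linking approach: the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Error&lt;/code&gt;/&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Warning&lt;/code&gt; group names here mirror the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Info&lt;/code&gt; variant used above and the chosen target groups are just an illustration; both may need adjusting to your NeoVim version and taste.&lt;/p&gt;
&lt;div class=&quot;language-vimscript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&quot; Reuse predefined groups (see :help highlight-groups) so the colors
&quot; follow the active color scheme; run this *after* :colorscheme.
highlight link LspDiagnosticsErrorFloating ErrorMsg
highlight link LspDiagnosticsWarningFloating WarningMsg
highlight link LspDiagnosticsInfoFloating Directory
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;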

&lt;h3 id=&quot;definitions&quot;&gt;Definitions&lt;/h3&gt;

&lt;p&gt;One of the best features of vim in my opinion is the tagstack (if you don’t
know about it, check out &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;:help tagstack&lt;/code&gt; and get your mind blown away ;) ).
The tagstack enables exactly the workflow you need when programming:
&lt;em&gt;understanding what the function you are calling does under the hood.&lt;/em&gt;  It
lets you jump to the definitions of functions, variables etc. and, once you
have the information you need, jump back to where you came from.  The Language Server
Client allows you to jump to the definition of an object using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;:lua
vim.lsp.buf.definition()&lt;/code&gt; and we could just manually map this to the key
combination &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;C-]&amp;gt;&lt;/code&gt;, which is used to jump to a tag.  Yet this would discard all
the other fancy combinations that are already defined for the tagstack, such as
jumping to a definition in a new window or in a preview window.  Thus the more
holistic approach is to override the tagstack functionality with
a custom &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tagfunc&lt;/code&gt;. I found a &lt;a href=&quot;https://daisuzu.hatenablog.com/entry/2019/12/06/005543&quot;&gt;single
reference&lt;/a&gt; on the
Japanese internet where somebody does exactly this. I adapted the code slightly
to work with the NeoVim-internal Language Server Client&lt;sup id=&quot;fnref:additional-reference&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:additional-reference&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;div class=&quot;language-lua highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kd&quot;&gt;local&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lsp&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;require&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'vim.lsp'&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;local&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;util&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;require&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'vim.lsp.util'&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;local&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;log&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;require&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'vim.lsp.log'&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;local&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vim&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vim&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;-- Ref (in Japanese): https://daisuzu.hatenablog.com/entry/2019/12/06/005543&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;-- Ref: https://qrunch.net/@igrep/entries/K6sUDofcmvtnRqzk&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;function&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;tagfunc_nvim_lsp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pattern&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;flags&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;info&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
 &lt;span class=&quot;kd&quot;&gt;local&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{}&lt;/span&gt;
 &lt;span class=&quot;kd&quot;&gt;local&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;isSearchingFromNormalMode&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;flags&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;c&quot;&lt;/span&gt;

 &lt;span class=&quot;kd&quot;&gt;local&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;method&lt;/span&gt;
 &lt;span class=&quot;kd&quot;&gt;local&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;params&lt;/span&gt;
 &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;isSearchingFromNormalMode&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;then&lt;/span&gt;
   &lt;span class=&quot;c1&quot;&gt;-- Jump to the definition of the symbol under the cursor&lt;/span&gt;
   &lt;span class=&quot;c1&quot;&gt;-- when called by CTRL-]&lt;/span&gt;
   &lt;span class=&quot;n&quot;&gt;method&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'textDocument/definition'&lt;/span&gt;
   &lt;span class=&quot;n&quot;&gt;params&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;util&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;make_position_params&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
 &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;
   &lt;span class=&quot;c1&quot;&gt;-- NOTE: Currently I'm not sure how this clause is tested&lt;/span&gt;
   &lt;span class=&quot;c1&quot;&gt;--       because `:tag` command doesn't seem to use `tagfunc`.&lt;/span&gt;

   &lt;span class=&quot;c1&quot;&gt;-- Search with `pattern` when called by ex command (e.g. `:tag`)&lt;/span&gt;
   &lt;span class=&quot;n&quot;&gt;method&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'workspace/symbol'&lt;/span&gt;

   &lt;span class=&quot;c1&quot;&gt;-- Delete &quot;\&amp;lt;&quot; from `pattern` when prepended.&lt;/span&gt;
   &lt;span class=&quot;c1&quot;&gt;-- Perhaps the server doesn't support regex in vim!&lt;/span&gt;
   &lt;span class=&quot;n&quot;&gt;params&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{}&lt;/span&gt;
   &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;string.find&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;flags&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'i'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;then&lt;/span&gt;
     &lt;span class=&quot;c1&quot;&gt;-- string.gsub (not string.sub, which takes numeric indices)&lt;/span&gt;
     &lt;span class=&quot;n&quot;&gt;params&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;query&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;string.gsub&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pattern&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'^\\&amp;lt;'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;''&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
   &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;
     &lt;span class=&quot;n&quot;&gt;params&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;query&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pattern&lt;/span&gt;
   &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
 &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
 &lt;span class=&quot;kd&quot;&gt;local&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;client_id_to_results&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;err&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lsp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;buf_request_sync&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;method&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;params&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;800&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
 &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;err&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;then&lt;/span&gt;
   &lt;span class=&quot;nb&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'Error when calling tagfunc: '&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;..&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;err&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
   &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;
 &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;

 &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_client_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;results&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;pairs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;client_id_to_results&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;do&lt;/span&gt;
   &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lsp_result&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;ipairs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;results&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;do&lt;/span&gt;
     &lt;span class=&quot;kd&quot;&gt;local&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;
     &lt;span class=&quot;kd&quot;&gt;local&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;location&lt;/span&gt;
     &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;isSearchingFromNormalMode&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;then&lt;/span&gt;
       &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pattern&lt;/span&gt;
       &lt;span class=&quot;n&quot;&gt;location&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lsp_result&lt;/span&gt;
     &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;
       &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lsp_result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;
       &lt;span class=&quot;n&quot;&gt;location&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lsp_result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;location&lt;/span&gt;
     &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
     &lt;span class=&quot;kd&quot;&gt;local&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;location_for_tagfunc&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
       &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
       &lt;span class=&quot;n&quot;&gt;filename&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;uri_to_fname&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;location&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;uri&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
       &lt;span class=&quot;n&quot;&gt;cmd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;tostring&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;location&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;start&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;line&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
     &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
     &lt;span class=&quot;nb&quot;&gt;table.insert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;location_for_tagfunc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
 &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
 &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
 &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Be aware that &lt;strong&gt;the above code is written in Lua&lt;/strong&gt;, thus it must either be
integrated using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lua &amp;lt;&amp;lt; EOF&lt;/code&gt; trick or saved in a separate file under
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;~/.config/nvim/lua&lt;/code&gt; and loaded prior to usage. In my case I save the above
code under &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;~/.config/nvim/lua/tagfunc_nvim_lsp.lua&lt;/code&gt; and load it in Vim using&lt;/p&gt;
&lt;div class=&quot;language-vimscript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;lua&lt;/span&gt; require &lt;span class=&quot;s1&quot;&gt;'tagfunc_nvim_lsp'&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;setlocal&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;tagfunc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;lua&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;tagfunc_nvim_lsp
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
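&lt;p&gt;If you prefer not to create a separate file, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lua &amp;lt;&amp;lt; EOF&lt;/code&gt; heredoc variant sketched below embeds the Lua code directly in your vimrc; the comment marks where the function definition from above would be pasted:&lt;/p&gt;
&lt;div class=&quot;language-vimscript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;lua &amp;lt;&amp;lt; EOF
-- paste the contents of tagfunc_nvim_lsp.lua here
EOF
setlocal tagfunc=v:lua.tagfunc_nvim_lsp
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;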

&lt;h3 id=&quot;additional-goodies&quot;&gt;Additional goodies&lt;/h3&gt;

&lt;h4 id=&quot;highlighting-references-to-variable-under-the-cursor&quot;&gt;Highlighting references to the variable under the cursor&lt;/h4&gt;
&lt;div class=&quot;language-vimscript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;autocmd &lt;span class=&quot;nb&quot;&gt;CursorHold&lt;/span&gt;  &lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;buffer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;lua&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;vim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;lsp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;buf&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;document_highlight&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
autocmd &lt;span class=&quot;nb&quot;&gt;CursorHoldI&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;buffer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;lua&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;vim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;lsp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;buf&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;document_highlight&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
autocmd &lt;span class=&quot;nb&quot;&gt;CursorMoved&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;buffer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;lua&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;vim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;lsp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;buf&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;clear_references&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;&quot; References to the same variable&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;highlight&lt;/span&gt; LspReference guifg&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;NONE&lt;/span&gt; guibg&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;#665c54&lt;/span&gt; guisp&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;NONE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;gui&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;NONE&lt;/span&gt; cterm&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;NONE&lt;/span&gt; ctermfg&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;NONE&lt;/span&gt; ctermbg&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;59&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;highlight&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;!&lt;/span&gt; link LspReferenceText LspReference
&lt;span class=&quot;nb&quot;&gt;highlight&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;!&lt;/span&gt; link LspReferenceRead LspReference
&lt;span class=&quot;nb&quot;&gt;highlight&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;!&lt;/span&gt; link LspReferenceWrite LspReference
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h4 id=&quot;formatting&quot;&gt;Formatting&lt;/h4&gt;
&lt;p&gt;The snippet below formats the file whenever a write to disk is triggered.
It needed some additional fixes to prevent the cursor position from
being lost after the reformat&lt;sup id=&quot;fnref:preserve-cursor-position&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:preserve-cursor-position&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;div class=&quot;language-vimscript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;!&lt;/span&gt; Preserve&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;command&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;try&lt;/span&gt;
        &lt;span class=&quot;c&quot;&gt;&quot; Preparation: save last search, and cursor position.&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;l&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;win_view &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;winsaveview&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;l&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;old_query &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;getreg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'/'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;silent&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;!&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;execute&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'keepjumps '&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;a:command&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;finally&lt;/span&gt;
        &lt;span class=&quot;c&quot;&gt;&quot; try restore / reg and cursor position&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;call&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;winrestview&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;l&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;win_view&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;call&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;setreg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'/'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;l&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;old_query&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;endtry&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;endfunction&lt;/span&gt;

autocmd &lt;span class=&quot;nb&quot;&gt;BufWritePre&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;buffer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;call&lt;/span&gt; Preserve&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'lua vim.lsp.buf.formatting_sync(nil, 1000)'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h4 id=&quot;renaming-variables&quot;&gt;Renaming variables&lt;/h4&gt;
&lt;p&gt;The below snippet allows you to rename the variable under the cursor:&lt;/p&gt;
&lt;div class=&quot;language-vimscript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;!&lt;/span&gt; LspRename&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;call&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;inputsave&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;l&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;newname &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'Rename to: '&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;call&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;inputrestore&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
    &lt;span class=&quot;c&quot;&gt;&quot; Pass the new name via _A instead of concatenating it into the&lt;/span&gt;
    &lt;span class=&quot;c&quot;&gt;&quot; expression, so names containing quotes cannot break the Lua call&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;call&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;luaeval&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'vim.lsp.buf.rename(_A)'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;l&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;newname&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;endfunction&lt;/span&gt;

nnoremap &lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;buffer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt;leader&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;lr&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt;cmd&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;call&lt;/span&gt; LspRename&lt;span class=&quot;p&quot;&gt;()&amp;lt;&lt;/span&gt;CR&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h4 id=&quot;keybindings&quot;&gt;Keybindings&lt;/h4&gt;
&lt;p&gt;In my vim config I wrapped a lot of the above functionality into
a function called SetupLsp, which I then call depending on the file type.&lt;/p&gt;
&lt;div class=&quot;language-vimscript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;function&lt;/span&gt; SetupLsp&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
    nnoremap &lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;silent&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;buffer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;gd&lt;/span&gt;    &lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt;cmd&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;lua&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;vim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;lsp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;buf&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;declaration&lt;span class=&quot;p&quot;&gt;()&amp;lt;&lt;/span&gt;CR&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt;
    nnoremap &lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;silent&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;buffer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt; K  &lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt;cmd&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;lua&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;vim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;lsp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;buf&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;hover&lt;span class=&quot;p&quot;&gt;()&amp;lt;&lt;/span&gt;CR&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt;
    nnoremap &lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;silent&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;buffer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt;cmd&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;lua&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;vim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;lsp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;buf&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;signature_help&lt;span class=&quot;p&quot;&gt;()&amp;lt;&lt;/span&gt;CR&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt;
    inoremap &lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;silent&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;buffer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt;cmd&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;lua&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;vim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;lsp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;buf&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;signature_help&lt;span class=&quot;p&quot;&gt;()&amp;lt;&lt;/span&gt;CR&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt;
    nnoremap &lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;silent&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;buffer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt;leader&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;ls&lt;/span&gt;    &lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt;cmd&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;lua&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;vim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;lsp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;buf&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;document_symbol&lt;span class=&quot;p&quot;&gt;()&amp;lt;&lt;/span&gt;CR&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt;
    nnoremap &lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;silent&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;buffer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt; gW    &lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt;cmd&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;lua&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;vim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;lsp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;buf&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;workspace_symbol&lt;span class=&quot;p&quot;&gt;()&amp;lt;&lt;/span&gt;CR&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt;
    nnoremap &lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;buffer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt;leader&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;lr&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt;cmd&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;call&lt;/span&gt; LspRename&lt;span class=&quot;p&quot;&gt;()&amp;lt;&lt;/span&gt;CR&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt;
    autocmd &lt;span class=&quot;nb&quot;&gt;CursorHold&lt;/span&gt;  &lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;buffer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;lua&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;vim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;lsp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;buf&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;document_highlight&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
    autocmd &lt;span class=&quot;nb&quot;&gt;CursorHold&lt;/span&gt;  &lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;buffer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;lua&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;vim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;lsp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;util&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;show_line_diagnostics&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
    autocmd &lt;span class=&quot;nb&quot;&gt;CursorHoldI&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;buffer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;lua&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;vim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;lsp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;buf&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;document_highlight&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
    autocmd &lt;span class=&quot;nb&quot;&gt;CursorMoved&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;buffer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;lua&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;vim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;lsp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;buf&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;clear_references&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
    autocmd &lt;span class=&quot;nb&quot;&gt;BufWritePre&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;buffer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;call&lt;/span&gt; Preserve&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'lua vim.lsp.buf.formatting_sync(nil, 1000)'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;lua&lt;/span&gt; require &lt;span class=&quot;s1&quot;&gt;'tagfunc_nvim_lsp'&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;setlocal&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;tagfunc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;lua&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;tagfunc_nvim_lsp
    &lt;span class=&quot;k&quot;&gt;setlocal&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;signcolumn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;yes
&lt;span class=&quot;k&quot;&gt;endfunction&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;function&lt;/span&gt; SetupPython&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;setlocal&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;colorcolumn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;80&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;setlocal&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;tw&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;79&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;setlocal&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;spell&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;setlocal&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;tabstop&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;4&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;shiftwidth&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;4&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;expandtab&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;call&lt;/span&gt; SetupLsp&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;endfunction&lt;/span&gt;

autocmd Filetype &lt;span class=&quot;k&quot;&gt;python&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;call&lt;/span&gt; SetupPython&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;
&lt;p&gt;I hope you found the above pointers and snippets helpful.  For me, switching to
the built-in language server client brought a huge jump in the performance and
responsiveness of my favorite editor :).  Let me know if you have any comments,
suggestions or issues!&lt;/p&gt;

&lt;p&gt;See you next time!&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:language-server-img-src&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;The visualization is taken from the language server protocol documentation available &lt;a href=&quot;https://microsoft.github.io/language-server-protocol/overviews/lsp/overview/&quot;&gt;here&lt;/a&gt;. &lt;a href=&quot;#fnref:language-server-img-src&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:vim-highlighting&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Check out &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;:help highlight-cterm&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;:help highlight-gui&lt;/code&gt; for the available settings. &lt;a href=&quot;#fnref:vim-highlighting&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:additional-reference&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;There was a further reference on the path to accomplishing this &lt;a href=&quot;https://daisuzu.hatenablog.com/entry/2019/12/06/005543&quot;&gt;here&lt;/a&gt;, but the webpage seems to be down now. &lt;a href=&quot;#fnref:additional-reference&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:preserve-cursor-position&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;For further information see &lt;a href=&quot;https://vi.stackexchange.com/questions/7761/how-to-restore-the-position-of-the-cursor-after-executing-a-normal-command&quot;&gt;this stackexchange discussion&lt;/a&gt;, from which the command originated. &lt;a href=&quot;#fnref:preserve-cursor-position&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Sat, 31 Oct 2020 00:00:00 +0000</pubDate>
        <link>https://ExpectationMax.github.io/2020/NeoVims-Language-Server-Client/</link>
        <guid isPermaLink="true">https://ExpectationMax.github.io/2020/NeoVims-Language-Server-Client/</guid>
        
        <category>development</category>
        
        
        
      </item>
    
      <item>
        <title>Two of our papers were accepted at ICML 2020</title>
        <description>&lt;p&gt;This week we presented two of our papers at ICML 2020. It was a great
experience to talk with others about our research and to think about future
directions and applications of our methods.  I want to use this page to point
out reference materials for these publications.&lt;/p&gt;

&lt;h2 id=&quot;set-functions-for-time-series&quot;&gt;Set Functions for Time Series&lt;/h2&gt;

&lt;p&gt;“Set Functions for Time Series” is work done together with &lt;a href=&quot;https://michaelmoor.ml/&quot;&gt;Michael
Moor&lt;/a&gt;, &lt;a href=&quot;https://christian.bock.ml&quot;&gt;Christian Bock&lt;/a&gt;, &lt;a href=&quot;https://bastian.rieck.me/&quot;&gt;Bastian
Rieck&lt;/a&gt; and my PhD Advisor Karsten Borgwardt.&lt;/p&gt;

&lt;p&gt;At its core, we propose to rephrase learning on irregularly sampled time series
data as a set classification problem. This removes the need to impute the
time series prior to applying Deep Learning models and allows their
direct application.  Michael Larionov wrote a nice summary of our paper
&lt;a href=&quot;https://towardsdatascience.com/set-attention-models-for-time-series-classification-c09360a60349&quot;&gt;on towards data science&lt;/a&gt;
where he explains the core components of the model.&lt;/p&gt;
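&lt;p&gt;As a rough illustration of this set rephrasing (a minimal sketch with hypothetical field names, not the actual SeFT input pipeline), each observation becomes a (time, value, modality) tuple and the whole series becomes one unordered set:&lt;/p&gt;

```python
def time_series_to_set(series):
    """Flatten an irregularly sampled multivariate series into an
    unordered set of (time, value, modality) observation tuples."""
    return {
        (t, value, modality)
        for modality, observations in series.items()
        for t, value in observations
    }

# Hypothetical patient record: two vitals, sampled at different times
series = {"heart_rate": [(0.0, 72.0), (1.5, 75.0)], "temperature": [(0.5, 36.6)]}
observations = time_series_to_set(series)
# A set-function model can now consume `observations` directly,
# with no shared imputation grid required.
```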

&lt;p&gt;For further details please see the links below:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Paper:&lt;/strong&gt;
&lt;a href=&quot;https://proceedings.icml.cc/static/paper_files/icml/2020/4750-Paper.pdf&quot;&gt;ICML 2020&lt;/a&gt;,
&lt;a href=&quot;https://arxiv.org/abs/1909.12064&quot;&gt;arXiv&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Talk:&lt;/strong&gt;
&lt;a href=&quot;https://icml.cc/virtual/2020/poster/6545&quot;&gt;ICML 2020&lt;/a&gt;, &lt;a href=&quot;/assets/2020-07-16-ICML-SeFT-TopoAE/SeFT-slides.pdf&quot;&gt;Slides&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt;
&lt;a href=&quot;https://github.com/BorgwardtLab/Set_Functions_for_Time_Series&quot;&gt;Models&lt;/a&gt;, &lt;a href=&quot;https://github.com/ExpectationMax/medical_ts_datasets&quot;&gt;Datasets&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;topological-autoencoders&quot;&gt;Topological Autoencoders&lt;/h2&gt;
&lt;p&gt;The work “Topological Autoencoders” was joint work with my colleagues &lt;a href=&quot;https://michaelmoor.ml/&quot;&gt;Michael
Moor&lt;/a&gt; and &lt;a href=&quot;https://bastian.rieck.me/&quot;&gt;Bastian Rieck&lt;/a&gt;
and my PhD Advisor Karsten Borgwardt.&lt;/p&gt;

&lt;p&gt;&lt;img alt=&quot;Training visualization Topological Autoencoder&quot; src=&quot;/assets/2020-07-16-ICML-SeFT-TopoAE/topoae.gif&quot; width=&quot;49%&quot; /&gt;&lt;img alt=&quot;Training visualization Vanilla Autoencoder&quot; src=&quot;/assets/2020-07-16-ICML-SeFT-TopoAE/vanilla.gif&quot; width=&quot;49%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In this work we propose to constrain the topology of the latent representation
of an autoencoder using methods from topological data analysis.
Michael wrote a wonderful blog post giving an intuitive
introduction to the paper &lt;a href=&quot;https://michaelmoor.ml/blog/topoae/main/&quot;&gt;here&lt;/a&gt;.
Above you can see our approach in
action compared to a vanilla autoencoder.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Paper:&lt;/strong&gt;
&lt;a href=&quot;https://proceedings.icml.cc/static/paper_files/icml/2020/613-Paper.pdf&quot;&gt;ICML 2020&lt;/a&gt;,
&lt;a href=&quot;https://arxiv.org/abs/1906.00722&quot;&gt;arXiv&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Talk:&lt;/strong&gt;
&lt;a href=&quot;https://icml.cc/virtual/2020/poster/5851&quot;&gt;ICML 2020&lt;/a&gt;, &lt;a href=&quot;/assets/2020-07-16-ICML-SeFT-TopoAE/TopoAE-slides.pdf&quot;&gt;Slides&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href=&quot;https://github.com/BorgwardtLab/topological-autoencoders&quot;&gt;GitHub&lt;/a&gt;&lt;/p&gt;

</description>
        <pubDate>Fri, 17 Jul 2020 00:00:00 +0000</pubDate>
        <link>https://ExpectationMax.github.io/2020/ICML-Presentation/</link>
        <guid isPermaLink="true">https://ExpectationMax.github.io/2020/ICML-Presentation/</guid>
        
        <category>conferences</category>
        
        <category>publications</category>
        
        
        <category>summaries</category>
        
        <category>papers</category>
        
      </item>
    
      <item>
        <title>MLSS2020 Causality Lectures - a brief summary</title>
        <description>&lt;p&gt;The first week of the virtual Machine Learning Summer School in Tübingen is over,
and it is time to take a brief look back at the lessons learned and the
experiences gained during this time. In the following I will briefly summarize
some of the insights I gained during the first week.&lt;/p&gt;

&lt;h2 id=&quot;causality-and-causal-inference-lectures&quot;&gt;Causality and Causal Inference lectures&lt;/h2&gt;
&lt;p&gt;The Causality lectures were held by &lt;a href=&quot;https://www.is.mpg.de/~bs&quot;&gt;Bernhard
Schölkopf&lt;/a&gt; and &lt;a href=&quot;https://www.is.mpg.de/person/sbauer&quot;&gt;Stefan
Bauer&lt;/a&gt;. As a heads up, I am most definitely
not an expert in this field and am solely summarizing the points
I personally found most interesting :).&lt;/p&gt;

&lt;h3 id=&quot;causality-i&quot;&gt;Causality I&lt;/h3&gt;
&lt;p&gt;The first lecture by Prof. B. Schölkopf was a more general introduction to
causality: the terminology needed to understand the literature, how to
infer causal structure in smaller-scale experiments with stochastic or
deterministic relationships between observations, and finally the implications
of causality for semi-supervised learning.&lt;/p&gt;

&lt;p&gt;Here I found the work on deriving causality for deterministic cases of
particular interest (see &lt;a href=&quot;https://arxiv.org/abs/1203.3475&quot;&gt;this&lt;/a&gt; paper).  In
this work, the authors use an assumption termed the &lt;em&gt;independence of input and
mechanism&lt;/em&gt; to derive a causal inference rule which does not require any
assumptions on noise. The assumption states that $p(C)$ (the probability
distribution of the cause) and $p(E|C)$ (the probability distribution of the
effect conditional on the cause) should be independent. This implies that the
distribution of the effect would then in some way be dependent on the function
$f$ mapping from cause to effect, and thus $\mathrm{Cov}(\log f', p_C) = 0$ (encoding
the independence assumption) and $\mathrm{Cov}(\log (f^{-1})', p_E) &amp;gt; 0$ (encoding the
dependence between the function mapping from cause to effect and the
distribution of the effect). This leads to the inference rule that
$X \rightarrow Y$ if $\int \log | f'(x) | p(x) \, dx \leq \int \log | (f^{-1})'(y) |
p(y) \, dy$. This can also be computed using empirical estimators for the
slope of the function mapping between $X$ and $Y$.&lt;/p&gt;
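&lt;p&gt;The slope-based criterion can be sketched in a few lines of Python (a minimal illustration of the empirical estimator under the deterministic, monotonic setting; function names are my own, and this is not the authors&amp;#8217; reference implementation):&lt;/p&gt;

```python
import numpy as np

def igci_score(x, y):
    """Empirical estimate of the integral of log|f'(x)| p(x) dx,
    using slopes between consecutive points after sorting by x."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    dx, dy = np.diff(xs), np.diff(ys)
    mask = (dx != 0) & (dy != 0)  # skip duplicate observations
    return float(np.mean(np.log(np.abs(dy[mask] / dx[mask]))))

def infer_direction(x, y):
    # X causes Y if the forward slope integral is the smaller one
    return "X causes Y" if igci_score(x, y) <= igci_score(y, x) else "Y causes X"

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 1000)
y = x ** 3  # deterministic, noise-free mechanism f
direction = infer_direction(x, y)
```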

&lt;p&gt;Finally, B. Schölkopf presented implications of causality in the domain of
semi-supervised learning. In particular, if &lt;em&gt;independence of input and
mechanism&lt;/em&gt; is true, he shows that semi-supervised learning can theoretically not
benefit learning in the causal direction. In other words when a model is trying
to infer effect from cause (thus $p(E|C)$), additional data from the cause
distribution $p(C)$ will not help the model learn due to $ p(C) $ being
independent of $ p(E|C) $. In contrast, if learning is in the anti-causal
direction, semi-supervised learning can be beneficial.&lt;/p&gt;

&lt;h3 id=&quot;causality-ii&quot;&gt;Causality II&lt;/h3&gt;
&lt;p&gt;The second talk by Stefan Bauer focused on how to infer Structural Causal
Models and how causality can be integrated with Deep Learning approaches.&lt;/p&gt;

&lt;p&gt;In the first part of his talk, Stefan showed bridges between causal inference
and non-linear ICA, and further showed that there will always be both a causal as
well as an anti-causal linear model if additive Gaussian noise is assumed.
Afterwards, Stefan talked about time series models and how ODEs are perfect
models for causal structure in the presence of time.&lt;/p&gt;

&lt;p&gt;In the second half, the topic switched to a causal perspective on
representation learning. Here the bridges between disentangled representations
and causality become evident. This is due to the assumption that a change in
the distribution of the data would arise from a sparse change in causal
conditionals or causal mechanisms, thus leading to similar properties as
desired in disentangled representations.  Nevertheless, recent research shows
that disentangled representations cannot be learned in a completely
unsupervised way (see &lt;a href=&quot;https://arxiv.org/abs/1811.12359&quot;&gt;Locatello et al, ICML
2019&lt;/a&gt;), also leading to potential issues with
the discovery of causal mechanisms.  There is some hope though, as a few labels
seem to help with determining correct disentangled representations (see
&lt;a href=&quot;https://arxiv.org/abs/2002.02886&quot;&gt;Locatello et al., ICML 2020&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Finally, Stefan talked about some exciting work on encoding causal structure
into machine learning architectures.  Here a decoder is designed to resemble
the structure of a general Structural Causal Model and trained to match the
observations of the training data.  These &lt;em&gt;Structural Causal Autoencoders&lt;/em&gt; (see
&lt;a href=&quot;https://arxiv.org/abs/2006.07796&quot;&gt;Leeb et al., under review NeurIPS 2020&lt;/a&gt;)
were shown to yield good performance in learning representations for generating
images and transferring between similar tasks.&lt;/p&gt;

&lt;p&gt;All in all some very exciting directions. I am looking forward to seeing the
future development of Causality and Machine Learning.&lt;/p&gt;
</description>
        <pubDate>Fri, 05 Jun 2020 00:00:00 +0000</pubDate>
        <link>https://ExpectationMax.github.io/2020/First-week-MLSS/</link>
        <guid isPermaLink="true">https://ExpectationMax.github.io/2020/First-week-MLSS/</guid>
        
        <category>conferences</category>
        
        
        <category>summaries</category>
        
        <category>MLSS2020</category>
        
      </item>
    
      <item>
        <title>Summary of Talks and Posters from ICLR 2020</title>
        <description>&lt;p&gt;This year’s ICLR was special for me (and for many others as well), as it was
the first virtual conference I have ever attended (and, given the
current situation with COVID-19, surely not the last). At the virtual ICLR,
posters were replaced with short prerecorded videos of 5 minutes in which the
authors briefly present their work.  These videos can be accessed at any time,
independent of the “poster session” in which it was possible to talk to one or
more authors of the paper.  One definite benefit of the conference being completely
virtual is that it allows spreading out looking at the “posters”
over a longer time (if one is willing to sacrifice the possibility of talking with the
authors in a virtual poster session).&lt;/p&gt;

&lt;p&gt;Overall, I found this format very appealing, as it offers a quick
perspective on each work and makes it possible to look at many works without getting
overwhelmed, since the presentations are usually kept at a high level. For more
detailed information it is always possible to pull up the paper.&lt;/p&gt;

&lt;p&gt;A few weeks after the virtual conference, I finally managed to summarize
some of the papers and posters I looked at, and thought I might as well share
them with anyone who could be interested. You can find a subset of the papers
I selected for reading, accompanied by a sometimes more, sometimes less
detailed summary, below.  I must say that I am still not completely finished
with all the papers I marked and thus might update this page at a later time.&lt;/p&gt;

&lt;h2 id=&quot;table-of-contents&quot;&gt;Table of Contents&lt;/h2&gt;
&lt;ul id=&quot;markdown-toc&quot;&gt;
  &lt;li&gt;&lt;a href=&quot;#table-of-contents&quot; id=&quot;markdown-toc-table-of-contents&quot;&gt;Table of Contents&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#anomaly-detection&quot; id=&quot;markdown-toc-anomaly-detection&quot;&gt;Anomaly Detection&lt;/a&gt;    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#deep-semi-supervised-anomaly-detection&quot; id=&quot;markdown-toc-deep-semi-supervised-anomaly-detection&quot;&gt;Deep Semi-Supervised Anomaly Detection&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#input-complexity-and-out-of-distribution-detection-with-likelihood-based-generative-models&quot; id=&quot;markdown-toc-input-complexity-and-out-of-distribution-detection-with-likelihood-based-generative-models&quot;&gt;Input Complexity and Out-of-distribution Detection with Likelihood-based Generative Models&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#generative-models&quot; id=&quot;markdown-toc-generative-models&quot;&gt;Generative Models&lt;/a&gt;    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#understanding-the-limitations-of-conditional-generative-models&quot; id=&quot;markdown-toc-understanding-the-limitations-of-conditional-generative-models&quot;&gt;Understanding the Limitations of Conditional Generative Models&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#your-classifier-is-secretly-an-energy-based-model-and-you-should-treat-it-like-one&quot; id=&quot;markdown-toc-your-classifier-is-secretly-an-energy-based-model-and-you-should-treat-it-like-one&quot;&gt;Your classifier is secretly an energy based model and you should treat it like one&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#gradient-estimation&quot; id=&quot;markdown-toc-gradient-estimation&quot;&gt;Gradient Estimation&lt;/a&gt;    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#estimating-gradients-for-discrete-random-variables-by-sampling-without-replacement&quot; id=&quot;markdown-toc-estimating-gradients-for-discrete-random-variables-by-sampling-without-replacement&quot;&gt;Estimating Gradients for Discrete Random Variables by Sampling without Replacement&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#sumo-unbiased-estimation-of-log-marginal-probability-for-latent-variable-models&quot; id=&quot;markdown-toc-sumo-unbiased-estimation-of-log-marginal-probability-for-latent-variable-models&quot;&gt;SUMO: Unbiased Estimation of Log Marginal Probability for Latent Variable Models&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#pooling-and-set-functions&quot; id=&quot;markdown-toc-pooling-and-set-functions&quot;&gt;Pooling and Set Functions&lt;/a&gt;    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#fspool-learning-set-representations-with-featurewise-sort-pooling&quot; id=&quot;markdown-toc-fspool-learning-set-representations-with-featurewise-sort-pooling&quot;&gt;FSPool: Learning Set Representations with Featurewise Sort Pooling&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#on-universal-equivariant-set-networks&quot; id=&quot;markdown-toc-on-universal-equivariant-set-networks&quot;&gt;On Universal Equivariant Set Networks&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#structpool-structured-graph-pooling-via-conditional-random-fields&quot; id=&quot;markdown-toc-structpool-structured-graph-pooling-via-conditional-random-fields&quot;&gt;StructPool: Structured Graph Pooling via Conditional Random Fields&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#representation-learning&quot; id=&quot;markdown-toc-representation-learning&quot;&gt;Representation Learning&lt;/a&gt;    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#disentanglement-by-nonlinear-ica-with-general-incompressible-flow-networks-gin&quot; id=&quot;markdown-toc-disentanglement-by-nonlinear-ica-with-general-incompressible-flow-networks-gin&quot;&gt;Disentanglement by Nonlinear ICA with General Incompressible-flow Networks (GIN)&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#infograph-unsupervised-and-semi-supervised-graph-level-representation-learning-via-mutual-information-maximization&quot; id=&quot;markdown-toc-infograph-unsupervised-and-semi-supervised-graph-level-representation-learning-via-mutual-information-maximization&quot;&gt;InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#mutual-information-gradient-estimation-for-representation-learning&quot; id=&quot;markdown-toc-mutual-information-gradient-estimation-for-representation-learning&quot;&gt;Mutual Information Gradient Estimation for Representation Learning&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#on-mutual-information-maximization-for-representation-learning&quot; id=&quot;markdown-toc-on-mutual-information-maximization-for-representation-learning&quot;&gt;On Mutual Information Maximization for Representation Learning&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#seq2seq-models&quot; id=&quot;markdown-toc-seq2seq-models&quot;&gt;Seq2Seq models&lt;/a&gt;    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#are-transformers-universal-approximators-of-sequence-to-sequence-functions&quot; id=&quot;markdown-toc-are-transformers-universal-approximators-of-sequence-to-sequence-functions&quot;&gt;Are Transformers universal approximators of sequence-to-sequence functions?&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#mogrifier-lstm&quot; id=&quot;markdown-toc-mogrifier-lstm&quot;&gt;Mogrifier LSTM&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#reformer-the-efficient-transformer&quot; id=&quot;markdown-toc-reformer-the-efficient-transformer&quot;&gt;Reformer: The Efficient Transformer&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#understanding-deep-learning&quot; id=&quot;markdown-toc-understanding-deep-learning&quot;&gt;Understanding Deep Learning&lt;/a&gt;    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#four-things-everyone-should-know-to-improve-batch-normalization&quot; id=&quot;markdown-toc-four-things-everyone-should-know-to-improve-batch-normalization&quot;&gt;Four Things Everyone Should Know to Improve Batch Normalization&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#on-the-variance-of-the-adaptive-learning-rate-and-beyond&quot; id=&quot;markdown-toc-on-the-variance-of-the-adaptive-learning-rate-and-beyond&quot;&gt;On the Variance of the Adaptive Learning Rate and Beyond&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#the-implicit-bias-of-depth-how-incremental-learning-drives-generalization&quot; id=&quot;markdown-toc-the-implicit-bias-of-depth-how-incremental-learning-drives-generalization&quot;&gt;The Implicit Bias of Depth: How Incremental Learning Drives Generalization&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#towards-neural-networks-that-provably-know-when-they-dont-know&quot; id=&quot;markdown-toc-towards-neural-networks-that-provably-know-when-they-dont-know&quot;&gt;Towards neural networks that provably know when they don’t know&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#truth-or-backpropaganda-an-empirical-investigation-of-deep-learning-theory&quot; id=&quot;markdown-toc-truth-or-backpropaganda-an-empirical-investigation-of-deep-learning-theory&quot;&gt;Truth or backpropaganda? An empirical investigation of deep learning theory&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#what-graph-neural-networks-cannot-learn-depth-vs-width&quot; id=&quot;markdown-toc-what-graph-neural-networks-cannot-learn-depth-vs-width&quot;&gt;What graph neural networks cannot learn: depth vs width&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#why-gradient-clipping-accelerates-training-a-theoretical-justification-for-adaptivity&quot; id=&quot;markdown-toc-why-gradient-clipping-accelerates-training-a-theoretical-justification-for-adaptivity&quot;&gt;Why Gradient Clipping Accelerates Training: A Theoretical Justification for Adaptivity&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#other---normalizing-flows-robustness-meta-learning-probabilistic-modelling&quot; id=&quot;markdown-toc-other---normalizing-flows-robustness-meta-learning-probabilistic-modelling&quot;&gt;Other - Normalizing Flows, Robustness, Meta-Learning, Probabilistic Modelling&lt;/a&gt;    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#invertible-models-and-normalizing-flows&quot; id=&quot;markdown-toc-invertible-models-and-normalizing-flows&quot;&gt;Invertible Models and Normalizing Flows&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#learning-to-balance-bayesian-meta-learning-for-imbalanced-and-out-of-distribution-tasks&quot; id=&quot;markdown-toc-learning-to-balance-bayesian-meta-learning-for-imbalanced-and-out-of-distribution-tasks&quot;&gt;Learning to Balance: Bayesian Meta-Learning for Imbalanced and Out-of-distribution Tasks&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#meta-dropout-learning-to-perturb-latent-features-for-generalization&quot; id=&quot;markdown-toc-meta-dropout-learning-to-perturb-latent-features-for-generalization&quot;&gt;Meta Dropout: Learning to Perturb Latent Features for Generalization&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#on-robustness-of-neural-ordinary-differential-equations&quot; id=&quot;markdown-toc-on-robustness-of-neural-ordinary-differential-equations&quot;&gt;On Robustness of Neural Ordinary Differential Equations&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#why-not-to-use-zero-imputation-correcting-sparsity-bias-in-training-neural-networks&quot; id=&quot;markdown-toc-why-not-to-use-zero-imputation-correcting-sparsity-bias-in-training-neural-networks&quot;&gt;Why Not to Use Zero Imputation? Correcting Sparsity Bias in Training Neural Networks&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;anomaly-detection&quot;&gt;Anomaly Detection&lt;/h2&gt;
&lt;h3 id=&quot;deep-semi-supervised-anomaly-detection&quot;&gt;&lt;a href=&quot;https://iclr.cc/virtual/poster_HkgH0TEYwH.html&quot;&gt;Deep Semi-Supervised Anomaly Detection&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Anomaly detection in the semi-supervised setting: additional labeled samples
are used to leverage expert knowledge.&lt;/p&gt;

&lt;p&gt;Goal: improve the decision boundary using the labeled data.&lt;/p&gt;

&lt;p&gt;A supervised classifier usually performs badly on unseen samples. Unsupervised
approaches cannot use the labeled examples to improve.&lt;/p&gt;

&lt;p&gt;The authors suggest the Deep SAD method, which is an extension of the Deep SVDD method.
The SVDD method is unsupervised and tries to map normal examples (non-anomalies)
into as compact a hypersphere as possible.&lt;/p&gt;

&lt;p&gt;(I wonder how this is different from maximum
likelihood training of a simple generative model such as a flow? In the end
we simply penalize the distance from the mode. Of course, this approach does
not penalize contraction / stretching of the space, so the network could map all
examples into a very small volume.)&lt;/p&gt;

&lt;p&gt;The SVDD approach tries to minimize the following objective:&lt;/p&gt;

\[\frac{1}{n} \sum_{i=1}^n || \phi(X_i; \theta) - c ||^2\]

&lt;p&gt;and the authors of this paper suggest including an additional term which
penalizes anomalous samples from being close to the center of the sphere.
The objective then becomes:&lt;/p&gt;

\[\frac{1}{n+m} \sum_{i=1}^n || \phi(X_i; \theta) - c ||^2 + \frac{\eta}{n+m} \sum_{j=1}^m(|| \phi(X_j; \theta) - c ||^2 )^{y_j}; \quad \eta &amp;gt; 0\]

&lt;p&gt;where $y_j=1$ for normal samples and $y_j=-1$ for anomalies.&lt;/p&gt;

&lt;p&gt;Makes perfect sense. Additionally, the authors benchmark their method and
present an information-theoretic framework for deep anomaly detection in
their paper.&lt;/p&gt;
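
&lt;p&gt;A minimal NumPy sketch of the objective above (my own paraphrase, not the authors’ code; the embedding $\phi$ and the data are placeholders):&lt;/p&gt;

```python
import numpy as np

def deep_sad_loss(phi_unlabeled, phi_labeled, y, c, eta=1.0):
    """Sketch of the Deep SAD objective: pull unlabeled (assumed normal)
    embeddings towards the center c, treat labeled samples via the
    exponent y_j: distance for normals (y=1), inverse distance for
    anomalies (y=-1), which pushes anomalies away from c."""
    d2_unlabeled = ((phi_unlabeled - c) ** 2).sum(axis=1)
    d2_labeled = ((phi_labeled - c) ** 2).sum(axis=1)
    n, m = len(d2_unlabeled), len(d2_labeled)
    # (||phi(x_j) - c||^2)^{y_j}
    labeled_terms = d2_labeled ** y
    return d2_unlabeled.sum() / (n + m) + eta * labeled_terms.sum() / (n + m)
```

Note how a labeled anomaly close to $c$ makes the loss blow up, while the same point far from $c$ is cheap.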

&lt;h3 id=&quot;input-complexity-and-out-of-distribution-detection-with-likelihood-based-generative-models&quot;&gt;&lt;a href=&quot;https://iclr.cc/virtual/poster_SyxIWpVYvr.html&quot;&gt;Input Complexity and Out-of-distribution Detection with Likelihood-based Generative Models&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The obvious strategy for detecting OOD samples with generative models (train
the model on data, label low-likelihood samples as OOD) does not work. Often the
out-of-distribution samples actually receive significantly higher likelihood
than samples from the training distribution.&lt;/p&gt;

&lt;p&gt;Intuition of paper: The input complexity of the dataset images influences
the likelihood.&lt;/p&gt;

&lt;p&gt;They analyse the normalized size of an image after lossless
compression, which serves as a proxy for its Kolmogorov complexity.
&lt;em&gt;The complexity of an image seems to correlate with the likelihood (more complex,
less likely).&lt;/em&gt; Most of the variance is explained by this.&lt;/p&gt;

&lt;p&gt;They suggest correcting the likelihood by subtracting the complexity estimate of
the image, and show that this can be interpreted as a likelihood ratio test
statistic, giving links to Bayesian model comparison, minimum description
length and Occam’s razor.&lt;/p&gt;
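
&lt;p&gt;The corrected score is easy to sketch; a toy version using zlib as the lossless compressor (the paper uses image codecs such as PNG and FLIF; all quantities here are in bits per dimension):&lt;/p&gt;

```python
import zlib
import numpy as np

def complexity_bits_per_dim(x_uint8):
    """Proxy for Kolmogorov complexity: size in bits of the losslessly
    compressed input, normalized by the number of dimensions."""
    raw = np.ascontiguousarray(x_uint8).tobytes()
    return 8.0 * len(zlib.compress(raw, 9)) / x_uint8.size

def ood_score(nll_bits_per_dim, x_uint8):
    """S(x) = -log p(x) - L(x): negative log-likelihood from a generative
    model minus the complexity estimate; higher values indicate OOD."""
    return nll_bits_per_dim - complexity_bits_per_dim(x_uint8)
```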

&lt;p&gt;Results indicate higher performance compared to many out-of-distribution
detection methods. The only method based on generative modelling which
outperforms the approach is WAIC, which relies on ensembles of generative models
and is thus significantly more costly to optimize.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Input complexity is the main culprit for failures of out-of-distribution
detection, and proxies of input complexity can be used to design corrected scores.&lt;/em&gt;&lt;/p&gt;

&lt;h2 id=&quot;generative-models&quot;&gt;Generative Models&lt;/h2&gt;
&lt;h3 id=&quot;understanding-the-limitations-of-conditional-generative-models&quot;&gt;&lt;a href=&quot;https://iclr.cc/virtual/poster_r1lPleBFvH.html&quot;&gt;Understanding the Limitations of Conditional Generative Models&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Main problem of conditional generative models: &lt;em&gt;Not robust&lt;/em&gt;, do not
recognise out of distribution samples.&lt;/p&gt;

&lt;p&gt;Ideally, in a good generative model undetected (high-density) adversarial
attacks should have a low volume.
Problem: even if adversarial attacks are close to this low-volume set, the
surrounding area would again be large due to the curse of dimensionality.
The theory of the paper suggests that one can always construct
an adversarial attack such that it would not be detected (?, have to check on
this in the paper).&lt;/p&gt;

&lt;p&gt;Results: the MNIST model is robust, while the CIFAR model fails to detect attacks.
–&amp;gt; Interpolated examples (in data space) on CIFAR have higher likelihood
than the samples themselves.
Explanation: classification and generation are very different things.
Classification only cares for very few aspects of the data, while generation
tries to model every single aspect of the data.&lt;/p&gt;

&lt;p&gt;The authors suggest that the class-unrelated entropy (background, nuisance
variables) in CIFAR is the reason for these models failing. They demonstrate this
with a new dataset where MNIST digits are placed on CIFAR backgrounds, thus
increasing the class-unrelated entropy. Then MNIST also fails similarly to
CIFAR in the previous example.&lt;/p&gt;

&lt;p&gt;Authors argue that much of the problem comes from the standard likelihood
objective which tries to model everything, while for classification one only
cares about selected bits of the data.&lt;/p&gt;

&lt;h3 id=&quot;your-classifier-is-secretly-an-energy-based-model-and-you-should-treat-it-like-one&quot;&gt;&lt;a href=&quot;https://iclr.cc/virtual/poster_Hkxzx0NtDB.html&quot;&gt;Your classifier is secretly an energy based model and you should treat it like one&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Criticism: progress in generative models is driven by likelihood and sample
quality, not by downstream tasks such as:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Out-of-distribution detection&lt;/li&gt;
  &lt;li&gt;Robust classification&lt;/li&gt;
  &lt;li&gt;Semi-supervised learning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, the generative-model-based solutions to these tasks usually lag
behind engineered solutions. &lt;em&gt;Why?&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Not flexible enough?&lt;/li&gt;
  &lt;li&gt;Architecture only good for generation, not classification etc.&lt;/li&gt;
  &lt;li&gt;Additional modelling constraints (such as invertibility for flows) make
models worse for discrimination&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Alternative: use energy-based models!&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Very flexible! We only need an energy function&lt;/li&gt;
  &lt;li&gt;But we cannot easily compute the normalizing constant, which leads to
some problems with regard to sampling and training&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;How to train?&lt;/em&gt;
While log likelihood does not have a nice form, the gradient of log p can be
written quite simply:&lt;/p&gt;

\[\frac{\partial \log p_{\theta}(x)}{\partial \theta} =
     E_{p_{\theta}(x')} [ \partial E_{\theta}(x') / \partial \theta ]
     -  \partial E_{\theta}(x) / \partial \theta\]

&lt;p&gt;where the last term is evaluated on the data and the first uses samples from
the model (which is also tricky, usually done using MCMC).&lt;/p&gt;

&lt;p&gt;Contribution:&lt;br /&gt;
Take classifier models and, instead of using the softmax, define
an energy-based model using the inputs to the softmax, where the energy
is the negative class-indexed preactivation:
$E_{\theta}(x, y) = -f_{\theta}(x)[y]$.
This EBM can be trained and later used to predict the class using basic
rules of probability, which simply results in a softmax. Further, we can sum
$y$ out, which also results in an EBM of the form
$E_{\theta}(x) = - LogSumExp_y(f(x)[y])$, which is a purely generative model.
&lt;img src=&quot;/assets/2020-05-19-ICLR-2020/clf_is_ebm_summary.png&quot; alt=&quot;Overview figure&quot; /&gt;&lt;/p&gt;
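
&lt;p&gt;The reinterpretation is compact enough to write down directly (a sketch with plain NumPy logits, not the authors’ implementation):&lt;/p&gt;

```python
import numpy as np

def logsumexp(a):
    m = a.max()
    return m + np.log(np.exp(a - m).sum())

def joint_energy(logits, y):
    """E(x, y) = -f(x)[y]: joint EBM defined from the classifier's
    pre-softmax outputs (logits)."""
    return -logits[y]

def marginal_energy(logits):
    """E(x) = -LogSumExp_y f(x)[y]: summing y out yields a purely
    generative EBM over x."""
    return -logsumexp(logits)

def class_posterior(logits):
    """p(y|x) recovered from the two energies is exactly the softmax."""
    return np.exp(logits - logsumexp(logits))
```

Dividing the two unnormalized densities, $p(y|x) = e^{-E(x,y)} / e^{-E(x)}$, cancels the intractable normalizing constant, which is why classification stays cheap.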

&lt;p&gt;&lt;em&gt;What to do with this insight? - Experiments&lt;/em&gt;&lt;br /&gt;
They train the factorized distribution
$\log p(x, y) = \log p(x) + \log p(y|x)$ to ensure unbiased training of the
classifier, as it does not require sampling from the joint distribution,
which could bias the classifier. (This is reflected in the results: training
the joint distribution directly gives poor classification performance.) This
hybrid model is the only one which yields classification
performance comparable to SOTA. Further, JEM improves calibration of the models
significantly compared to baseline approaches. The JEM model is also
better at recognizing out-of-distribution samples and adversarial examples.
This is done using a trick where they seed the MCMC chain at the position of
the adversarial sample and execute a few MCMC steps with respect to the learned
data distribution, further improving adversarial robustness significantly.&lt;/p&gt;

&lt;p&gt;Nevertheless, training is still very unstable due to MCMC sampling. Learning
is also hard to diagnose as there is no clear loss definition.&lt;/p&gt;

&lt;h2 id=&quot;gradient-estimation&quot;&gt;Gradient Estimation&lt;/h2&gt;
&lt;h3 id=&quot;estimating-gradients-for-discrete-random-variables-by-sampling-without-replacement&quot;&gt;&lt;a href=&quot;https://iclr.cc/virtual/poster_rklEj2EFvB.html&quot;&gt;Estimating Gradients for Discrete Random Variables by Sampling without Replacement&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Discrete variables do not allow the computation of a gradient, which is
needed for gradient-based optimization. This problem is usually mitigated by
one of two approaches:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Relaxation of the discrete distribution to a continuous distribution
(possibly combined with sampling), such as the Gumbel-Softmax or Concrete
methods. These unfortunately usually have a high bias.&lt;/li&gt;
  &lt;li&gt;Sampling and deriving stochastic gradients, usually based on REINFORCE.
REINFORCE gradients usually have high variance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Authors say that REINFORCE is basically a “trick” for pulling the gradient
operation inside the expectation:&lt;/p&gt;

\[\begin{aligned}
\nabla_\theta E_{p_\theta(x)}[f(x)] &amp;amp;= E_{p_\theta(x)} [ \nabla_\theta \log p_\theta(x) f(x) ] \\\\
&amp;amp;\approx \frac{1}{k} \sum_{i=1}^k  \nabla_\theta \log p_\theta (x_i) f(x_i)
\end{aligned}\]

&lt;p&gt;Further, one can use the average of the other samples in order to reduce the
variance of the estimate (REINFORCE with baseline, Mnih &amp;amp; Rezende 2016):&lt;/p&gt;

\[\nabla_\theta E_{p_\theta(x)}[f(x)] \approx \frac{1}{k} \sum_{i=1}^k  \nabla_\theta \log p_\theta (x_i)
\left( f(x_i) - \frac{\sum_{j \neq i} f(x_j)}{k - 1} \right)\]
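
&lt;p&gt;A sketch of this leave-one-out baseline estimator for a categorical distribution parameterized by logits (the names and the toy setup are mine):&lt;/p&gt;

```python
import numpy as np

def reinforce_loo_grad(logits, f, k=4, rng=np.random.default_rng(0)):
    """REINFORCE gradient estimate w.r.t. categorical logits, using the
    mean of the other k-1 samples as a baseline (Mnih and Rezende, 2016)."""
    p = np.exp(logits - logits.max())
    p = p / p.sum()
    xs = rng.choice(len(p), size=k, p=p)  # k i.i.d. samples (with replacement)
    fx = np.array([f(x) for x in xs])
    baseline = (fx.sum() - fx) / (k - 1)  # leave-one-out mean of the others
    grad = np.zeros_like(p)
    for x, adv in zip(xs, fx - baseline):
        # gradient of the categorical log-prob w.r.t. logits: one-hot(x) - p
        grad += (np.eye(len(p))[x] - p) * adv
    return grad / k
```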

&lt;p&gt;In contrast to previous work the authors &lt;em&gt;suggest to sample without
replacement&lt;/em&gt;, as duplicate samples are uninformative in a deterministic
setting. This leads to a sequence of ordered samples drawn from the distribution
such that&lt;/p&gt;

\[p(B) = p(b_1) \times \frac{p(b_2)}{1-p(b_1)} \times \frac{p(b_3)}{1 - p(b_1) - p(b_2)}\]

&lt;p&gt;In their paper, they derive a generic estimator for $E[f(x)]$ by
Rao-Blackwellizing the crude single sample estimator (which is based on
a single Monte Carlo sample). They call this estimator the unordered set
estimator:&lt;/p&gt;

\[E_{p_\theta(x)}[f(x)] = E_{p_\theta(S^k)}[e^{US}(S^k)] = E_{p_\theta(S^k)} \left[\sum_{s \in S^k} p(s) R(S^k, s) f(s)\right]\]

&lt;p&gt;where $R(S^k, s) = \frac{p^{D \setminus \{ s \} } (S^k \setminus \{ s \} )}{p(S^k)}$
is the &lt;em&gt;leave-one-out ratio&lt;/em&gt; and $S^k$ is an unordered sample without replacement.
They then apply REINFORCE to the derived estimator for the computation of
gradients which gives rise to the unordered set policy gradient estimator:&lt;/p&gt;

\[\begin{aligned}
&amp;amp; E_{p_\theta(x)} [ \nabla_\theta \log p_\theta (x) f(x) ] \\\\
&amp;amp;= E_{p_\theta(S^k)} [ e^{USPG}(S^k)] \\\\
&amp;amp;= E_{p_\theta(S^k)} \left[ \sum_{s \in S^k} p_\theta(s) R(S^k, s) \nabla_\theta \log p_\theta(s) f(s) \right] \\\\
&amp;amp;= E_{p_\theta(S^k)} \left[ \sum_{s \in S^k} R(S^k, s) \nabla_\theta p_\theta(s) f(s) \right]
\end{aligned}\]

&lt;p&gt;where the last step can be derived using the log derivative trick.&lt;/p&gt;

&lt;p&gt;Further, the authors use an approach similar to Mnih &amp;amp; Rezende (2016) in
order to reduce the variance by subtracting a baseline based on the other
samples. This is not entirely trivial though, as the samples are not
independent and thus a correction needs to be applied.&lt;/p&gt;

&lt;p&gt;Experiments:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Synthetic example: Shows lowest gradient variance compared to all other
methods&lt;/li&gt;
  &lt;li&gt;Policy for travelling salesman problem: Comparable to biased approaches
and out-performs all unbiased approaches&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The devised estimator is a low variance unbiased estimator which can be used
as a replacement to Gumbel-Softmax.&lt;/p&gt;

&lt;h3 id=&quot;sumo-unbiased-estimation-of-log-marginal-probability-for-latent-variable-models&quot;&gt;&lt;a href=&quot;https://iclr.cc/virtual/poster_SylkYeHtwr.html&quot;&gt;SUMO: Unbiased Estimation of Log Marginal Probability for Latent Variable Models&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Suggest a new gradient estimator for marginal probabilities.&lt;/p&gt;

&lt;p&gt;Maximum likelihood estimation for latent variable models (LVMs) requires
unbiased estimates of $\nabla_\theta \log p_\theta(x)$, which are not
directly available. Instead, research has focused on developing lower bounds
on the log marginal probability $\log p_\theta(x)$, such as the ELBO or IWAE,
and on optimizing the model parameters with respect to these lower bounds.&lt;/p&gt;

&lt;p&gt;The approach devised in this paper is composed of two components:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Importance weighted bounds (IWAE): Here the idea is to use multiple
samples instead of a single sample as suggested with the ELBO. This
results in an increasingly tighter bound on the true marginal
log-likelihood, such that&lt;/p&gt;

\[\begin{gather}
ELBO \leq E[IWAE_1(x)] \leq E[IWAE_2(x)] \leq \dots \leq \log p_\theta(x) \\\\
\log p_\theta(x) = \lim_{K \rightarrow \infty} E[IWAE_K(x)]
\end{gather}\]

    &lt;p&gt;where $K$ denotes the number of samples used to compute the lower bound.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Russian roulette estimators: Estimator used to compute the value of an
infinite series&lt;/p&gt;

\[\sum^\infty_{k=1} \Delta_k = E_{K \sim p(K)} \left[ \sum^K_{k=1} \frac{\Delta_k}{P(\mathcal{K} \geq k)} \right]\]

    &lt;p&gt;which basically weighs each term by the probability of sampling
a larger $k$. This holds if the series converges absolutely, s.th.
$\sum_{k=1}^\infty |\Delta_k| \lt \infty$.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The authors suggest SUMO (Stochastically Unbiased Marginalization
Objective), by combining the IWAE lower bound with the Russian Roulette
estimator:&lt;/p&gt;

\[\begin{gather}
\Delta_k (x) = IWAE_{k+1}(x) - IWAE_k(x) \\\\
SUMO(x) = IWAE_1(x) + \sum^K_{k=1} \frac{\Delta_k (x)}{P(\mathcal{K} \geq k)}
\end{gather}\]

&lt;p&gt;where $K \sim p(K)$. This objective is unbiased, such that
$E[SUMO(x)] = \log p_\theta(x)$, and under some conditions (that the
gradient of SUMO is bounded and differentiable everywhere)
$E[\nabla_\theta SUMO(x)] = \nabla_\theta E[SUMO(x)] = \nabla_\theta \log p_\theta (x)$.
The choice of $p(K)$ determines the variance of the estimator and the compute
cost.&lt;/p&gt;
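
&lt;p&gt;The Russian roulette component on its own is only a few lines; a sketch with a geometric $p(K)$ (the truncation distribution is my choice for illustration, not prescribed by the paper):&lt;/p&gt;

```python
import numpy as np

def russian_roulette(delta, p_stop=0.5, rng=None):
    """Unbiased single-sample estimate of the infinite series
    sum_{k=1}^inf delta(k): draw K ~ Geometric(p_stop) and reweight
    each kept term by 1 / P(K >= k)."""
    rng = rng or np.random.default_rng()
    K = rng.geometric(p_stop)  # P(K = k) = p_stop * (1 - p_stop)^(k-1)
    total = 0.0
    for k in range(1, K + 1):
        survival = (1.0 - p_stop) ** (k - 1)  # P(K >= k)
        total += delta(k) / survival
    return total
```

Plugging $\Delta_k(x) = IWAE_{k+1}(x) - IWAE_k(x)$ into this estimator, plus the $IWAE_1$ term, gives SUMO.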

&lt;p&gt;Applications:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Minimizing $\log p_\theta (x)$, which occurs in reverse-KL objectives&lt;/li&gt;
  &lt;li&gt;As an unbiased score function for examples in HMC and REINFORCE gradient
estimation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Results: Better test NLL, more stable in entropy maximization&lt;/p&gt;

&lt;h2 id=&quot;pooling-and-set-functions&quot;&gt;Pooling and Set Functions&lt;/h2&gt;
&lt;h3 id=&quot;fspool-learning-set-representations-with-featurewise-sort-pooling&quot;&gt;&lt;a href=&quot;https://iclr.cc/virtual/poster_HJgBA2VYwH.html&quot;&gt;FSPool: Learning Set Representations with Featurewise Sort Pooling&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Discuss architectures for computing deep set representations.&lt;/p&gt;

&lt;p&gt;Issue of &lt;em&gt;jump discontinuity&lt;/em&gt;: when we rotate the input set elements,
compute the representation of the set, and decode it into a set again,
there comes a point where each decoded set element jumps back by one
position, such that it is again in the same position (the visualization in
their talk is quite good and easier to understand than my explanation here).&lt;/p&gt;

&lt;p&gt;With very many points the network would simply give up and just predict
a constant output although the input is being rotated.&lt;/p&gt;

&lt;p&gt;Suggest FSPool:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Sort inputs (which is ok as it is a set)&lt;/li&gt;
  &lt;li&gt;Multiply with a set of learned weights. As these are not always the same
size, the weights are interpreted as a piecewise linear function between
0 and 1, and the values used for the dot product are evaluated on an
evenly spaced grid between 0 and 1 such that the correct number of
weights for any size of input can be obtained.&lt;/li&gt;
  &lt;li&gt;This is done for each feature individually. (Which seems to result in
loss of information regarding the joint distribution?)&lt;/li&gt;
&lt;/ol&gt;
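
&lt;p&gt;The three steps above can be sketched in NumPy (shapes and names are my own; in the paper the weight control points are learned):&lt;/p&gt;

```python
import numpy as np

def fspool(X, weight_points):
    """Featurewise sort pooling (sketch).
    X: (n, d) set of n elements with d features.
    weight_points: (k, d) control points of a piecewise linear weight
    function over [0, 1], one function per feature."""
    n, d = X.shape
    k = weight_points.shape[0]
    # 1. sort each feature independently (valid since the input is a set)
    X_sorted = np.sort(X, axis=0)
    # 2. evaluate the weight functions on an evenly spaced grid of n points
    grid = np.linspace(0.0, 1.0, n)
    anchors = np.linspace(0.0, 1.0, k)
    W = np.stack([np.interp(grid, anchors, weight_points[:, j]) for j in range(d)], axis=1)
    # 3. featurewise dot product between sorted values and weights
    return (X_sorted * W).sum(axis=0)  # (d,)
```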

&lt;p&gt;This helps to mitigate the jump discontinuity issue for set
autoencoders, as the learnt permutation can simply be inverted in the decoder
such that no jump needs to occur and the output always corresponds to
the matching input. This then also removes the necessity of matching input
and output elements.&lt;/p&gt;

&lt;h3 id=&quot;on-universal-equivariant-set-networks&quot;&gt;&lt;a href=&quot;https://iclr.cc/virtual/poster_HkxTwkrKDB.html&quot;&gt;On Universal Equivariant Set Networks&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Regarding approximation power of deep equivariant networks.&lt;/p&gt;

&lt;p&gt;DeepSets:&lt;/p&gt;

\[X \mapsto XA + \mathbf{1}\mathbf{1}^\top XB + 1 c^\top\]

&lt;p&gt;Authors call $X \mapsto \mathbf{1}\mathbf{1}^\top XB$ a linear transmitting
layer. Further they note that setting $B=0$ for all layers results in
a model which simply applies an MLP on each row of X, and refer to it as
PointNet.&lt;/p&gt;
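
&lt;p&gt;The layer above, written out in NumPy (a sketch; $A$, $B$ and $c$ would be learned parameters):&lt;/p&gt;

```python
import numpy as np

def deepsets_layer(X, A, B, c):
    """Linear permutation-equivariant DeepSets layer:
    X -> X A + 1 1^T X B + 1 c^T for X of shape (n, d_in).
    Setting B = 0 removes the transmitting term and yields PointNet."""
    n = X.shape[0]
    ones = np.ones((n, 1))
    return X @ A + ones @ (ones.T @ X) @ B + ones @ c.T
```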

&lt;p&gt;Authors derive the requirements for universal approximation of equivariant
functions on the unit cube. This is not the case for PointNet, as it cannot approximate the simple function $x \mapsto 1^\top x 1$.&lt;/p&gt;

&lt;p&gt;Main theorem: PointNet is not equivariant universal, but PointNet with
a single linear transmitting layer is. In particular, the DeepSets model is
equivariant universal.&lt;/p&gt;

&lt;p&gt;Proof:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Stone-Weierstrass, any continuous equivariant function can be
approximated by equivariant polynomial on the unit cube&lt;/li&gt;
  &lt;li&gt;Construct model with linear transmitting layer that approximates any
permutation equivariant polynomial&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The suggested model consists of two PointNets, and a single linear
transmitting layer.&lt;/p&gt;

&lt;h3 id=&quot;structpool-structured-graph-pooling-via-conditional-random-fields&quot;&gt;&lt;a href=&quot;https://iclr.cc/virtual/poster_BJxg_hVtwH.html&quot;&gt;StructPool: Structured Graph Pooling via Conditional Random Fields&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Why is graph pooling challenging? There is no fixed notion of locality in
a graph, as the number of neighboring nodes is not fixed. (Slightly confused by
this statement, are neighbors not the definition of locality?)&lt;/p&gt;

&lt;p&gt;Until now there have been two pooling approaches:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Selection of important nodes via node sampling, which could lose node
information if a node is not selected.&lt;/li&gt;
  &lt;li&gt;Graph pooling via clustering, where clustered nodes then
represent a new node in the next iteration.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Other work, DiffPool, suggests using a GCN to predict an assignment matrix
which defines which nodes are merged. This only uses the node features and
does not incorporate structural information.&lt;/p&gt;

&lt;p&gt;The authors suggest StructPool, where the pooling depends on the node
features and the high-order structural relationships in the graph. They
formulate the assignment problem as a conditional random field where the
goal is to minimize the Gibbs energy. Basically, they add a pairwise
energy term (derived from an attention mechanism) over pairs of nodes
within l-hop distance of each other to the unary energy of the
conditional random field. The two energies are combined and then used to
compute the assignment matrix using the softmax operation.&lt;/p&gt;

&lt;p&gt;The proposed method shows improvement over other pooling techniques on D&amp;amp;D,
COLLAB, PROTEINS, IMDB-B and IMDB-M, whereas it has slightly lower performance
on ENZYMES.&lt;/p&gt;

&lt;h2 id=&quot;representation-learning&quot;&gt;Representation Learning&lt;/h2&gt;
&lt;h3 id=&quot;disentanglement-by-nonlinear-ica-with-general-incompressible-flow-networks-gin&quot;&gt;&lt;a href=&quot;https://iclr.cc/virtual/poster_rygeHgSFDH.html&quot;&gt;Disentanglement by Nonlinear ICA with General Incompressible-flow Networks (GIN)&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Non-linear ICA theory: Can recover non-linear projections of conditionally
independent distributions (in latent space) in the data space. This requires
the conditionally independent distributions to belong to the exponential
family. Given some additional requirements, the theory implies that the
sufficient statistics of the true generating latent space are a linear
transformation of the recovered sufficient statistics in the latent space.&lt;/p&gt;

&lt;p&gt;For disentanglement exactly one variable of the reconstructed latent space
should be associated with one in the true latent space. This is equivalent
to requiring sparsity in the transformation matrix.&lt;/p&gt;

&lt;p&gt;The authors show that this is given for Gaussian latent spaces and claim that
additional latent dimensions in the recovered latent space will solely
encode noise. This gives rise to a simultaneous disentanglement and
dimensionality discovery mechanism.&lt;/p&gt;

&lt;p&gt;They suggest a method based on volume preserving flows (which are thus
incompressible). This is implemented based on RealNVP, where the
(pre-exponentiated) scale of the last component is set to the negative of
the sum of all previous components, enforcing the same volume. The authors
argue, that this constraint makes the standard deviation in the latent space
remain meaningful, as variability can only be shifted between dimensions but
not increased.&lt;/p&gt;
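
&lt;p&gt;The volume-preserving trick is a one-liner on top of a RealNVP-style coupling (a sketch of the constraint only, not the full flow):&lt;/p&gt;

```python
import numpy as np

def gin_log_scales(raw_log_s):
    """GIN constraint: set the last (pre-exponentiated) log-scale to the
    negative sum of all previous ones, so the log-scales sum to zero and
    the coupling layer has unit Jacobian determinant (incompressible)."""
    return np.concatenate([raw_log_s, [-raw_log_s.sum()]])
```

Variability can thus be shifted between dimensions, but the total volume cannot change.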

&lt;p&gt;They show that the spectrum of standard deviations shows multiple regimes
corresponding to global, local and noise on EMNIST to support this claim.&lt;/p&gt;

&lt;p&gt;Experiments on artificial data show that the proposed approach recovers the
true latent space if the distributions sufficiently overlap. Further, they find
very convincing latent dimensions for EMNIST (more realistic than anything
I have seen to date).&lt;/p&gt;

&lt;h3 id=&quot;infograph-unsupervised-and-semi-supervised-graph-level-representation-learning-via-mutual-information-maximization&quot;&gt;&lt;a href=&quot;https://iclr.cc/virtual/poster_r1lfF2NYvH.html&quot;&gt;InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Prior work: graph kernels and Graph2vec require manual construction of features
of importance. Aim of this work: automated discovery of these features.&lt;/p&gt;

&lt;p&gt;InfoGraph: a model which tries to maximize the MI between patch
representations (subgraphs) and global representations (whole graph). This
should create a global representation of the graph which preserves aspects
of all patches at all scales. The approach is competitive in classification.&lt;/p&gt;

&lt;p&gt;Extension to semi-supervised scenarios:
Use student-teacher like architecture: Student model learns in a supervised
manner, whereas the teacher learns on all unlabeled data using the
previously devised InfoGraph approach. In order for the student model to
learn from the teacher model, they propose to maximize the mutual
information of intermediate layers of the GNN and the final representation.
Thus the student is slightly biased to exploit similar structures as the
unsupervised teacher. This leads to better performance than simply combining
supervised and unsupervised loss for a single model.&lt;/p&gt;

&lt;h3 id=&quot;mutual-information-gradient-estimation-for-representation-learning&quot;&gt;&lt;a href=&quot;https://iclr.cc/virtual/poster_ByxaUgrFvH.html&quot;&gt;Mutual Information Gradient Estimation for Representation Learning&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Problem: High bias or high variance in existing approaches for MI.&lt;/p&gt;

&lt;p&gt;The hypothesis of this work is that, while the approximated loss landscape can
potentially be very noisy, estimating only the gradient, without prior
computation of the loss, can lead to much lower noise and variance.&lt;/p&gt;

&lt;p&gt;They derive the gradient of the MI and use the reparameterization trick to make
the computation tractable, obtaining $\nabla_z \log q(E_{\phi}(x))$ and
$\nabla_{x,z} \log q(x, E_{\phi}(x))$ via score estimation.&lt;/p&gt;

&lt;p&gt;Use Spectral Stein Gradient estimator for score estimation of implicit
distributions. Reduce the complexity of the estimation by applying a &lt;em&gt;random
projection&lt;/em&gt; into a lower dimensional space. This reduces the computational
complexity of computing the RBF kernel of the Spectral Stein Gradient
Estimator.&lt;/p&gt;

&lt;p&gt;The devised approach outperforms other mutual information maximization
techniques.&lt;/p&gt;

&lt;h3 id=&quot;on-mutual-information-maximization-for-representation-learning&quot;&gt;&lt;a href=&quot;https://iclr.cc/virtual/poster_rkxoh24FPH.html&quot;&gt;On Mutual Information Maximization for Representation Learning&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Many representation learning approaches are based on the InfoMax principle,
where a good representation should maximize the mutual information between
the data and the learnt representation (Linsker 1988).&lt;/p&gt;

&lt;p&gt;Recently, novel lower bounds on the MI and modern CNN
architectures have revived the approach. However:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;MI is hard to estimate&lt;/li&gt;
  &lt;li&gt;MI is invariant under bijections&lt;/li&gt;
  &lt;li&gt;Maximizing MI does not necessarily yield good clustering representations&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;So why do these approaches work so well?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Modern approaches do not maximize MI between the data and the representation,
but between different views of the same input (higher-level and
lower-level aggregations), which is a lower bound on the original InfoMax
objective. Thus, if these views only share low-level information, such as pixel
noise, they would not yield high mutual information, whereas high-level
features such as “catness” would yield high mutual information on different
crops of a cat image.&lt;/p&gt;

&lt;p&gt;Experiments: Maximize MI between bottom and top half of image and evaluate
performance using linear classifier.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Usage of bijective encoders: &lt;em&gt;These preserve mutual information
completely!&lt;/em&gt; Thus &lt;em&gt;the true MI between the segments actually remains
the same during training.&lt;/em&gt; Nevertheless, the lower bounds of the
estimators increase slightly during training and the classification
accuracy of the derived representations increases strongly. Thus the
&lt;em&gt;estimator favors “good representations”&lt;/em&gt; even though every bijective
encoder already maximizes the MI!&lt;/li&gt;
  &lt;li&gt;Encoders which may or may not be invertible: MLPs with skip
connections initialized to the identity mapping. Here the estimators
favor hard-to-invert mappings even though the initialization (identity)
maximizes the mutual information. The estimators thus bias towards good
representations for classification, but tend towards hard-to-invert
mappings which reduce the true mutual information!&lt;/li&gt;
  &lt;li&gt;Impact of encoder architecture: &lt;em&gt;Different architectures with the same MI
estimator values lead to very different performance in terms of
classification.&lt;/em&gt; Thus the value of the estimator is insufficient to
explain performance. Is the inductive bias of the architecture responsible
for good performance?&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;seq2seq-models&quot;&gt;Seq2Seq models&lt;/h2&gt;
&lt;h3 id=&quot;are-transformers-universal-approximators-of-sequence-to-sequence-functions&quot;&gt;&lt;a href=&quot;https://iclr.cc/virtual/poster_ByxRM0Ntvr.html&quot;&gt;Are Transformers universal approximators of sequence-to-sequence functions?&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;TLDR: Yes.&lt;/p&gt;

&lt;p&gt;Maybe not, as there are several structural constraints which could potentially
limit expressive power: all tokens experience the same transformation, and only
pairwise interactions between tokens are possible.&lt;/p&gt;

&lt;p&gt;The paper shows that there always exists a transformer network with small width
and unlimited depth that can approximate any permutation-equivariant
sequence-to-sequence function arbitrarily accurately.&lt;/p&gt;

&lt;p&gt;What about positional encodings? These remove the restriction of permutation
equivariance and allow a transformer to approximate any continuous seq-2-seq
function.&lt;/p&gt;

&lt;p&gt;Further, the authors show that hybrid architectures, which for
example include convolutional layers in between attention layers, actually
can improve performance.&lt;/p&gt;

&lt;h3 id=&quot;mogrifier-lstm&quot;&gt;&lt;a href=&quot;https://iclr.cc/virtual/poster_SJe5P6EYvS.html&quot;&gt;Mogrifier LSTM&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Core modification to LSTM: Gate the hidden state using input, and gate the
input using the hidden state in an alternating fashion. The input and hidden
state are then fed into the LSTM which then gives rise to a new hidden
state. This leads to better performance than the baseline LSTM model and
pushes the new LSTM closer to transformer networks in terms of performance.&lt;/p&gt;
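
&lt;p&gt;The alternating gating, stripped down (a sketch; in the paper the matrices $Q$ and $R$ are learned and may be low-rank):&lt;/p&gt;

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mogrify(x, h, Q, R, rounds=5):
    """Mogrifier preprocessing: alternately gate the input x with the
    hidden state h and vice versa; the results are then fed to an
    ordinary LSTM cell."""
    for i in range(1, rounds + 1):
        if i % 2 == 1:
            x = 2.0 * sigmoid(Q @ h) * x  # odd rounds modulate the input
        else:
            h = 2.0 * sigmoid(R @ x) * h  # even rounds modulate the hidden state
    return x, h
```

The factor 2 centers the gate at the identity, so the untrained layer initially leaves both vectors roughly unchanged.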

&lt;p&gt;Why does it work though? Potential explanations:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Contextualized embeddings: The procedure could lead to an embedding which
accounts for the actual context of the word and not only its mean context,
as in word embeddings. Experiments indicate that this is not
sufficient for explaining the performance on character-level tasks and
synthetic datasets.&lt;/li&gt;
  &lt;li&gt;Multiplicative interactions: Not really clear how this would improve
performance.&lt;/li&gt;
  &lt;li&gt;Many more&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of them really explains the Mogrifier LSTM. The authors performed
several experiments and none of them was really conclusive; there are only
indications. For example, the Mogrifier performs better on a copy task if the
sequences are long. Further, this performance gap becomes larger as the
vocabulary size of the input sequences increases.&lt;/p&gt;

&lt;h3 id=&quot;reformer-the-efficient-transformer&quot;&gt;&lt;a href=&quot;https://iclr.cc/virtual/poster_rkgNKkHtvB.html&quot;&gt;Reformer: The Efficient Transformer&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Combine two techniques to make Transformers more memory efficient and
scalable with respect to training time:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;RevNets (invertible version of ResNet): As the computation can be
inverted, it is not necessary to store all downstream activations for
a back-propagation pass. They can be dynamically recomputed when needed.
The reversibility of the connections is enabled by only applying the layer
to half of the input and adding its output to the other half. This is
a bit similar to the strategy of RealNVP. Interestingly, this does not
reduce performance.&lt;/li&gt;
  &lt;li&gt;Chunking of computations through the FeedForward NNs: Prevents having to
store all intermediate activations at once.&lt;/li&gt;
  &lt;li&gt;While one could in theory also chunk the computation of the attention
mechanism, the quadratic scaling of attention will lead to severe issues
with respect to speed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tackling the attention computation and making it scalable:
Main issue: while we compute all values in the pre-attention matrix, the
softmax converts these into a matrix of the same size where many values are
very close to zero. &lt;em&gt;The attention matrix is sparse.&lt;/em&gt; How can we use the
sparsity? &lt;strong&gt;Use a variant of locality-sensitive hashing:&lt;/strong&gt; it allows
sorting vectors with a high dot product into the same buckets. Thus one can
simply compute the attention within each bucket and already cover most of the
variance. They use shared QK attention (where the query and key are the same,
which apparently is as powerful as regular attention), then bucket QK via
LSH and sort according to the bucket. In order to exploit parallelism, they
chunk the sorted array into fixed sizes and allow attention within each
chunk and to the previous chunk if the bucket ids match, covering the
case where chunking splits a bucket. In order to avoid problems with the
probabilistic nature of LSH, the process is repeated with multiple hash
functions.&lt;/p&gt;
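
&lt;p&gt;The bucketing step can be sketched with angular LSH via random projections (a toy version; the Reformer uses random rotations with multiple hashing rounds):&lt;/p&gt;

```python
import numpy as np

def lsh_bucket_ids(vectors, n_buckets, rng=None):
    """Angular locality-sensitive hashing: project onto random directions
    and take the argmax over the projections and their negations, so
    vectors with a high dot product tend to land in the same bucket."""
    rng = rng or np.random.default_rng(0)
    d = vectors.shape[1]
    R = rng.normal(size=(d, n_buckets // 2))
    proj = vectors @ R
    return np.argmax(np.concatenate([proj, -proj], axis=1), axis=1)
```

Attention is then computed only among vectors sharing a bucket id instead of over all pairs.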

&lt;p&gt;The results indicate that with more hash functions, the model converges to
the performance of a full attention model.&lt;/p&gt;

&lt;h2 id=&quot;understanding-deep-learning&quot;&gt;Understanding Deep Learning&lt;/h2&gt;
&lt;h3 id=&quot;four-things-everyone-should-know-to-improve-batch-normalization&quot;&gt;&lt;a href=&quot;https://iclr.cc/virtual/poster_HJx8HANFDH.html&quot;&gt;Four Things Everyone Should Know to Improve Batch Normalization&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Things that are wrong with BatchNorm:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Inference Example Weighting:
During training, the influence of the current instance on the batch
statistics is still $\frac{1}{B}$, whereas during testing the instance
does not contribute at all, as only the moving averages are used for
computing the normalization. The authors suggest reparametrizing the
mean and standard deviation of batch norm at test time to reintroduce the
dependency on the instance:&lt;/p&gt;

\[\begin{aligned}
\mu_i &amp;amp;= \alpha E[x_i] + (1-\alpha) m_x\\\\
\sigma_i^2 &amp;amp;= (\alpha E[x_i^2] + (1-\alpha)m_{x^2}) - \mu_i^2
\end{aligned}\]

    &lt;p&gt;where $\alpha$ is a hyperparameter which can be tuned on the validation
data after training. This can also be applied to a model that has already been
trained with regular batch norm and was shown to improve performance at
test time.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;Ghost Batch Norm:
Originally developed for multi-GPU and large-batch training scenarios,
Ghost Batch Norm normalizes across subsets of each batch instead of the
complete batch. The authors show that this increases performance even for
single-GPU, medium-batch-size training (presumably by inducing additional
noise during training).&lt;/li&gt;
  &lt;li&gt;Batch Normalization and Weight Decay: Applying weight decay to the shift
and scale parameters of batch norm is unstudied. The authors show a slight
improvement in performance. Yet, for this to be the case it is necessary
that the path from the BN layer to the output does not pass through additional BN
layers, as these would amplify the effect with increasing depth.&lt;/li&gt;
  &lt;li&gt;Generalizing Batch and Group Norm:
For the small-batch regime, where the application of vanilla batch norm is not
possible, the authors suggest normalizing over both channel groups and examples.&lt;/li&gt;
&lt;/ol&gt;
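&lt;p&gt;The first technique can be sketched in a few lines. This is an illustrative reimplementation for a single instance vector, not the authors’ code; moving_mean and moving_sq_mean stand for the training-time moving averages of the first and second moments:&lt;/p&gt;

```python
import numpy as np

def bn_inference(x, moving_mean, moving_sq_mean, alpha, eps=1e-5):
    """Test-time batch norm that reintroduces the current instance:
    interpolate instance statistics with the moving averages using the
    validation-tuned hyperparameter alpha."""
    mu = alpha * x + (1 - alpha) * moving_mean
    var = (alpha * x**2 + (1 - alpha) * moving_sq_mean) - mu**2
    return (x - mu) / np.sqrt(np.maximum(var, 0.0) + eps)

x = np.array([1.0, -2.0])
mean, sq_mean = np.zeros(2), np.ones(2)
# alpha = 0 recovers standard inference-time batch norm ...
assert np.allclose(bn_inference(x, mean, sq_mean, alpha=0.0), x, atol=1e-3)
# ... while alpha = 1 normalizes the instance purely against itself
assert np.allclose(bn_inference(x, mean, sq_mean, alpha=1.0), 0.0)
```

&lt;p&gt;Tuning $\alpha$ between these two extremes is exactly what the paper proposes.&lt;/p&gt;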

&lt;p&gt;The authors show that combining all the above proposed techniques can lead
to an improvement of up to 6% in some scenarios. In general, I think the
first approach is the most relevant and interesting.&lt;/p&gt;

&lt;h3 id=&quot;on-the-variance-of-the-adaptive-learning-rate-and-beyond&quot;&gt;&lt;a href=&quot;https://iclr.cc/virtual/poster_rkgz2aEKDr.html&quot;&gt;On the Variance of the Adaptive Learning Rate and Beyond&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Warmup: Linear increase on learning rate, seems to be critical for some
learning tasks (such as in the case of Transformer models).&lt;/p&gt;

&lt;p&gt;Experiments showed that &lt;em&gt;without warmup, the gradient distribution gets
distorted towards small values&lt;/em&gt;. This happens at the very beginning of
training, within the first 10 updates. With warmup this distortion doesn’t
occur and the gradient distribution remains largely the same compared to
the very beginning of training.&lt;/p&gt;

&lt;p&gt;Why does warmup improve convergence and mitigate this effect? The authors
suggest that the adaptive learning rate has very high variance at the
beginning of training, due to the lack of samples used for computing the
moving averages.&lt;/p&gt;

&lt;p&gt;They set up two control experiments to verify whether this is actually the
cause of the problems:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Adam-2k, which provides Adam with an additional 2k samples for estimating
the variance of the gradient (without any updates to the weights!). This
leads to a learning curve extremely similar to using warmup.&lt;/li&gt;
  &lt;li&gt;Adam-eps: Increase the value of epsilon in the Adam implementation. This
term is usually added to the square root of the variance estimate; by
increasing eps, the influence of the estimated variance becomes smaller.
This leads to better convergence than no warmup at all, but still shows
some difficulties during training.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The authors suggest a rectification term to mitigate the issue of high
variance in the adaptive learning rate, which basically deactivates adaptive
learning rates when the variance estimate would diverge.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Experiments:&lt;/em&gt; Astonishingly, the results indicate that this corrected Adam
implementation, RAdam, is significantly more robust to the selection of the
learning rate. Further, in contrast to warmup, no parameter needs to be tuned
to reach optimal performance (which is not the case for the length of warmup:
depending on the initial learning rate, longer or shorter warmup
may be required). Cool.&lt;/p&gt;
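&lt;p&gt;The rectification schedule can be written down directly from the paper’s formulas; this is a sketch of the schedule only, not a full optimizer:&lt;/p&gt;

```python
import math

def radam_rectification(t, beta2=0.999):
    """RAdam's variance rectification: returns None while the variance of
    the adaptive learning rate is intractable (fall back to plain momentum
    SGD), else the rectification factor r_t applied to the adaptive step."""
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    rho_t = rho_inf - 2.0 * t * beta2**t / (1.0 - beta2**t)
    if rho_t <= 4.0:  # adaptivity deactivated during the earliest steps
        return None
    return math.sqrt((rho_t - 4.0) * (rho_t - 2.0) * rho_inf
                     / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t))

assert radam_rectification(1) is None            # first steps: no adaptivity
r1000, r10000 = radam_rectification(1000), radam_rectification(10000)
assert 0.0 < r1000 < r10000 < 1.0                # rectification fades out
```

&lt;p&gt;The factor grows towards 1 over time, which reproduces the warmup-like behaviour without a tunable warmup length.&lt;/p&gt;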

&lt;h3 id=&quot;the-implicit-bias-of-depth-how-incremental-learning-drives-generalization&quot;&gt;&lt;a href=&quot;https://iclr.cc/virtual/poster_H1lj0nNFwB.html&quot;&gt;The Implicit Bias of Depth: How Incremental Learning Drives Generalization&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Set up a linear model $f_{\sigma}(x) = \langle \sigma, x \rangle$ and reparameterize it
using auxiliary variables: $\sigma = w_1 \cdot w_2 \cdot w_3 \dots$, then
train via gradient descent.&lt;/p&gt;

&lt;p&gt;Analysing the gradient flow shows that the learning dynamics differ
between using only $\sigma$ and the formulation with auxiliary variables.
In the shallow model, all values are learned at the same time, whereas with
increasing depth the values are learnt incrementally. The authors conjecture
that this is the cause of sparsity in these types of models: parameters that
decrease the loss the most are fitted first, so if there is a solution with
few non-zero values, this approach will most likely find it.&lt;/p&gt;

&lt;p&gt;The authors formalize the notion of incremental learning and derive
conditions under which it occurs. These conditions become much less strict as
the model becomes deeper. Thus, &lt;em&gt;deeper models allow for incremental
learning to occur more easily&lt;/em&gt; and deeper models are more biased towards
obtaining sparse solutions.&lt;/p&gt;

&lt;p&gt;Empirical results show that incremental learning occurs also under relaxed
assumptions.&lt;/p&gt;
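&lt;p&gt;The incremental-learning effect is easy to reproduce in a toy setting. This sketch (my own construction, not the authors’ experiment) fits two independent coordinates, each parameterized as a product of three weights, and shows that the coordinate with the larger target is fitted first:&lt;/p&gt;

```python
from math import prod

def train_deep_linear(target, depth=3, w0=0.3, lr=0.1, steps=200):
    """Gradient descent on 0.5 * (w_1*...*w_L - target)^2 for one
    coordinate of the reparameterized linear model."""
    w = [w0] * depth
    history = []
    for _ in range(steps):
        sigma = prod(w)
        history.append(sigma)
        grad = sigma - target
        # d sigma / d w_j is the product of the other weights, i.e. sigma / w_j
        w = [wj - lr * grad * sigma / wj for wj in w]
    return history

big = train_deep_linear(1.0)     # coordinate with a large optimal value
small = train_deep_linear(0.25)  # coordinate with a small optimal value

# incremental learning: when the large coordinate has reached 90% of its
# target, the small one is still far from its own
t = next(i for i, s in enumerate(big) if s >= 0.9)
assert small[t] / 0.25 < 0.5
```

&lt;p&gt;Both coordinates eventually converge; what differs is the order in which they are fitted, exactly the bias towards sparse solutions described above.&lt;/p&gt;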

&lt;h3 id=&quot;towards-neural-networks-that-provably-know-when-they-dont-know&quot;&gt;&lt;a href=&quot;https://iclr.cc/virtual/poster_ByxGkySKwH.html&quot;&gt;Towards neural networks that provably know when they don’t know&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Probabilistic model: Decompose $p(y|x)$ into an in-distribution and an out-of-distribution part.&lt;/p&gt;

\[p(y|x) = \frac{p(y|x,i) p(x|i) p(i) + p(y|x,o) p(x|o) p(o)}
         {p(x|i) p(i) + p(x|o) p(o)}\]

&lt;p&gt;This gives rise to generative models of in-distribution data $p(x|i)$ and
out-of-distribution data $p(x|o)$, and to the probability of the label given
that the data is out of distribution, $p(y|x,o) = 1/M$. It requires
out-of-distribution data! By using Gaussian mixture models as generative
models, the authors can prove that the model is not confident in regions
which differ significantly from the training data. Further, they show that
their approach can guarantee that entire volumes are assigned low
confidence.&lt;/p&gt;
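&lt;p&gt;The decomposition can be illustrated with one-dimensional Gaussians in place of the paper’s GMMs (all densities and parameters below are made up for illustration):&lt;/p&gt;

```python
import math

def gauss(x, mu, var):
    """Density of a 1-D Gaussian."""
    return math.exp(-(x - mu)**2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def predictive(x, p_y_given_x_i, n_classes=2, p_i=0.5):
    """Mixture decomposition of p(y|x): in-distribution density centred on
    the training data, a broad out-distribution density, and uniform labels
    (1/M) out of distribution."""
    px_i = gauss(x, mu=0.0, var=1.0)    # p(x | in): training data near 0
    px_o = gauss(x, mu=0.0, var=100.0)  # p(x | out): very broad
    p_o = 1.0 - p_i
    num = p_y_given_x_i * px_i * p_i + (1.0 / n_classes) * px_o * p_o
    den = px_i * p_i + px_o * p_o
    return num / den

# near the data a confident classifier stays confident ...
assert predictive(0.0, p_y_given_x_i=0.99) > 0.9
# ... far away the prediction collapses to the uniform 1/M = 0.5
assert abs(predictive(20.0, p_y_given_x_i=0.99) - 0.5) < 1e-3
```

&lt;p&gt;Far from the training data the in-distribution density vanishes relative to the broad out-distribution density, so the prediction provably collapses to $1/M$.&lt;/p&gt;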

&lt;p&gt;Experimental results indicate state of the art out of distribution detection
without reducing classification performance.&lt;/p&gt;

&lt;h3 id=&quot;truth-or-backpropaganda-an-empirical-investigation-of-deep-learning-theory&quot;&gt;&lt;a href=&quot;https://iclr.cc/virtual/poster_HyxyIgHFvr.html&quot;&gt;Truth or backpropaganda? An empirical investigation of deep learning theory&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;Suboptimal local minima DO exist in the loss landscape:&lt;/em&gt;
Constructed proof based on high-bias neurons which force ReLU units to
function as the identity map. These neurons can then kill other ReLU
neurons, constructing a smaller NN embedded inside the NN. Experiments show
that when initializing with a high bias or a high variance of the bias, the
test networks converge to suboptimal local minima.
&lt;em&gt;Suboptimal local minima do exist, but are avoided by careful initialization.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Low l2 norm parameters are not better:&lt;/em&gt;
A low l2 norm is motivated from many directions: SVMs, generalization theory,
and it induced the development of weight decay.&lt;/p&gt;

&lt;p&gt;Empirical test: use weight decay with a norm bias, such that the norm is
increased relative to plain weight decay. This improved performance across
architectures and datasets, and even improved performance without batch
normalization.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Neural Tangent Kernel patterns:&lt;/em&gt;
While the tangent kernel was confirmed to become constant in convnets, it
does not become constant in more involved architectures such as ResNets and
others with skip connections.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Low rank layers:&lt;/em&gt;
Experiments show that the theoretical results do not hold in the real world:
maximizing the rank outperforms rank minimization, and rank minimization also
produces less robust networks.&lt;/p&gt;

&lt;h3 id=&quot;what-graph-neural-networks-cannot-learn-depth-vs-width&quot;&gt;&lt;a href=&quot;https://iclr.cc/virtual/poster_B1l2bp4YwS.html&quot;&gt;What graph neural networks cannot learn: depth vs width&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Previous results indicate that the GNN message-passing model is equivalent
to the 1-WL test and thus not universal.&lt;/p&gt;

&lt;p&gt;These results were shown on anonymous graphs (where the nodes have no
features). The first result of the paper indicates that GNNs with message
passing are universal if they have “powerful layers”, are sufficiently deep
and wide, and &lt;em&gt;if the nodes are given unique features&lt;/em&gt;. If such features
do not exist in the graph, they can in theory be randomly assigned, but this
would probably lead to issues with generalization. Experiments confirm this:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;On the detection of a 4-cycle task a GNN without node labels can only
reach very poor performance (provably).&lt;/li&gt;
  &lt;li&gt;Degree features improve performance but don’t lead to optimal
performance&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Assigning unique features based on a canonical ordering leads to perfect
performance on train and test&lt;/em&gt;&lt;/li&gt;
  &lt;li&gt;Random features lead to optimal performance on train but not on test and
thus generalize very badly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The second result focuses on what cannot be learnt by GNNs with message
passing and suggests that the GNN size should depend on the number of nodes $n$:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Cannot solve many decision, optimization, verification and estimation
problems unless:&lt;/p&gt;

\[\text{depth} \times \text{width} = \Omega(n^d) \text{ for } d \in [0.5, 2]\]
  &lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Dependence on n even if task appears local $\Omega(n)$!&lt;/em&gt;&lt;/li&gt;
  &lt;li&gt;Hard problems (maximum independent set, minimum vertex cover, coloring)
require $\Omega(n^2)$&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All in all a very good study, which can be helpful when picking
hyperparameters for GNNs.&lt;/p&gt;

&lt;h3 id=&quot;why-gradient-clipping-accelerates-training-a-theoretical-justification-for-adaptivity&quot;&gt;&lt;a href=&quot;https://iclr.cc/virtual/poster_BJgnXpVYwS.html&quot;&gt;Why Gradient Clipping Accelerates Training: A Theoretical Justification for Adaptivity&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;While vanilla SGD is theoretically optimal, in practice it performs worse than
extensions such as SGD with momentum and Adam. Where does this gap come
from?&lt;/p&gt;

&lt;p&gt;Key insight: the proof of SGD optimality relies on three assumptions:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Differentiability&lt;/li&gt;
  &lt;li&gt;Bounded second moments&lt;/li&gt;
  &lt;li&gt;L-smoothness, which is the focus of this talk&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;L-smoothness is actually a very strict criterion. Empirical evaluation shows
that the smoothness of a NN changes dramatically during training and that it
correlates with the gradient norm.&lt;/p&gt;

&lt;p&gt;The authors suggest a relaxed smoothness criterion ($(L_0, L_1)$-smoothness),
which accounts for the empirical observations better than the strict
criterion.&lt;/p&gt;

&lt;p&gt;They show that under the relaxed smoothness criterion, the convergence of SGD
depends on the maximal norm of the gradient. This means vanilla SGD
would not converge if the gradient is not upper bounded.
Further, the paper shows that clipped gradient descent does not have
such a dependence on the maximal norm of the gradient.&lt;/p&gt;

&lt;p&gt;High-level intuition: with clipping, SGD can traverse non-smooth areas where
it would otherwise diverge. This is especially the case when training with
a high learning rate.&lt;/p&gt;
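&lt;p&gt;The clipping operation itself is tiny; a minimal sketch of norm-based gradient clipping:&lt;/p&gt;

```python
def clip_gradient(grad, max_norm):
    """Rescale a gradient vector so its L2 norm never exceeds max_norm."""
    norm = sum(g * g for g in grad) ** 0.5
    if norm <= max_norm:
        return list(grad)
    return [g * max_norm / norm for g in grad]

clipped = clip_gradient([3.0, 4.0], max_norm=1.0)   # norm 5 -> rescaled to 1
assert abs(sum(g * g for g in clipped) ** 0.5 - 1.0) < 1e-9
assert clip_gradient([0.3, 0.4], max_norm=1.0) == [0.3, 0.4]  # untouched
```

&lt;p&gt;A clipped SGD step then uses the rescaled gradient, which is what bounds the effective step size in non-smooth regions.&lt;/p&gt;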

&lt;h2 id=&quot;other---normalizing-flows-robustness-meta-learning-probabilistic-modelling&quot;&gt;Other - Normalizing Flows, Robustness, Meta-Learning, Probabilistic Modelling&lt;/h2&gt;
&lt;h3 id=&quot;invertible-models-and-normalizing-flows&quot;&gt;&lt;a href=&quot;https://iclr.cc/virtual/speaker_4.html&quot;&gt;Invertible Models and Normalizing Flows&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Reference: &lt;a href=&quot;https://arxiv.org/pdf/1908.09257&quot;&gt;https://arxiv.org/pdf/1908.09257&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Generative networks paradigm:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Variational Autoencoder (previously called Helmholtz machine)&lt;/li&gt;
  &lt;li&gt;Generative Adversarial Networks using Noise Contrastive Estimation (the discriminator)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Two parts determine the likelihood of a normalizing flow:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;the likelihood of the transformed variable in latent space under the
prior (which pushes everything towards zero if we assume N(0,1))&lt;/li&gt;
  &lt;li&gt;the determinant of the Jacobian, penalizing strong compression (similar
to an entropy regularization term)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sampling:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Sample from the prior&lt;/li&gt;
  &lt;li&gt;Pass the sample through the flow and you are done&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;History of flows:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;First flows:
    &lt;ul&gt;
      &lt;li&gt;Require square weight matrices&lt;/li&gt;
      &lt;li&gt;Strictly increasing activation functions&lt;/li&gt;
      &lt;li&gt;Should make the whole network invertible.
 Issue: Computing the determinant of the Jacobian is problematic!
 Space: $\mathcal{O}(d^2)$, runtime from $\mathcal{O}(d!)$ to $\mathcal{O}(d^3)$&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Autoregressive models: have a triangular Jacobian!&lt;/li&gt;
  &lt;li&gt;Using an LU decomposition to enforce a triangular Jacobian: works really badly.&lt;/li&gt;
  &lt;li&gt;Coupling layers (NICE): Use a function of one fraction of the inputs to
scale and shift the other fraction. This is invertible and has a very easy to
compute Jacobian! The paper actually got rejected twice during the review process!&lt;/li&gt;
  &lt;li&gt;Real NVP: Batch normalization, Convolutional NN, small extensions&lt;/li&gt;
&lt;/ul&gt;
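&lt;p&gt;A coupling layer is compact enough to sketch in full. This is a generic RealNVP-style affine coupling with a made-up conditioner, not code from either paper:&lt;/p&gt;

```python
import numpy as np

def coupling_forward(x, scale_shift):
    """Affine coupling: transform the second half of x conditioned on the
    first half; the Jacobian is triangular, so its log-determinant is just
    the sum of the log scales."""
    d = len(x) // 2
    x1, x2 = x[:d], x[d:]
    log_s, t = scale_shift(x1)
    return np.concatenate([x1, x2 * np.exp(log_s) + t]), np.sum(log_s)

def coupling_inverse(y, scale_shift):
    """Exact inverse: the first half was untouched, so the same conditioner
    output can be recomputed and undone."""
    d = len(y) // 2
    y1, y2 = y[:d], y[d:]
    log_s, t = scale_shift(y1)
    return np.concatenate([y1, (y2 - t) * np.exp(-log_s)])

def conditioner(x1):
    # hypothetical conditioner: any function of the first half works
    return np.tanh(x1), x1 ** 2     # (log-scales, shifts)

x = np.array([0.5, -1.0, 2.0, 0.3])
y, log_det = coupling_forward(x, conditioner)
assert np.allclose(coupling_inverse(y, conditioner), x)
```

&lt;p&gt;The conditioner can be arbitrarily complex (e.g. a deep network) without affecting invertibility or the cost of the log-determinant.&lt;/p&gt;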

&lt;p&gt;Relevance of log-likelihood:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Sample quality vs. log-likelihood:
Log-likelihood and sample quality do not have to be aligned, but (until now
at least) there seems to be a very strong correlation (possibly look at
Theis &amp;amp; van den Oord et al. 2015).&lt;/li&gt;
  &lt;li&gt;Density as a measure of typicality (for anomaly detection):
Its role as a measure of typicality is questionable; so far the two do not
seem to be aligned.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Future directions:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Learning flows on manifolds&lt;/li&gt;
  &lt;li&gt;Add in prior knowledge into the flow, such as symmetries&lt;/li&gt;
  &lt;li&gt;Discrete change of variables, requires tricks (such as continuous
relaxation) for backprop&lt;/li&gt;
  &lt;li&gt;Variational approximations, mapping discrete variables to continuous
distributions, using dequantization&lt;/li&gt;
  &lt;li&gt;Adaptive sparsity patterns&lt;/li&gt;
  &lt;li&gt;Non-invertible models (Dinh 2019)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;learning-to-balance-bayesian-meta-learning-for-imbalanced-and-out-of-distribution-tasks&quot;&gt;&lt;a href=&quot;https://iclr.cc/virtual/poster_rkeZIJBYvr.html&quot;&gt;Learning to Balance: Bayesian Meta-Learning for Imbalanced and Out-of-distribution Tasks&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Focuses on extending MAML to unbalanced test datasets and out of
distribution tasks. This is mostly implemented by adapting the inner
gradient update according to the following formula:&lt;/p&gt;

\[\begin{aligned}
\theta_0 &amp;amp;= \theta * z^\tau \\\\
\theta_k &amp;amp;= \theta_{k-1} - \gamma^\tau \circ \alpha \circ \sum_{c=1}^C \omega^\tau_c \nabla_\theta \mathcal{L}_c
\end{aligned}\]

&lt;p&gt;where $*$ is abused in notation to “multiply if component in $\theta$ is
a weight and add if component in $\theta$ is a bias” and $\circ$ represents
element wise multiplication.&lt;/p&gt;

&lt;p&gt;The components have the following roles:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;$\omega^\tau_c$ scales the gradients of all instances with class $c$.
This allows taking larger gradient steps for classes with more examples &lt;em&gt;in
order to tackle class imbalance&lt;/em&gt; (is this equivalent to computing a weighted loss?)&lt;/li&gt;
  &lt;li&gt;$\gamma^\tau$ scales the size of gradient steps on a per task basis &lt;em&gt;in
order to tackle task imbalance&lt;/em&gt;. Reasoning: Larger tasks should have
larger $\gamma$ because they can rely more on task-specific updates,
while smaller tasks should have small $\gamma$ because they should rely
more on meta-knowledge.&lt;/li&gt;
  &lt;li&gt;$z^\tau$ relocates the initial weights for each task. This allows tasks
which are dissimilar (out of distribution) to the original meta-learnt tasks
to be shifted further away in weight space, while tasks which are close to
the meta-learnt tasks remain closer. &lt;em&gt;Tackles out-of-distribution
tasks&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;
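&lt;p&gt;The modified inner update can be sketched directly from the formula above (shapes and values are made up; $\circ$ becomes elementwise multiplication):&lt;/p&gt;

```python
import numpy as np

def inner_update(theta, class_grads, omega, gamma, alpha):
    """One inner-loop step: per-class gradients are scaled by omega, and
    the combined step by the task-wise gamma and the learning rate alpha,
    all elementwise."""
    combined = sum(w * g for w, g in zip(omega, class_grads))
    return theta - gamma * alpha * combined

theta = np.zeros(3)
grads = [np.ones(3), 2 * np.ones(3)]     # gradients of the per-class losses
new = inner_update(theta, grads, omega=[0.9, 0.1], gamma=np.ones(3), alpha=0.5)
assert np.allclose(new, -0.55 * np.ones(3))
```

&lt;p&gt;The interesting part of the paper is that $\omega$, $\gamma$ and $z$ are not fixed hyperparameters but inferred per task by the amortized inference network.&lt;/p&gt;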

&lt;p&gt;The parameters are inferred using amortized variational inference, where the
inference network is shared across all tasks and the parameters are computed
conditionally on summary statistics derived from the task instances (see
below).&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/2020-05-19-ICLR-2020/learning-to-balance-figure.png&quot; alt=&quot;Inference network&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In general, I find it very interesting how the authors incorporate additional
terms into the inner gradient computation loop and how the inference of the
associated parameters is implemented using variational inference on the
summary statistics.&lt;/p&gt;

&lt;h3 id=&quot;meta-dropout-learning-to-perturb-latent-features-for-generalization&quot;&gt;&lt;a href=&quot;https://iclr.cc/virtual/poster_BJgd81SYwr.html&quot;&gt;Meta Dropout: Learning to Perturb Latent Features for Generalization&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The authors suggest additionally learning how to perturb intermediate
representations during meta-learning using input-dependent dropout. This is
done by computing the parameters of the noise distribution dependent on the
output of the previous layer, which results in the following preactivation:&lt;/p&gt;

\[\begin{aligned}
a^{(l)} &amp;amp;\sim \mathcal{N}(\mu^{(l)}(h^{(l-1)}), I) \\\\
z^{(l)} &amp;amp;= Softplus(a^{(l)}) \\\\
h^{(l)} &amp;amp;= ReLU(W^{(l)} h^{(l-1)} \cdot z^{(l)})
\end{aligned}\]

&lt;p&gt;The parameters of the function $ \mu $ and the weights of the layer $W$ are
trained using the meta-learning objective of MAML. Importantly, the
weights of $\mu$ are not fitted in the inner training loop of MAML, as this
would allow the algorithm to “optimize away” the perturbation. Interestingly,
they use a lower bound to optimize the inner loop. This is not really
motivated, but the authors draw a connection to variational inference, where
q and p are from the same family and thus the KL would equal 0. The results
actually look very good.&lt;/p&gt;
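&lt;p&gt;The preactivation above translates almost line by line into code (a sketch with arbitrary shapes and an arbitrary noise network $\mu$):&lt;/p&gt;

```python
import numpy as np

def meta_dropout_layer(h_prev, W, mu, rng):
    """One layer with input-dependent multiplicative noise: sample
    a ~ N(mu(h_prev), I), squash with a softplus, and multiply the linear
    output elementwise before the ReLU."""
    a = mu(h_prev) + rng.normal(size=W.shape[0])   # a ~ N(mu(h_prev), I)
    z = np.log1p(np.exp(a))                        # softplus
    return np.maximum(W @ h_prev * z, 0.0)         # ReLU(W h * z)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
h = rng.normal(size=3)
out = meta_dropout_layer(h, W, mu=lambda x: np.zeros(4), rng=rng)
assert out.shape == (4,) and np.all(out >= 0.0)
```

&lt;p&gt;With $\mu \equiv -\infty$ the noise would zero out units like ordinary dropout; learning $\mu$ makes the perturbation input-dependent.&lt;/p&gt;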

&lt;h3 id=&quot;on-robustness-of-neural-ordinary-differential-equations&quot;&gt;&lt;a href=&quot;https://iclr.cc/virtual/poster_B1e9Y2NYvS.html&quot;&gt;On Robustness of Neural Ordinary Differential Equations&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Empirical study of Neural ODEs compared to ResNets. Show that they are more
robust to random noise and to adversarial attacks. Further suggest
Time-invariant steady state ODE (TisODE), to additionally improve
robustness. This adds time invariance in the ODE function and a specific
steady state condition.&lt;/p&gt;

&lt;p&gt;They argue that the NeuralODE is more robust to perturbations because the ODE
integral curves do not intersect (see Theorem 1). Due to this non-intersecting
property, each perturbation of less than $\epsilon$ remains sandwiched within
the $\epsilon$-ball, and its distance from the unperturbed final position
remains upper bounded. Quite a cool insight.
The authors’ extensions in TisODE aim at controlling the differences
between neighboring integral curves to further enhance NeuralODE robustness.&lt;/p&gt;

&lt;h3 id=&quot;why-not-to-use-zero-imputation-correcting-sparsity-bias-in-training-neural-networks&quot;&gt;&lt;a href=&quot;https://iclr.cc/virtual/poster_BylsKkHYvH.html&quot;&gt;Why Not to Use Zero Imputation? Correcting Sparsity Bias in Training Neural Networks&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Using zero imputation induces a bias in the network! Often the prediction
correlates more with the fraction of imputed values than with the
similarity of the instances! This effect is present across a large variety
of datasets, including medical data.&lt;/p&gt;

&lt;p&gt;The authors coin the term “variable sparsity problem”: the expected value of
the output layer of a NN depends on the sparsity of the input data. The
existence of this problem is derived theoretically based on a few
assumptions.&lt;/p&gt;

&lt;p&gt;The paper suggests that imputation with non-zero values is helpful, as it
stabilizes the number of known entries, and phrases imputation as injecting
plausible noise. Still, it should be considered as injecting noise into the
network!&lt;/p&gt;

&lt;p&gt;Authors suggest to use sparsity normalization:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Divide the input data by the L1 norm&lt;/li&gt;
  &lt;li&gt;Then the average activation of the subsequent layer would be independent
of data sparsity&lt;/li&gt;
  &lt;li&gt;This is a simple preprocessing scheme of the input data!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Authors show theoretically that this preprocessing step can fix the variable
sparsity problem. Empirical results are also in line with the theory.&lt;/p&gt;
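&lt;p&gt;Sparsity normalization as described is a one-line preprocessing step; a minimal sketch:&lt;/p&gt;

```python
import numpy as np

def sparsity_normalize(x):
    """Divide each (zero-imputed) input vector by its L1 norm, so the scale
    of downstream activations no longer depends on how many entries are
    missing."""
    norms = np.abs(x).sum(axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # leave all-zero rows untouched
    return x / norms

dense = np.array([[1.0, 1.0, 1.0, 1.0]])
sparse = np.array([[2.0, 2.0, 0.0, 0.0]])   # half the entries imputed as zero
# after normalization, both rows carry the same total mass
assert np.isclose(sparsity_normalize(dense).sum(), 1.0)
assert np.isclose(sparsity_normalize(sparse).sum(), 1.0)
```

&lt;p&gt;Because it only touches the input, it can be combined with any architecture without retraining changes downstream.&lt;/p&gt;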
</description>
        <pubDate>Sat, 23 May 2020 00:00:00 +0000</pubDate>
        <link>https://ExpectationMax.github.io/2020/ICLR-2020/</link>
        <guid isPermaLink="true">https://ExpectationMax.github.io/2020/ICLR-2020/</guid>
        
        <category>conferences</category>
        
        <category>papers</category>
        
        
        
      </item>
    
      <item>
        <title>Project: simple-gpu-scheduler - easy scheduling of jobs on multiple GPUs</title>
        <description>&lt;p&gt;Our research group has multiple servers each equipped with multiple GPUs.
Unfortunately, these are not connected together in a cluster infrastructure,
but instead, GPUs are assigned to individuals or on a per-project basis. This
makes the execution of many jobs using multiple GPUs difficult.&lt;/p&gt;

&lt;p&gt;While it would be possible to connect the servers into a small cluster with
a scheduling system (we are working on it!), it can take a long time until such
a system is set up. Especially in academia, where the maintenance and setup of
servers is often delegated to the department’s IT team, the path to
implementing a small-scale cluster is littered with bureaucracy. Questions like &lt;em&gt;Who is
responsible for xyz?&lt;/em&gt;, &lt;em&gt;How are the software installations managed?&lt;/em&gt; and
&lt;em&gt;Which alterations should be done to have the correct network infrastructure?&lt;/em&gt;
can take ages before they are answered and appropriately implemented. In our
particular case we had the idea of refurbishing the cluster more than a year
ago and are still nowhere close to having it up and running.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://imgs.xkcd.com/comics/networking_problems.png&quot; alt=&quot;XKCD comic about networking problems&quot; title=&quot;LOOK, THE LATENCY FALLS EVERY TIME YOU CLAP YOUR HANDS AND SAY YOU BELIEVE&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;the-alternative---simple-gpu-scheduler&quot;&gt;The Alternative - &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;simple-gpu-scheduler&lt;/code&gt;&lt;/h2&gt;

&lt;p&gt;Driven by the need of having something as a bridge between our current server
setup and the to be beautiful world of our personal cluster I decided to write
a small Python package to do the job. This is how
&lt;a href=&quot;https://github.com/ExpectationMax/simple_gpu_scheduler&quot;&gt;simple-gpu-scheduler&lt;/a&gt;
was born.&lt;/p&gt;

&lt;h3 id=&quot;how-it-works&quot;&gt;How it works&lt;/h3&gt;

&lt;p&gt;Software based on the CUDA library (such as most deep learning frameworks and
many others) can be constrained to see only certain GPUs using the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CUDA_VISIBLE_DEVICES&lt;/code&gt; environment variable. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;simple-gpu-scheduler&lt;/code&gt; accepts
commands and executes them while setting the environment variable to
a currently free GPU. As soon as a job finishes, its GPU is released and the
next job is allocated to it. This makes it possible to always utilize all of the GPUs to
the maximal possible extent &lt;sup id=&quot;fnref:gnu-parallel&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:gnu-parallel&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
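&lt;p&gt;The core idea fits in a few lines of Python. This is a toy sketch of the mechanism, not the actual implementation of simple-gpu-scheduler:&lt;/p&gt;

```python
import os
import queue
import subprocess
import threading

def run_on_free_gpus(commands, gpu_ids):
    """One worker thread per GPU pulls commands from a shared queue and
    pins each command to its GPU via CUDA_VISIBLE_DEVICES."""
    jobs = queue.Queue()
    for cmd in commands:
        jobs.put(cmd)

    def worker(gpu_id):
        while True:
            try:
                cmd = jobs.get_nowait()
            except queue.Empty:
                return  # no jobs left, release the worker
            env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu_id)}
            subprocess.run(cmd, shell=True, env=env)

    workers = [threading.Thread(target=worker, args=(g,)) for g in gpu_ids]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```

&lt;p&gt;Because the queue is thread-safe, each job is handed to whichever GPU worker becomes free first, which is exactly the behaviour described above.&lt;/p&gt;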

&lt;h3 id=&quot;usage&quot;&gt;Usage&lt;/h3&gt;

&lt;p&gt;I wanted to make &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;simple-gpu-scheduler&lt;/code&gt; as simple and flexible as possible and
thus tried to adhere to the &lt;a href=&quot;https://en.wikipedia.org/wiki/KISS_principle&quot;&gt;KISS
principle&lt;/a&gt;. Like many UNIX tools
it takes its input from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;stdin&lt;/code&gt; so that it can be combined with other
tools. This allows reading commands from a list, or even from a FIFO (first in,
first out), such that we can build a fully functioning queuing system. For further
reference please consult the &lt;a href=&quot;https://github.com/ExpectationMax/simple_gpu_scheduler&quot;&gt;GitHub
page&lt;/a&gt; of the project.&lt;/p&gt;

&lt;h3 id=&quot;simple-example&quot;&gt;Simple example&lt;/h3&gt;

&lt;p&gt;Suppose you have a file &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gpu_commands.txt&lt;/code&gt; with commands that you would like to
execute on the GPUs 0, 1 and 2 in parallel:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cat &lt;/span&gt;gpu_commands.txt
python train_model.py &lt;span class=&quot;nt&quot;&gt;--lr&lt;/span&gt; 0.001 &lt;span class=&quot;nt&quot;&gt;--output&lt;/span&gt; run_1
python train_model.py &lt;span class=&quot;nt&quot;&gt;--lr&lt;/span&gt; 0.0005 &lt;span class=&quot;nt&quot;&gt;--output&lt;/span&gt; run_2
python train_model.py &lt;span class=&quot;nt&quot;&gt;--lr&lt;/span&gt; 0.0001 &lt;span class=&quot;nt&quot;&gt;--output&lt;/span&gt; run_3
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Then you can do so by simply piping the command into the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;simple_gpu_scheduler&lt;/code&gt;
script&lt;/p&gt;
&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;simple_gpu_scheduler &lt;span class=&quot;nt&quot;&gt;--gpus&lt;/span&gt; 0 1 2 &amp;lt; gpu_commands.txt
Processing &lt;span class=&quot;nb&quot;&gt;command&lt;/span&gt; &lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;python train_model.py &lt;span class=&quot;nt&quot;&gt;--lr&lt;/span&gt; 0.001 &lt;span class=&quot;nt&quot;&gt;--output&lt;/span&gt; run_1&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt; on gpu 2
Processing &lt;span class=&quot;nb&quot;&gt;command&lt;/span&gt; &lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;python train_model.py &lt;span class=&quot;nt&quot;&gt;--lr&lt;/span&gt; 0.0005 &lt;span class=&quot;nt&quot;&gt;--output&lt;/span&gt; run_2&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt; on gpu 1
Processing &lt;span class=&quot;nb&quot;&gt;command&lt;/span&gt; &lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt;python train_model.py &lt;span class=&quot;nt&quot;&gt;--lr&lt;/span&gt; 0.0001 &lt;span class=&quot;nt&quot;&gt;--output&lt;/span&gt; run_3&lt;span class=&quot;sb&quot;&gt;`&lt;/span&gt; on gpu 0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;hyperparameter-search&quot;&gt;Hyperparameter search&lt;/h3&gt;

&lt;p&gt;One of the most common use cases for running many jobs in parallel is
hyperparameter search. For convenience I added a small script
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;simple_hypersearch&lt;/code&gt; which generates commands to evaluate a hyperparameter
grid. Here is a small example of how to generate all possible configurations
and execute them in random order:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;simple_hypersearch &lt;span class=&quot;s2&quot;&gt;&quot;python3 train_dnn.py --lr {lr} --batch_size {bs}&quot;&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-p&lt;/span&gt; lr 0.001 0.0005 0.0001 &lt;span class=&quot;nt&quot;&gt;-p&lt;/span&gt; bs 32 64 128 | simple_gpu_scheduler &lt;span class=&quot;nt&quot;&gt;--gpus&lt;/span&gt; 0,1,2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;final-words&quot;&gt;Final words&lt;/h3&gt;

&lt;p&gt;I hope some of you find the software useful. Feel free to open issues and
feature requests if you need any further features. See you next time!&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:gnu-parallel&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;&lt;a href=&quot;https://www.gnu.org/software/parallel/&quot;&gt;GNU parallel&lt;/a&gt;
can be used to do something similar (see the
&lt;a href=&quot;https://news.ycombinator.com/item?id=21269950&quot;&gt;HN discussion&lt;/a&gt;). It is
significantly more flexible, which IMHO comes at the cost of ease of use. &lt;a href=&quot;#fnref:gnu-parallel&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Fri, 25 Oct 2019 00:00:00 +0000</pubDate>
        <link>https://ExpectationMax.github.io/2019/simple-gpu-scheduler/</link>
        <guid isPermaLink="true">https://ExpectationMax.github.io/2019/simple-gpu-scheduler/</guid>
        
        <category>development</category>
        
        
        
      </item>
    
      <item>
        <title>Organizing projects and notes with vimwiki and VimR</title>
        <description>&lt;p&gt;After a long absence I decided to reactivate my blog! So here comes
another post on optimizing the workflow of a PhD student in Computer
Science.&lt;/p&gt;

&lt;h2 id=&quot;the-problem&quot;&gt;The problem&lt;/h2&gt;

&lt;p&gt;If there is one thing you should do a lot during your PhD studies, it is
reading: papers, course materials, books, blog articles. Over time it becomes
hard to keep track of everything you have read and the thoughts you had while
reading. While reference management software can track this kind of
information (especially for papers), it is usually neither possible nor
desirable to store notes from such heterogeneous sources in that format.&lt;/p&gt;

&lt;h2 id=&quot;a-potential-solution&quot;&gt;A potential solution&lt;/h2&gt;

&lt;p&gt;As I have developed a strong resistance against using practically any
editor besides (Neo)vim &lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;, I searched for a solution within this framework. The
most attractive option for my taste was &lt;a href=&quot;https://github.com/vimwiki/vimwiki&quot;&gt;vimwiki&lt;/a&gt;.
It allows you to:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Organize notes from multiple projects in a hierarchy&lt;/li&gt;
  &lt;li&gt;Navigate between them seamlessly&lt;/li&gt;
  &lt;li&gt;Create new pages with minimal effort&lt;/li&gt;
  &lt;li&gt;Use a markdown-compatible syntax (via configuration)&lt;/li&gt;
  &lt;li&gt;Synchronize changes via git, as everything is plain text, and even
host the notes in a personal wiki on GitHub&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;configuration&quot;&gt;Configuration&lt;/h2&gt;

&lt;p&gt;I configured vimwiki such that it saves all my notes in a folder &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PhDwiki&lt;/code&gt; in
my home directory, uses markdown syntax and only gets activated when files
inside this notes folder are edited (I want to be able to configure my
general markdown integration independently of vimwiki). This can
be achieved by adding the following settings to your &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.vimrc&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;init.vim&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-vimscript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;g:vimwiki_list&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[{&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'path'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'~/PhDwiki/'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'syntax'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'markdown'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'ext'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'.md'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'index'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'Home'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}]&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;g:vimwiki_global_ext&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;&quot; Install the plugin, this uses vim-plug anything else should also do&lt;/span&gt;
Plug &lt;span class=&quot;s1&quot;&gt;'vimwiki/vimwiki'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;working-with-equations-in-the-vimr-preview&quot;&gt;Working with equations in the VimR preview&lt;/h2&gt;

&lt;p&gt;While it is possible to configure vim to replace some symbols in equations
with their rendered counterparts, this rarely improves readability, mainly
because of the very limited support for operations such as sub- and
superscripts. For revising my writing, I therefore rely on the preview
provided by &lt;a href=&quot;https://github.com/qvacua/vimr&quot;&gt;VimR&lt;/a&gt;. While I deactivated all
additional GUI features such as the file browser and buffer view, the
markdown and html previews are, to me, the most beneficial components of
a GUI compared to running NeoVim in the terminal.&lt;/p&gt;

&lt;p&gt;Unfortunately, this preview does not support equations out of the box,
even though they are omnipresent in courses and papers. Yet not all hope is
lost: thanks to the nifty javascript tool &lt;a href=&quot;https://www.mathjax.org/&quot;&gt;MathJax&lt;/a&gt;, it can be
made to do so. VimR uses a small embedded browser with full javascript support
to render markdown and html pages, so by modifying the markdown template to
include MathJax we can patch in support for equations.&lt;/p&gt;

&lt;p&gt;For this we need to edit the template file
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VimR.app/Contents/Resources/markdown/template.html&lt;/code&gt; by adding the following
lines just before the closing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;/head&amp;gt;&lt;/code&gt; tag. The relevant region of the edited
file should look as follows:&lt;/p&gt;

&lt;div class=&quot;language-html highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;  &lt;span class=&quot;nt&quot;&gt;&amp;lt;title&amp;gt;&amp;lt;/title&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;script &lt;/span&gt;&lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;text/javascript&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nb&quot;&gt;window&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;MathJax&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;extensions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;tex2jax.js&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;jax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;input/TeX&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;output/HTML-CSS&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;tex2jax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;inlineMath&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\\&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\\&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;'&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;dl&quot;&gt;'&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;displayMath&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;'&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;$$&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;'&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;$$&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\\&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\\&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;processEscapes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;true&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
      &lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;HTML-CSS&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
          &lt;span class=&quot;na&quot;&gt;availableFonts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;STIX&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
          &lt;span class=&quot;na&quot;&gt;preferredFont&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;dl&quot;&gt;'&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;STIX&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
          &lt;span class=&quot;na&quot;&gt;webFont&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;dl&quot;&gt;'&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;STIX-Web&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
          &lt;span class=&quot;na&quot;&gt;imageFont&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;null&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;script &lt;/span&gt;&lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;text/javascript&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;src=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/MathJax.js&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;async&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/head&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This adds MathJax in a minimal configuration to the markdown template,
allowing it to render math equations. Below you can see how the result looks:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/2019-04-12_VimR_equations.jpg&quot; alt=&quot;Vimr with markdown preview&quot; /&gt;&lt;/p&gt;

&lt;p&gt;While this works well in practice, it is still a rather hacky solution. For
example, it is sometimes necessary to wrap an equation in a pair of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;p&amp;gt;&amp;lt;/p&amp;gt;&lt;/code&gt;
tags to prevent the markdown renderer from mangling it. To make
the approach more integrated and robust to updates of the software, I opened an
issue &lt;a href=&quot;https://github.com/qvacua/vimr/issues/718&quot;&gt;here&lt;/a&gt;. I will edit this post
if there is any follow-up development.&lt;/p&gt;
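
&lt;p&gt;As an illustration of the workaround (the equations themselves are just examples), a wiki page might contain:&lt;/p&gt;

```markdown
Wrapping the display equation in explicit &lt;p&gt; tags keeps the
markdown renderer from touching it:

&lt;p&gt;
$$E = mc^2$$
&lt;/p&gt;

Inline math like $a^2 + b^2 = c^2$ usually renders without the workaround.
```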

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;I am planning to do a separate blog post on the benefits and disadvantages of using an editor like Vim for most of your work. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
        <pubDate>Fri, 12 Apr 2019 00:00:00 +0000</pubDate>
        <link>https://ExpectationMax.github.io/2019/organizing-projects-with-vimwiki-and-VimR/</link>
        <guid isPermaLink="true">https://ExpectationMax.github.io/2019/organizing-projects-with-vimwiki-and-VimR/</guid>
        
        <category>development</category>
        
        
        
      </item>
    
      <item>
        <title>Setting up a Neovim and pipenv based Python development environment</title>
        <description>&lt;p&gt;I think everybody has been there after some time:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;multiple python venvs for dozens of projects&lt;/li&gt;
  &lt;li&gt;huge &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;requirements.txt&lt;/code&gt; files containing all dependencies of dependencies&lt;/li&gt;
  &lt;li&gt;Jupyter notebooks everywhere, including their dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For the start of my PhD I decided to try to bring some order into the chaos of environments and dependencies by switching to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pipenv&lt;/code&gt;. Furthermore, I show how to implement &lt;strong&gt;jupyter notebook&lt;/strong&gt; style programming in a Neovim/Oni development environment.&lt;/p&gt;

&lt;h2 id=&quot;pipenv&quot;&gt;pipenv&lt;/h2&gt;

&lt;p&gt;Pipenv is a tool for managing per-project virtual environments that additionally enhances reproducibility by recording checksums of installed packages (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Pipfile.lock&lt;/code&gt;).
It is the package manager recommended by http://www.python.org, is straightforward to install and also supports loading project-specific environment variables from an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.env&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;Virtual environments created by pipenv are not stored inside the project repository, and no additional files are added besides the Pipfile and the Pipfile.lock (which you actually want to commit to ensure reproducibility).
The strategy is to avoid installing packages outside of pipenv (for example with pip), which automatically ensures that all project dependencies are tracked and up to date. Overall pretty neat in my opinion.&lt;/p&gt;
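
&lt;p&gt;For reference, a freshly set up project might end up with a Pipfile along these lines (the package names and Python version are purely illustrative):&lt;/p&gt;

```toml
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
numpy = "*"

[dev-packages]
ipykernel = "*"

[requires]
python_version = "3.6"
```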

&lt;p&gt;On macOS with &lt;a href=&quot;https://brew.sh/&quot;&gt;brew&lt;/a&gt; you can be up and running with Python 3 and pipenv using the following commands:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;brew &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;python3
brew &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;pipenv
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Afterwards, we can install Jupyter into the global Python 3 environment (or the user’s Python 3 environment by adding the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--user&lt;/code&gt; flag) using:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pip3 &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;--user&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; jupyter
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;avoiding-reinstalling-jupyter-for-all-venvs&quot;&gt;Avoiding reinstalling Jupyter for all venvs&lt;/h3&gt;

&lt;p&gt;Now that we have Jupyter installed in the global environment, we don’t want to reinstall all of its dependencies for each virtual environment/project we work on.
The trick is that we only need to install the Jupyter kernel in the individual virtual environments and register these kernels with the global installation.&lt;/p&gt;

&lt;p&gt;The kernel package required for a Jupyter/IPython interface (notebook, QT Console, console) to communicate with an environment is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ipykernel&lt;/code&gt;, which can be installed as a development dependency in pipenv (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pipenv install --dev ipykernel&lt;/code&gt;).
Afterwards, the new kernel needs to be registered with the global Jupyter installation.
To make the whole process easier, I added a small bash function to my &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;~/.bashrc&lt;/code&gt; that creates a Python 3 environment, installs &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ipykernel&lt;/code&gt; as a development dependency and registers the new kernel for use with the global Jupyter installation.&lt;/p&gt;

&lt;p&gt;To get the same functionality, add the following lines to your &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;~/.bashrc&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;init_python3_pipenv &lt;span class=&quot;o&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
   &lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;Setting up pipenv environment&quot;&lt;/span&gt;
   pipenv &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--three&lt;/span&gt;
   &lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;Installing ipython kernel&quot;&lt;/span&gt;
   pipenv &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--dev&lt;/span&gt; ipykernel
   &lt;span class=&quot;c&quot;&gt;# get name of environment and remove checksum for pretty name&lt;/span&gt;
   &lt;span class=&quot;nv&quot;&gt;venv_name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;basename&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; &lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;pipenv &lt;span class=&quot;nt&quot;&gt;--venv&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;))&lt;/span&gt;
   &lt;span class=&quot;nv&quot;&gt;venv_prettyname&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$venv_name&lt;/span&gt; | &lt;span class=&quot;nb&quot;&gt;cut&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-d&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'-'&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-f&lt;/span&gt; 1&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt;
   &lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;Adding ipython kernel to list of jupyter kernels&quot;&lt;/span&gt;
   &lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;pipenv &lt;span class=&quot;nt&quot;&gt;--py&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-m&lt;/span&gt; ipykernel &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--user&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--name&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$venv_name&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
   &lt;span class=&quot;nt&quot;&gt;--display-name&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;Python3 (&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$venv_prettyname&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;)&quot;&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
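
&lt;p&gt;To see what the name extraction inside the function does, here is the same two-step transformation applied in isolation to a hypothetical virtualenv path (pipenv appends a hash of the project path to the folder name):&lt;/p&gt;

```shell
# Hypothetical venv path as pipenv would create it
venv_path="$HOME/.local/share/virtualenvs/MyAwesomeProject-a1b2C3d4"
# Full folder name, including the hash suffix
venv_name=$(basename -- "$venv_path")
# Strip everything after the first '-' to get a pretty display name
venv_prettyname=$(echo "$venv_name" | cut -d '-' -f 1)
echo "$venv_prettyname"   # prints: MyAwesomeProject
```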

&lt;p&gt;A new project can now easily be set up using:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;mkdir&lt;/span&gt; ~/Projects/MyAwesomeProject
&lt;span class=&quot;nb&quot;&gt;cd&lt;/span&gt; ~/Projects/MyAwesomeProject
init_python3_pipenv
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;jupyter-notebook-style-programming-in-onineovim&quot;&gt;Jupyter notebook style programming in Oni/Neovim&lt;/h2&gt;

&lt;p&gt;For the vim users out there, I will explain how to turn vim into an interactive development environment similar to working in a Jupyter notebook or an IDE like Spyder.
The setup involves launching an IPython kernel in a QT Console, establishing a remote connection to the kernel using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nvim-ipy&lt;/code&gt; plugin and configuring the QT Console such that it shows the results of remote commands.
IMHO the result looks quite acceptable:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/2018-04-10_Oni-and-QtConsole.png&quot; alt=&quot;Screenshot of working environment&quot; /&gt;&lt;/p&gt;

&lt;h3 id=&quot;running-qtconsole-from-vim-using-correct-kernel&quot;&gt;Running QtConsole from vim using correct kernel&lt;/h3&gt;
&lt;p&gt;The benefit of the QT Console is that the output of a command is directly visible, allowing interactive programming with intermediate plots and variable inspection.&lt;/p&gt;

&lt;p&gt;On macOS, the dependencies (mainly QT) of the QT Console can be installed via brew.
For other operating systems please refer to the &lt;a href=&quot;https://qtconsole.readthedocs.io/en/stable/index.html&quot;&gt;QT console documentation&lt;/a&gt;.
As I only use Python 3 for development on macOS, I got the QT Console up and running with the following commands:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;brew &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;sip &lt;span class=&quot;nt&quot;&gt;--without-python&lt;/span&gt;@2
brew &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;pyqt &lt;span class=&quot;nt&quot;&gt;--with-python3&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--without-python&lt;/span&gt;@2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For integration with vim I use the &lt;a href=&quot;https://github.com/bfredl/nvim-ipy&quot;&gt;nvim-ipy&lt;/a&gt; plugin, which can be installed using your favorite vim plugin manager (I personally use &lt;a href=&quot;https://github.com/junegunn/vim-plug&quot;&gt;vim-plug&lt;/a&gt;).
The following commands rely on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nvim-ipy&lt;/code&gt; being installed.
To allow the QT Console to be launched easily from within vim and with the correct kernel, I defined the following functions in my &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;init.vim&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-vimscript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;!&lt;/span&gt; GetKernelFromPipenv&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;a:kernel&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;tolower&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;system&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'basename $(pipenv --venv)'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
    &lt;span class=&quot;c&quot;&gt;&quot; Remove control characters (most importantly newline)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;substitute&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;a:kernel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'[[:cntrl:]]'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;''&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'g'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;endfunction&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;!&lt;/span&gt; ConnectToPipenvKernel&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;a:kernel&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; GetKernelFromPipenv&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;call&lt;/span&gt; IPyConnect&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'--kernel'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;a:kernel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'--no-window'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;endfunction&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;!&lt;/span&gt; AddFilepathToSyspath&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;a:filepath&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;expand&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'%:p:h'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;call&lt;/span&gt; IPyRun&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'import sys; sys.path.append(&quot;'&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;a:filepath&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'&quot;)'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    echo &lt;span class=&quot;s1&quot;&gt;'Added '&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;a:filepath&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;' to pythons sys.path'&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;endfunction&lt;/span&gt;

command&lt;span class=&quot;p&quot;&gt;!&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;-&lt;/span&gt;nargs&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;0&lt;/span&gt; ConnectToPipenvKernel &lt;span class=&quot;k&quot;&gt;call&lt;/span&gt; ConnectToPipenvKernel&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
command&lt;span class=&quot;p&quot;&gt;!&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;-&lt;/span&gt;nargs&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;0&lt;/span&gt; RunQtConsole &lt;span class=&quot;k&quot;&gt;call&lt;/span&gt; jobstart&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;jupyter qtconsole --existing&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
command&lt;span class=&quot;p&quot;&gt;!&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;-&lt;/span&gt;nargs&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;0&lt;/span&gt; AddFilepathToSyspath &lt;span class=&quot;k&quot;&gt;call&lt;/span&gt; AddFilepathToSyspath&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
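
&lt;p&gt;With these commands defined, one might bind them to mappings for quick access; the keys below are purely my own illustration, not something nvim-ipy provides:&lt;/p&gt;

```vim
" Hypothetical convenience mappings; pick keys that suit you
nnoremap &lt;leader&gt;ck :ConnectToPipenvKernel&lt;CR&gt;
nnoremap &lt;leader&gt;qc :RunQtConsole&lt;CR&gt;
nnoremap &lt;leader&gt;sp :AddFilepathToSyspath&lt;CR&gt;
```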

</description>
        <pubDate>Tue, 10 Apr 2018 17:00:00 +0000</pubDate>
        <link>https://ExpectationMax.github.io/2018/Neovim-pipenv-based-development-environment/</link>
        <guid isPermaLink="true">https://ExpectationMax.github.io/2018/Neovim-pipenv-based-development-environment/</guid>
        
        <category>development</category>
        
        
        
      </item>
    
  </channel>
</rss>
