ICLR 2020

ICLR 2020, is being held for the first time virtually, I volunteered for the conference in testing the services.
This is my first conference and I'm pretty excited about it.

Climate Change with ML Forecasts (Panel)

Think about Flood forecasting
- Climacell, Boston
- East Africa has seen unprecedented changes in climate, and the disaster of the flooding would've been handled if there were local information and more data points across the place.
Cobuilding solutions with academic communities, users, local government bodies, working with stakeholders, experts on the problem.
Explainable AI
- Causing problems with deployments because there is no trust.
- Emerging market bodies
  - Trust breaks down when not working with local govt bodies and takes time to seek out the right people
https://www.climatechange.ai
- Tackling climate change with machine learning https://arxiv.org/pdf/1906.05433.pdf

Safe and Reliable Machine

https://iclr.cc/virtual/workshops_6.html

Poly-Encoders: Architectures and Pretraining Strategies for Fast and Accurate Multi Sentence Scoring - https://openreview.net/pdf?id=SkxgnnNFvH

Rank Candidate Responses Quickly and Accurately
Model
- Transformer Encoder
- BERT -> MLM + NSP
Architecture
- Bi-Encoder -> Two BERTs, Input and each candidate, get scores by dot product between them
- Cross-encoder -> Concatenate and then add send through BERT
Cross encoder has better performance, Bi-encoder is faster
Poly-Encoder

Cross-lingual Alignment vs Joint Training: A Comparative Study and A Simple Unified Framework - https://openreview.net/pdf?id=S1l-C0NtwS

Conduct Systematic studies of two popular cross-lingual word embeddings -> Cross Lingual Alignment vs Joint Training
Contributes a unified framework
Cross lingual embeddings mean words with similar meanings should be close to each other, to facilitate cross lingual knowledge transfer
- Goal is to learn a shared vector space for multiple languages.
Cross Lingual
- Train embeddings differently, obtain dictionary and learn a projection matrix
- Problems of under sharing and no words are aligned
Joint training
- All embeddings trained at the same time, obtains joint vocabulary and trains using concatenated mono-lingual corpora
- Problem of over sharing and language specific word are poorly aligned
Unified Framework
- set of shared embeddings obtained using unsupervised learning
- Vocabulary reallocation
  - Words appearing exclusively in a language are reallocated and words appearing with similar frequency are kept at the same level.
- Alignment Refinement
  - Off the shelf alignment methods used to refine alignments across the non sharing embedding sets.
Relative performance is task specific.

Generalization through Memorization: Nearest Neighbor Language Models - https://openreview.net/pdf?id=HklBjCEKvH

Asks the question, "Can explicitly memoryizing the training data helps generalization without the added cost of training"
Can scale to larger text corpora and helps domain adaptation
Start with any Neural autoregressive LM, feed it our test contest, this gives a distribution over the next word and a fixed sized embedding for the context
- This fixed sized representation is used to query the nearest neighbour datastore offline
Constructing the datastore
- Forward pass over the examples with a PLM
- This gives context representations over the examples and this gives the keys of the datastore and the targets are the values, to be used during inference.
Query is a test context representations and the differences are calculated between it and a subset of keys from the datastore
- Prune and keep the top k
- normalize the distances and aggregate over multiple occurences to give knn distribution
- Last interpolate LM and knn components to get final KNN distribution
Wikitext 103 best perplexity

Papers Missed -> https://openreview.net/pdf?id=HJe_Z04Yvr - Adjustable Real Time transfer

Would like to mention that evaluating tasks like document summarization (abstractive) is a very difficult task because its very hard to build a formal notion of what abstraction is, and that current datasets (like CNN/Daily Mail and Newsroom) dont really capture all aspects of the problem. While an objective evaluation is pretty difficult, there have been a variety of papers with a way to interpret the model outputs. A variety of biases in datasets and metrics were mentioned in this paper https://www.aclweb.org/anthology/D19-1327.pdf.

A Probabilistic Formulation of Unsupervised Text Transfer - https://openreview.net/pdf?id=HJlA0C4tPS, https://github.com/cindyxinyiwang/deep-latent-sequence-model

Text Style Transfer -> Change the style of the sentence and keep the semantic meaning unchanged.
sentiment transfer, formality transfer, author imitation, machine translation -> other kind of text style transfer
Adversarial Loss + Auto Encoding loss (shen et al 2017, yang et al 2018)
Lample et al 2019
The model uses amortized variational inference
- Subtly different unsupervised objective
- SOTA and has probabilistic perspectives
Transduction assumption and LM model is trained over prior.

Plug and Play Language models: A simple approach to controlled text generation - https://openreview.net/pdf?id=H1edEyBKDS

Gpt2 generates generic sentences, the model can be steered to make sentiment based sentences
PPLM
- a LM predicts the token distribution for the next token given the input context
- Train a small attribute model which models positivity
- Control degeneration by modeling p(x)
Language Detoxification
Combining attribute models

ELECTRA - https://openreview.net/pdf?id=r1xMH1BtvB

ULMFit, ELMO, GPT -> generate sequence one token at a time
BERT, XLNET, RoBERTa, ALBERT -> MLM pretraining, only learning from typically 15% of tokens which are masked out
Replaced Token Detection
- Replace tokens which sort of fit into the context but dont exactly fit into the context
- Bidirectional learning from all positions
- The replacements need to be generated by a generator trained jointly during training

VL-BERT - https://iclr.cc/virtual/poster_SygXPaEYvH.html

Visual-Linguistic Tasks
The representations were earlier combined in a task specific way which is hard and time consuming
VL BERT
- Task agnostic architecture
- Visual-Linguistic pretraining
  - in addition to bert
  - Add image regions
  - Visual Feature embedding
- Visual Linguistic Corpus: Conceptual Captions
- Text-only Corpus
Pretraining tasks
- MLM with visual clues (the word is masked in the sentence, and the image is given to predict the word)
- Masked ROI classfication with linguistic clues (a portion of the image is masked and the sentence is given to predict the masked area)
Works very well on VCR, VQA and COCO.
Pretraining procedure helps for better embeddings.

On the weakness of Reinforcement Learning for NMT - https://openreview.net/pdf?id=H1eCw3EKvH

In RL, training and inference based on system outputs. (no exposure bias)
- In MT, First warmup with MLE and then use REINFORCE or MRT
- MRT is not guaranteed to converge
Reinforcement with constant 1 is as good as with BLUE
Peakiness Effects
- RL only learns from what it samples and with enough samples it'll maximize the reward
- If the probability distribution peaky, many sample are needed
  - But if its already peaked or if a model sees many "okay" rewards, it'll become more probable
- A prediction which gets high reward, its made more probable.
MT is already peaked, and RL makes predictions even more peaked.
The most probable tokens are extremely probable
- The top predictions hardly ever change.
1] How much do you think your work translates to other applications of RL in NLP wherein sequence generation is involved? Is peaky-ness an issue specific to NMT?
2] So you mention in the talk, that the problem has an inclination to output more probable words, what do you think would happen if there was a regularization objective which forced the model to consider other words as well. For eg: Unlikelihood training. { THIS MIGHT WORK - EXPERIMENT}

Non Autoregressive for Dialogue State tracking - https://openreview.net/pdf?id=H1e_cC4twS

Limitations
- Unable to detect unseen slot values
- Tokens are generated one by one, expensive
- Dependencies among slots are not explicitly learned
NADST
- Fertility decoder
  - Model dependencies among (domain, slot) pairs
- State decoder
  - Construct input sequence as (domain, slot) x fertility
- Enables fast decoding of dialogue states

From Inference to Generation: End-to-end Fully Self-supervised Generation of Human Face from Speech - https://openreview.net/pdf?id=H1guaREYPr

Inference and generation of faces from speech
Unified cross-modal inference and generation framework that can be trained in self-supervised way
Training inference module using negative sampling
Train conditional GANs using trained inference networks
- Relativisitic Conditional loss introduced

ReCLOR : Reading Comprehension Dataset requiring logical reasoning - https://openreview.net/pdf?id=HJgJtT4tvB

Logical reasoning is an ability to examine, analyze and critically evaluate.
Logical Reasoning
- 0% in MCTest and 1.2% in SQUAD
ReCLOR
- 6138 question
Different types of logical reasoning are incorporated

Lite Transformer with Long-Short Range Attention - https://arxiv.org/pdf/2004.11886.pdf

The FFN takes more than half of the computation
- Not desirable
Flattened transformer block, attention takes major computation
Attention in the original transformer and then added the following block.
- GLU-> Conv -> FC
Flattened the transformer block and LSRA introduced

2020 Vision - Ruha Benjamin

Technology is in the driver seat Assumption
- But we come up and imagine with this
Power of ML
- Think about Horizontal Form of Power
- Personal Slaves "again", who is the beneficiary of the technology
Deep Learning
- Think beyond computational depth and without historical and sociologicl depth is superficial
- Do machine learning practitioner learn about technologies destructive usages of automation
  - What all other things are these things being used for
- Structural Inequity
  - The bench example, discriminatory design, hostile design
    - Collectively respond to it
  - Can be built into technology as well
    - What spikes are we building into physical and digital infrastructures
Takeaways
- Racism is productive, it constructs
- Race and Technology are coproduced
- imagination is a field of action
Citizen App
- Race after Technology, Captivating technology books by ruha benjamin
- How are race affecting coming up with technology
- The californian app companies delivering apps to kenya (tala, branch companies) resulting into perpetual debt circuits
Duplicity of technological fixes (The sleepdealers clip)
The new jim
- Engineered inequity
- Default discrimination
  - Pro publica report on algorithm bias in parolees in a county
- Coded exposure
  - Being Included or excluded in the wrong way
- Techno-Benevolence
Advancing racial literacy in tech handbook
Ruined by Design By mike monteiro

IBM: NeuroSymbolic Hybrid AI

AI -> Machine Learning, DL, CV, NLP
- General AI -> Revolutionary
- Broad AI -> Disruptive and pervasive
- Narrow AI -> Emerging
Broad AI, Multitask, MultiDomain, Distributed AI, Explainable (This is what this lab deals with)
Gatys et al 2015, Brock et al 2018, Karpathy et al 2015
Objects out of context make the model fail, Wang et al 2018
ObjectNet
Pin yu chen et al 2017 - Adversarial, Xu et al 2019
CLEVR Dataset
Symbolic AI
Neural-Symbolic AI (reading list)
- Concept and Reasoning are usually entangled and hard for neural networks
- NS-VQA (Yi et al 2018)
- Neurosymbolic concept learner, ICLR 2019
- Neurosymbolic Metaconcept Learner Neurips 2019
- Dynamic Scenes Conterfactual ICLR 2020
- CLEVERER
- Neurosymbolic generative models srivastava et al 2020
- Neurosymbolic NLU Naacl 2019
- Neurosymbolic Code optimization shi et al 2019

READING

BERTScore - https://openreview.net/pdf?id=SkeHuCVFDr, https://arxiv.org/abs/1904.09675 (Check citations)

Improving NLG with Spectrum Control - https://openreview.net/pdf?id=ByxY8CNtvr

Unlikelihood Training

Followup work - https://arxiv.org/pdf/1911.03860.pdf
Sparse Text Generation - https://arxiv.org/pdf/2004.02644.pdf
Summarization Evaluation
- Correlation with human judgements for different types of summarizations. We found the WMT human judgements data to be extremely useful for this work. I don't know that an equivalent resource exists for summarization. There's a bit for caption generation, but we are largely going in blind. Another issue is the diversity of correct outputs. It's very likely that there are so many correct outputs for summarization, many very different from others. This makes token matching evaluation a bit inappropriate. I am not a summarization expert by a long shot, so I might be wrong here. (Yoav artzi comments)
https://arxiv.org/abs/1904.09751 Look at Citations

Tools list -> https://docs.google.com/spreadsheets/d/1jvJXusSCqqnDQD35RGEknVdZf4Ay7O3oZGafQgqPctA/edit#gid=0

Playing the lottery with rewards and multiple languages: Lottery tickets in RL and NLP - https://openreview.net/pdf?id=S1xnXRVFwH

Lottery ticket hypothesis suggests that current initialization has substantial room to improve.
This paper shows the existence of this phenomena in RL and NLP
Rewinding and Iterative pruning are both important for finding lottery tickets in NLP

Drawing Early Bird Tickets: Towards more efficient training of Deep Networks - https://openreview.net/pdf?id=BJxsrgStvr

Progressive Pruning and Training (J.Frankle, ICLR 2019)
Early Bird training proposed
- Existence of early bird tickets discovered
- propose a detector to find these early birds
  - Hamming distance based method
- Efficient training via EB tackets
  - Reduced training cost by upto 80%

Deep Learning for Symbolic Maths - https://openreview.net/pdf?id=S1eZYeHFDS

An interesting insight

What can Neural Networks reason about? - https://openreview.net/pdf?id=rJxbJeHFPS

Need AI which reason about the world
Better "Algorithmic alignment" implies better generalization
- Inductive bias of architectures formally defined
GNN similar to bellman ford algorithm i.e they align with the structure of the algorithm
GNN aligns with Dynamic programming as well.
GNN has limitations specifically on The subset sum (NP-hard problem)
Neural Exhaustive Search - Based on algorithm alignment achieves top performance on the task.
If

A Mutual Information Maximization Perspective of Language Representation Learning - https://openreview.net/pdf?id=Syx79eBKwr

Mutual Information Maximization
InfoNCE is Used (logeswaran et al 2018, van den oord et al 2019)
InfoWord; Deep Infomax(DIM; Hjelm et al 2019)
Performs better on squad wrt original metrics [CHECK]

Reformer: The Efficient Transformer - https://openreview.net/pdf?id=rkgNKkHtvB

https://iclr.cc/virtual/poster_rkgNKkHtvB.html

Augmenting Non-Collaborative Dialog Systems with Explicit Semantic and Strategic Dialog History - https://openreview.net/pdf?id=ryxQuANKPB

Non-Collaborative -> Where the agents have different interest.
Negotiation Dialogue
Persuader dialogue
FST-DA, FST-S (FST -> Finite state transducer)
Training With Greedy Algorithm
Automatically trained FST

On the Relationship between Self-Attention and Convolutional Layers - https://openreview.net/pdf?id=HJlnC1rKPB

Transformer has done very well on NLP tasks.
The transformer has also done very well on Vision Tasks.
- Bello et al 2019, Ramachandran et al 2019
Why does transformer work so well in vision?
- This paper shows that MHSA (Multi head self attention) can express any convolution
Proved by Reparameterization of MHSA into any CNN layer.
3 questions tackled by experiments
- Is the theoretical reparametrization learnable (Yes)
- could positional encoding be learned from scratch (Yes)
- Does Full self attention on image perform convolution (Yes)

Encoding Word Order in complex embeddings - https://openreview.net/pdf?id=Hke-WTVtwr

Position Embeddings in Transformers
PE maps from a position index to a n-dimensional vector
Transformer talks about TPE(Trignometric PE) fixed position embedding
- Can not be trained
- TPE
Desiderata for word function
- PFRD(Position free relative distance)
- Boundedness
Their method is more "general"
In text classification, better performance wrt PE/ TPE.
In machine translation, better performance.

Machine Learning: Changing the future of healthcare

https://www.youtube.com/watch?v=EVl5iMpX1cg, https://www.youtube.com/watch?v=RfK3D5dJV2Q
In the case of medicine, its not a well posed problem, so ML doesnt do very well. solutions are not well defined and solutions are very hard to verify.
Problem formulation
- Many ways to conceive of a problem
- Many ways to formalize each conception
- input is limited and output is human based
ML-AIM predictor, AutoPrognosis [ICML 2018]
- It needs to be interpretable
- Therefor INVASE [ICLR 2019] was developed
- Attentive State-Space [Neurips 2019], State of the art time forecast models
- Deep Sensing [ICLR 2018]
- GANITE, NSGP, Counterfactual RNNs
- All of these have been developed for one system
Clinical Predictive Analytics
- Automating the design of clinical predictive analytics
  - AutoML, let machine learning craft models for the variety of diseases, the variables and needs [Doesn't work well]
    - Dont capture uncertainty
    - limited to classification
    - Not interpretable
  - AutoPrognosis [Alaa & vds, ICML 2018] (A tool for crafting clinical scores)
  - Use Ensembles
    - To have uncertainty estimates
    - and to know the information loss
  - Each pipeline is a path of algorithms
    - Hard Learning and optimization problem
    - Bayesian Optimization doesnt work in this case because of high dimensions
    - Bayesian Optimization with structured kernel learning (Some algorithms have similar structure/ are correlated)
      - One gaussian process per set of algorithms
- Interpretability and Explainability
  - Need transparency, risk understanding, avoid implicit bias, discovery
  - Explainability -> Tailored interpretability
    - For the clinician, researcher and patient need different kinds of explanations
  - Current methods are tailored to one type of interpretation
  - Need model independent and post hoc manner way of interpretations
  - INVASE handles most of these issues
  - Understand what features are important for the model and the disease
  - Produce a transparent risk equation
  - Demystify black-box models using symbolic metamodels [Neurips 2019]
    - Current black box models using symbolic metamodeling, outputs some equation.
  - Interpretability using symbolic metamodeling in practice [Neurips 2019]
  - Metamodels can be used in two ways, Forward use, input is features and output is risk, and backward use, where input is desired risk reduction and output is featues
- Dynamic Forecasting
  - The markov assumption is harmful in clinical cases, history matters.
  - Clinical actionable models needed for patient level trajectory
    - Learn from complex data, clinical annotations, history etc.
  - Attentive state space models [2018,2019]
    - Combination of probabilistic structure of HMMs but use RNNs to model state dynamics
  - Who, when, what to screen? and how to screen ?
    - Deep Sensing [ICLR 2018]
    - Diseases atlas
- Individualized treatments effects causal inference
  - GANITE, NSGP, Counterfactual RNNs
  - Conterfactuals arent observed so the problem is hard.
  - Alaa, vander scharr JSTSP 2017, ICML 2018
    - A first theory for causal inference.
Theory Guides Model Design
www.vanderschaar-lab.com

Model Based Reinforcement Learning for atari - https://openreview.net/pdf?id=S1xCPJHtDB

Model Free RL has good results but require a lot of interaction, Model based has better sample complexity.
Novel architecture for video based prediction to be used as model.
- SimPLE
- The model is predicting the next frame, from the previous 4 frames
  - And this essentially builds a simulator of the game
Policy training with Random Starts
Mastering Atari, Go, chess and shogi by planning with a learned model

Differentiable Learning of numerical rules in knowledge graphs - https://openreview.net/pdf?id=rJleKgrKwS

Knowledge graph = Multi-graph with typed edges
- In practice we might have additional edges
Learn (numerical) rules from KG and complete missing edges
NeuralLP to impleting numerical rule matching
Contribution of this paper
- Efficient matrix-vector mult for numerical operators
- Assume values are sorted by the permutation matrices.

AI systems that can see and talk

important to explore the intersection of vision and language because
- Pushes boundaries
- Vision: Beyond "bucketed" recognition, long-tail, low-shot
- Language: Grounding, Reasoning
- Processing across continuous(vision) + discrete(language),
  - low level(vision) + high level(language)
- Control biases from both modalities
VQA (Visual Question Answering)
- The VQA v2
Typical Model Architecture
- ```
    Question Encoding -> Attention         -> Classifier
                             +
```
  Visual Feature extraction -> Multimodal fusion ->
- Bottom-up, Top-down Anderson et al 2019
- vqa.cloudcv.org
Challenges
- Strong Language Priors, insufficient grounding
  - Karpathy & fei fei et al 2015
  - Neural baby talk 2018 CVPR 2018
    - Allows robust captioning
    - Novel object captioning
    - The above are critical for real-world problems
    - nocaps.org
- Integrating Vision Modules
  - VQA - VQA-CP CVPR 2018 - GVQA CVPR 2018 - Difficult "questions" from challenge - All of the top models failed in OCR (Trend in 2018, 2017, 2016) - Integration of these technologies needed - TextVQA, (Look, Read, Reason, Answer, Paper)
    - CVPR Competition deadlines (mid may)
- Generic visio-linguistic representations
  - Earlier there were visuo-linguistic task specific models.
  - VilBERT (Neurips 2019)
    - Pretrain with masked region and masked sentence
    - 5 Downstream tasks
    - Got SOTA
  - VilBERT Multi-Task (CVPR 2020)
Challenges
- Centered around COCO
- Incorporating External Knowledge,
  - FVQA, OK-VQA
- Evaluation for downstream tasks; human-AI loop.
- Grounding in real applications
- VizWiz Gurari et al 2018
- Other Languages
- Studying Biases in VQA
  - Hendricks et al 2018, Gender Bias
  - ZHao et al 2018,
Time managment
AI + Creativity
Climate change

AI + Creativty

AI and Art lab, Rutgers university
Robbie Barrat: https://twitter.com/videodrome
"How three French students used borrowed code to put the first AI portrait in Christie's" https://www.theverge.com/2018/10/23/18013190/ai-art-portrait-auction-christies-belamy-obvious-robbie-barrat-gans
Helena Sarin: https://twitter.com/glagolista
Perceptual Engines: https://medium.com/artists-and-machine-intelligence/perception-engines-8a46bc598d57 “Everybody Dance Now”: https://www.youtube.com/watch?v=PCBTZh41Ris&fbclid=IwAR3LrOkO8jHB3h6Nn4RpDygke2EfC4wF3g0ArM1x_GbUx3iD53YUnx1e9x4 AI Dungeon: https://aidungeon.io/ Google Magenta: https://research.google/teams/brain/magenta/ https://magenta.tensorflow.org/ CAN (Creative Adversarial Networks): https://arxiv.org/abs/1706.07068 Style Transfer "A Neural Algorithm of Artistic Style": https://arxiv.org/abs/1508.06576 https://www.artnome.com/news/2018/11/14/helena-sarin-why-bigger-isnt-always-better-with-gans-and-ai-art

https://marthawhite.github.io/mlbasics/notes.pdf https://marthawhite.github.io/mlcourse/notes.pdf

Neural Module Networks for Reasoning over Text - https://openreview.net/pdf?id=SygWvAVFPr

Multi Step reasoning needed to answer certain kind of questions.
Challenges
- Question Understanding
- Context Understanding
- Perform Reasoning (in a differentiable manner)
Question passed into executable logical program like semantic parsing
Sent into Program Executor
- Consists of Neural (learnable) modules
DROP

Depth Adaptive Transformer - www.openreview.net/pdf?id=SJg7KhVKPH

Models need lot of compute
Key ideas
- Enable anytime prediction in the decoder
- Plug in a halting mechanism to control the amount of computation of a token/sequence.
Training transformer with multiple output classifiers
Aligned training ouperforms training via fixed states
Halting Mechanisms -> Sequence specific, token specific
- Oracle devised supervised learing of the oracle.
End to end training
- Aligned training
- Joint training with halting mechanism
reduce inference cost by up to 58%

The Early Phase of neural network training - https://openreview.net/pdf?id=Hkl1iRNFwS

What happen in the early phase of learning?
Experiments
- Descriptive telemetry
- To what extent is the state of the network related to the distribution of the weights
- To what extent is the state of the network dependent on the data
Early -> 20 or more epochs into training
Weight Distribution
- Trained accuracy is lower after shuffling
- Early phase is not about weight distribution
Data dependence
- Pre-train with aspects of the data removed
- Blurring or self-supervision suffice
- Early phase is data dependent

Neural Symbolic Reader (Nerd) - https://openreview.net/pdf?id=ryxjnREFwH

DROP introduced new challenges, so address those
All you need are not only PLMs.

Residual Energy Based Models for Text Generation - https://openreview.net/pdf?id=B1l4SgHKDH

Revisiting Self Training For Neural Sequence Generation - https://openreview.net/pdf?id=SJgdnAVKDH

WORKS VERY WELL ON LOW RESOURCE DATA REGIMES
Self Training
- Supervised training
- Unlabeled data -> Prediction
- Supervised + Synthetic data to train student model
  - pseudo training on synthetic data
  - Fine tune on labeled data
How does self training work in Sequence generation
- A synthetic case study on machine translation
Two Hypothesis (for why self training works)
- Beam Search Decoding
- Dropout
Dropout is crucial for self training to work
Role of Noise
- Dropout has a smoothing effect on outputs
Noisy Self training on machine translation
Added noise is important for its success.
Future work -> Maybe there's an optimal noise which helps the training more than the rest.

Mixout: Effective Training method for PLMs - https://openreview.net/pdf?id=HkgaETNtDB

Dropout is a special case of dropconnect
Finetuned

Data Dependent Gaussian Prior Objective for Language Generation = https://openreview.net/pdf?id=S1efxTVYDr

https://www.aclweb.org/anthology/P16-1162.pdf - Senreich et al paper
Introduce a D2GPo approach
- MLE fails to assign proper scores to different incorrect model outputs
- All incorrect outputs are treated equally during training
- Generations are dull, generic, repetitive, and short sighted
- A ground truth token wise distribution is considered
- L2 Regularization -> data independent gaussian prior
Problems in NLG
- Exposure Bias
- Loss missmatch during training
- Generation diveristy (Sordoni et al 2019, serban et al 2016, li et al 2015a)
- Negative diversity ignorance (this work, Unlikelihood training)
Data independent gaussian prior
- THe bayesian view is that it is not enough to just use the data but to add prior knowledge
- Add constraints to the model parameters to prevent over fitting
DDGPo
- MLE
- Straightforward and meets the principle of ERM (empirical risk minimization)
- Noise in the training data, cannot reach good generalization
- Introduced a general evaluation function

Ensemble Distribution Distillation

https://iclr.cc/virtual/poster_BygSP6Vtvr.html

Pay Attention to Features, Transfer Learn Faster CNNs

https://iclr.cc/virtual/poster_ryxyCeHtPB.html

U-GAT-IT - https://openreview.net/pdf?id=BJlZ5ySKPH

GAN + Image Translation
Issues
- The network structure or hyper parameter setting needs to be fine tuned for the specific dataset
- The translated image often fail to keep the content features of the original image
- The quality is poor
Proposed
- Attention Module
- Adaptive layer instance normalization (AdaLIN)
- Fix the network structure regardless of the dataset
- keep the content features
- The quality is decent
Generator with an Attention Module then sent into AdaLIN
Decoder

Tree Structure Attention with Hierarchical Accumulation - https://openreview.net/pdf?id=HJxK5pEYvr

The underlying of construction progress of langugae is hierarchical
Transformers prefer linear form
- Linear form allows easier and efficient computations
Dedicated models are recurrent, recursive which is inefficient
Introduce Tree Transformer
- Hierarchical attention to encode parse tress structures into self attention at constant time

Conditional Learning of Fair Representations - https://openreview.net/pdf?id=Hkekl0NFPr

Albert - https://github.com/google-research/ALBERT, https://openreview.net/pdf?id=H1eA7AEtvS

Why are the models inflated on the GPU?

TabFact - https://openreview.net/pdf?id=rkeJRhNYDH

Reflection from Turing Award Winners

Yann le cun

The future of AI is self supervised
How do humans and animals learn?
3 Challenges
- Labeled data
- Learning to reason, b
- Learning to plan complex action sequences
Self Supervised Learning
- Filling in the blanks
- Predict the invisible from the visible
Inference and Multimodal predictions through constraint relaxation
Energy Based Models
- Gradient based inference
- Conditional and uncondtionals versions
- MultiModal Output: latent variable EBM
- Training
  - Contrastive
    - MLE (estimating energy is not exactly a good idea), So try
      - General Margin Loss, hinge pair loss, ranking loss
    - Use a pair of points, Noise contrastive methods[BERT/ ROBERTA, Denoising AE/Masked AE]
      - DeepFace, PIRL, MoCO, SimCLR - Contrastive embeddings [Siamese nets, metric learning]
    - Gans
  - Regularized/Architectural method
    - Regularize the volume of the low energy region
    - K-means, GMMs, PCA, Bottleneck AE, VQVAE
    - Temporal Regularization methods
      - Temporal invariance, minimal curvature o. henaff 2019 et al
    - VAE + Droput yan le cunn et al
      - Used to train forward model of the world
Contrastive methods, regularized latent variable methods
Energy Based Self Supervised Learning

Yoshua Bengio

System 1 vs System 2 Cognition
- System 1(implicit knowledge) - Current DL
- System 2(explicit knowledge) - Future DL
Inductive priors which could go in deep learning
- Sparse factor graph in space of high level semantic variables
- Semantic variables are causal
- Distributional changes due to localized causal interventions
Systematic generalization
- Dynamicaly recombine existing concepts
- Lake et al 2015, Lake & baroni et al, CLOSURE on CLEVR
Thoughts, consciousness and language
Consciousness prior -> Sparse factor graph Bengio 2017
Changing Glasses
Recurrent Independent mechanisms Goyal et al 2019
DL should capture system 2 knoledge
- Need to have consciousness priors

The decision making side of machine learning

Machine learning
- First Generation - the backend
- Second generation - the human side
- third generation - pattern recognition
- fourth generation(emerging) - markets
  - game theory, market design
Decisions
- Its not a matter of a threshold
- Real world decisions with consequences
- set of decisions across a network
- set of decisions across a network over time
- decisions when there is scarcity and competition
Markets
- Decentralized Algorithms
- Complex Taks
- Adaptive, robust, scalable
Recommendation systems
- is it ok to recommend the same movie, book to everyone
- is it ok to recommend the same restuarant to everyone
- is it ok to recommend the same street to every driver
- is it ok to recommend the same stock to purchase to everyone
Create a market
- A two way market between consumers and producers
- the use of recsys via data analysis is key
- Eg: Music in the data age
  - No economic value being exchanged between producers and consumers
- Consumers and producers are linked
Social Consequences
- By creating a market based on the data flows, new jobs are created
Intersection of ML and econ
- Multi way markets in which the individual agents need to explore to learn their preference
- Inferential methods
- latent variable
- data collection in competitive settings
Competeing bandits in matching markets liu, mania et al
- Upper confidence bound algorithm (UCB)
- Matching Markets
  - Stable match
- What if the participants in the market do not know their preferences, but observe utilities through noisy interactions
- Bandit market
UCB meets reinforcement learning jin, zeyuan et al
- is q learning provably efficient
- Q learning with UCB
Anytime control of the false discovery rate ramdas et al
- True nulls and non nulls
  - Do nothing with true nulls
- Care about false discovery rate
  - FDR can be larger than per test error rate
Ray: Distributed platform for emerging decision focused ai application
- github.com/ray-project/ray

A Latent Morphology Model for Open-Vocabulary Neural Machine Translation - www.openreview.net/pdf?id=BJxSI1SKDH

Addresses vocabulary limitation in statistical NMT
Subword Segmentation: Byte Pair Encoding, Sennrich et al 2016
- Arbitrary heuristics need to be tuned
Compostional word representations
- use a bi-rnn (ling et al 2015, luong and manning 2016)
- Morphological rules (Vania and lopez, gul sahin and steedman, 2018)
Model morphological infletion as sampling from a categorical distribution
Each word is represented two latent variables
Inductive biases (ideally model should be unsupervised)
- Learning the variables is modeled as a compression task
- inflection of features should be conditioned on the lemma

Incorporating BERT into NMT - https://github.com/bert-nmt/bert-nmt, https://openreview.net/pdf?id=Hyl7ygStwB

https://openreview.net/pdf?id=Hyl7ygStwB
BERT has made great success on NLU tasks
Preliminary exploration
- Encoder -> Decoder, either of them can be BERT/XLM
- Use pretrained models as input to the NMT model
Directly using to intialize doesnt lead to good results
Leveraging the output of bert as embeddings works
BERT-fused NMT
- Transformer
A drop-net trick (to prevent the network from overfitting to specific network module)

Neural Machine Translation with Universal Visual Representation - https://openreview.net/pdf?id=Byl8hhNYPS

Motivation: Annotation difficulty and limited diversity
Multi30k -> Transform sentence-image pairs into topic-image lookup table
Encoder (transformer + Resnet) -> Aggregation -> Decoder (Transformer)
Visual representations helps
Modest number of pairs would be beneficial
Ablation of encoder done as well, doesnt really matter.
Why does it work?
- Content connection of the sentences and images
- Topic aware co-occurence of similar images and sentences
- Highlights: Universal and Diverse.

Reducing Transformer Depth on Demand with Structured Dropout - www.openreview.net/pdf?id=SylO2yStDr

Transformers are overparameterized and redundant
LayerDrop
- No finetuning required
- training speed
- Layerdrop is an effective regularizer
- Layerdrop for pruning
- Robust to parameter setting

Evaluation

NMT + Summarization + Structured Prediction
Not generalizable over domain
Causal
- https://arxiv.org/pdf/2004.02709.pdf

TabFact https://iclr.cc/virtual/poster_rkeJRhNYDH.html

Explain Your Move: Understanding Agent Actions Using Focused Feature Saliency

Files

ICLR2020.md

Latest commit

History