ICLR 2020

  • ICLR 2020 is being held virtually for the first time; I volunteered to help test the conference services.
  • This is my first conference and I'm pretty excited about it.

Climate Change with ML Forecasts (Panel)

  • Think about Flood forecasting
    • Climacell, Boston
    • East Africa has seen unprecedented changes in climate, and the flooding disaster could have been handled better with local information and more data points across the region.
  • Cobuilding solutions with academic communities, users, local government bodies, working with stakeholders, experts on the problem.
  • Explainable AI
    • Causing problems with deployments because there is no trust.
    • Emerging market bodies
      • Trust breaks down when not working with local govt bodies and takes time to seek out the right people
  • https://www.climatechange.ai

Safe and Reliable Machine

Poly-Encoders: Architectures and Pretraining Strategies for Fast and Accurate Multi Sentence Scoring - https://openreview.net/pdf?id=SkxgnnNFvH

  • Rank Candidate Responses Quickly and Accurately
  • Model
    • Transformer Encoder
    • BERT -> MLM + NSP
  • Architecture
    • Bi-Encoder -> Two BERTs, Input and each candidate, get scores by dot product between them
    • Cross-encoder -> Concatenate the input and a candidate, then send them through a single BERT
  • Cross encoder has better performance, Bi-encoder is faster
  • Poly-Encoder
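A minimal sketch of the Bi-encoder scoring step noted above: the input and each candidate are encoded separately, so ranking reduces to one dot product per candidate. The vectors below are hand-written stand-ins for BERT outputs, and `rank_candidates` is a hypothetical helper name, not something from the paper.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def rank_candidates(context_vec, candidate_vecs):
    """Return (ranking, scores): candidate indices sorted best-first, plus raw scores."""
    scores = [dot(context_vec, cand) for cand in candidate_vecs]
    ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranking, scores

# Stand-ins for encoder outputs (assumption: a real system would use BERT embeddings).
candidates = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.0, 0.8]]
context = [0.5, 0.1, 0.9]
ranking, scores = rank_candidates(context, candidates)
```

Because candidate vectors do not depend on the input, they can be precomputed, which is where the Bi-encoder's speed advantage comes from. A Cross-encoder would have to re-run the model for every (input, candidate) pair.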

Cross-lingual Alignment vs Joint Training: A Comparative Study and A Simple Unified Framework - https://openreview.net/pdf?id=S1l-C0NtwS

  • Conduct Systematic studies of two popular cross-lingual word embeddings -> Cross Lingual Alignment vs Joint Training
  • Contributes a unified framework
  • Cross lingual embeddings mean words with similar meanings should be close to each other, to facilitate cross lingual knowledge transfer
    • Goal is to learn a shared vector space for multiple languages.
  • Cross Lingual
    • Train embeddings differently, obtain dictionary and learn a projection matrix
    • Problems of under sharing and no words are aligned
  • Joint training
    • All embeddings trained at the same time, obtains joint vocabulary and trains using concatenated mono-lingual corpora
    • Problem of over sharing and language specific word are poorly aligned
  • Unified Framework
    • set of shared embeddings obtained using unsupervised learning
    • Vocabulary reallocation
      • Words appearing exclusively in a language are reallocated and words appearing with similar frequency are kept at the same level.
    • Alignment Refinement
      • Off the shelf alignment methods used to refine alignments across the non sharing embedding sets.
  • Relative performance is task specific.
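The "obtain a dictionary and learn a projection matrix" step of cross-lingual alignment can be sketched as follows. This assumes the orthogonal Procrustes solution, which is one common choice and not necessarily the one used in the paper; the embeddings are random toy data and `learn_projection` is a hypothetical name.

```python
import numpy as np

def learn_projection(src, tgt):
    """Orthogonal W minimizing ||src @ W - tgt||_F (orthogonal Procrustes)."""
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt

rng = np.random.default_rng(0)
src = rng.normal(size=(50, 4))                        # source vectors of dictionary words
true_map, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # hidden orthogonal map (toy setup)
tgt = src @ true_map                                  # target vectors of their translations
w = learn_projection(src, tgt)
```

In this noise-free toy setup the learned `w` recovers the hidden map. With real embeddings the fit is only approximate, and only dictionary words are tied together by the projection.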

Generalization through Memorization: Nearest Neighbor Language Models - https://openreview.net/pdf?id=HklBjCEKvH

  • Asks the question: "Can explicitly memorizing the training data help generalization without the added cost of training?"
  • Can scale to larger text corpora and helps domain adaptation
  • Start with any neural autoregressive LM and feed it the test context; this gives a distribution over the next word and a fixed-size embedding for the context
    • This fixed sized representation is used to query the nearest neighbour datastore offline
  • Constructing the datastore
    • Forward pass over the examples with a PLM
    • This gives context representations over the examples and this gives the keys of the datastore and the targets are the values, to be used during inference.
  • The query is a test context representation; distances are calculated between it and a subset of keys from the datastore
    • Prune and keep the top k
    • Normalize the distances and aggregate over multiple occurrences of the same target to give the kNN distribution
    • Finally, interpolate the LM and kNN distributions to get the final kNN-LM distribution
  • Wikitext 103 best perplexity
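The retrieval and interpolation steps above can be sketched roughly as below. This is a toy version under stated assumptions: squared L2 distance, an exhaustive search instead of an approximate index, and an interpolation weight `lam` picked arbitrarily. The function names are hypothetical.

```python
import numpy as np

def knn_distribution(query, keys, values, vocab_size, k=3):
    """Softmax over negative distances of the k nearest keys, summed per target token."""
    dists = np.sum((keys - query) ** 2, axis=1)    # squared L2 distance to every key
    nearest = np.argsort(dists)[:k]                # prune to the top k neighbours
    weights = np.exp(-(dists[nearest] - dists[nearest].min()))
    weights /= weights.sum()                       # normalize over retrieved neighbours
    p_knn = np.zeros(vocab_size)
    np.add.at(p_knn, values[nearest], weights)     # aggregate repeated target tokens
    return p_knn

def interpolate(p_lm, p_knn, lam=0.25):
    """Final distribution: lam * p_knn + (1 - lam) * p_lm."""
    return lam * p_knn + (1.0 - lam) * p_lm

keys = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [0.0, 0.2]])  # context representations
values = np.array([2, 2, 0, 1])                                    # next-token id per key
p_lm = np.array([0.5, 0.3, 0.2])
p_knn = knn_distribution(np.array([0.0, 0.0]), keys, values, vocab_size=3)
p_final = interpolate(p_lm, p_knn)
```

Token 2 appears twice among the neighbours, so its weights are summed, and the interpolation raises its probability relative to the LM alone.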

Papers Missed -> https://openreview.net/pdf?id=HJe_Z04Yvr - Adjustable Real Time transfer

I would like to mention that evaluating tasks like abstractive document summarization is very difficult, because it's very hard to build a formal notion of what abstraction is, and current datasets (like CNN/Daily Mail and Newsroom) don't capture all aspects of the problem. While an objective evaluation is difficult, a variety of papers offer ways to interpret the model outputs. A variety of biases in datasets and metrics are discussed in this paper: https://www.aclweb.org/anthology/D19-1327.pdf.

  • Text Style Transfer -> Change the style of the sentence and keep the semantic meaning unchanged.
  • sentiment transfer, formality transfer, author imitation, machine translation -> other kind of text style transfer
  • Adversarial Loss + Auto Encoding loss (shen et al 2017, yang et al 2018)
  • Lample et al 2019
  • The model uses amortized variational inference
    • Subtly different unsupervised objective
    • SOTA and has probabilistic perspectives
  • Transduction assumption and LM model is trained over prior.

Plug and Play Language models: A simple approach to controlled text generation - https://openreview.net/pdf?id=H1edEyBKDS

  • Gpt2 generates generic sentences, the model can be steered to make sentiment based sentences
  • PPLM
    • a LM predicts the token distribution for the next token given the input context
    • Train a small attribute model which models positivity
    • Control degeneration by modeling p(x)
  • Language Detoxification
  • Combining attribute models
  • ULMFit, ELMO, GPT -> generate sequence one token at a time
  • BERT, XLNET, RoBERTa, ALBERT -> MLM pretraining, only learning from typically 15% of tokens which are masked out
  • Replaced Token Detection
    • Replace tokens with alternatives that are plausible in the context but don't exactly fit
    • Bidirectional learning from all positions
    • The replacements need to be generated by a generator trained jointly during training
  • Visual-Linguistic Tasks
  • The representations were earlier combined in a task specific way which is hard and time consuming
  • VL BERT
    • Task agnostic architecture
    • Visual-Linguistic pretraining
      • in addition to bert
      • Add image regions
      • Visual Feature embedding
    • Visual Linguistic Corpus: Conceptual Captions
    • Text-only Corpus
  • Pretraining tasks
    • MLM with visual clues (the word is masked in the sentence, and the image is given to predict the word)
    • Masked ROI classification with linguistic clues (a portion of the image is masked and the sentence is given to predict the masked area)
  • Works very well on VCR, VQA and COCO.
  • Pretraining procedure helps for better embeddings.

On the weakness of Reinforcement Learning for NMT - https://openreview.net/pdf?id=H1eCw3EKvH

  • In RL, training and inference based on system outputs. (no exposure bias)
    • In MT, First warmup with MLE and then use REINFORCE or MRT
    • MRT is not guaranteed to converge
  • Reinforcement with a constant reward of 1 is as good as with BLEU
  • Peakiness Effects
    • RL only learns from what it samples and with enough samples it'll maximize the reward
    • If the probability distribution is peaky, many samples are needed
      • But if it's already peaked, or if the model sees many "okay" rewards, the sampled prediction becomes more probable
    • A prediction which gets a high reward is made more probable.
  • MT is already peaked, and RL makes predictions even more peaked.
  • The most probable tokens are extremely probable
    • The top predictions hardly ever change.
  • 1] How much do you think your work translates to other applications of RL in NLP wherein sequence generation is involved? Is peaky-ness an issue specific to NMT?
  • 2] So you mention in the talk, that the problem has an inclination to output more probable words, what do you think would happen if there was a regularization objective which forced the model to consider other words as well. For eg: Unlikelihood training. { THIS MIGHT WORK - EXPERIMENT}
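To make the peakiness argument concrete, here is a small sketch of a single REINFORCE update on a categorical distribution over three tokens. It is not the paper's experimental setup; the logits, learning rate, and function names are assumptions for illustration. The gradient of log p(sampled) with respect to the logits is one_hot(sampled) - p, so any positive reward (even a constant 1) pushes probability toward whatever was sampled.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, sampled, reward, lr=0.5):
    """One REINFORCE ascent step: logits += lr * reward * (one_hot(sampled) - p)."""
    probs = softmax(logits)
    return [
        l + lr * reward * ((1.0 if i == sampled else 0.0) - p)
        for i, (l, p) in enumerate(zip(logits, probs))
    ]

logits = [2.0, 1.0, 0.0]      # already peaked on token 0
before = softmax(logits)
after = softmax(reinforce_step(logits, sampled=0, reward=1.0))
```

Since the most probable token is also the one most likely to be sampled, repeated updates like this one tend to make an already peaked distribution even peakier.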

Non Autoregressive for Dialogue State tracking - https://openreview.net/pdf?id=H1e_cC4twS

  • Limitations
    • Unable to detect unseen slot values
    • Tokens are generated one by one, expensive
    • Dependencies among slots are not explicitly learned
  • NADST
    • Fertility decoder
      • Model dependencies among (domain, slot) pairs
    • State decoder
      • Construct input sequence as (domain, slot) x fertility
    • Enables fast decoding of dialogue states

From Inference to Generation: End-to-end Fully Self-supervised Generation of Human Face from Speech - https://openreview.net/pdf?id=H1guaREYPr

  • Inference and generation of faces from speech
  • Unified cross-modal inference and generation framework that can be trained in self-supervised way
  • Training inference module using negative sampling
  • Train conditional GANs using trained inference networks
    • Relativistic Conditional loss introduced

ReCLOR : Reading Comprehension Dataset requiring logical reasoning - https://openreview.net/pdf?id=HJgJtT4tvB

  • Logical reasoning is an ability to examine, analyze and critically evaluate.
  • Logical Reasoning
    • 0% in MCTest and 1.2% in SQUAD
  • ReCLOR
    • 6138 questions
  • Different types of logical reasoning are incorporated

Lite Transformer with Long-Short Range Attention - https://arxiv.org/pdf/2004.11886.pdf

  • The FFN takes more than half of the computation
    • Not desirable
  • Flattened transformer block, attention takes major computation
  • Attention in the original transformer and then added the following block.
    • GLU-> Conv -> FC
  • Flattened the transformer block and LSRA introduced

2020 Vision - Ruha Benjamin

  • Technology is in the driver seat Assumption
    • But we come up and imagine with this
  • Power of ML
    • Think about Horizontal Form of Power
    • Personal Slaves "again", who is the beneficiary of the technology
  • Deep Learning
    • Think beyond computational depth; without historical and sociological depth, it is superficial
    • Do machine learning practitioners learn about the destructive uses of automation technologies?
      • What all other things are these things being used for
    • Structural Inequity
      • The bench example, discriminatory design, hostile design
        • Collectively respond to it
      • Can be built into technology as well
        • What spikes are we building into physical and digital infrastructures
  • Takeaways
    • Racism is productive, it constructs
    • Race and Technology are coproduced
    • imagination is a field of action
  • Citizen App
    • Race After Technology and Captivating Technology, books by Ruha Benjamin
    • How does race affect the way technology is created?
    • The californian app companies delivering apps to kenya (tala, branch companies) resulting into perpetual debt circuits
  • Duplicity of technological fixes (The sleepdealers clip)
  • The new jim
    • Engineered inequity
    • Default discrimination
      • Pro publica report on algorithm bias in parolees in a county
    • Coded exposure
      • Being Included or excluded in the wrong way
    • Techno-Benevolence
  • Advancing racial literacy in tech handbook
  • Ruined by Design By mike monteiro

IBM: NeuroSymbolic Hybrid AI

  • AI -> Machine Learning, DL, CV, NLP

    • General AI -> Revolutionary
    • Broad AI -> Disruptive and pervasive
    • Narrow AI -> Emerging
  • Broad AI, Multitask, MultiDomain, Distributed AI, Explainable (This is what this lab deals with)

  • Gatys et al 2015, Brock et al 2018, Karpathy et al 2015

  • Objects out of context make the model fail, Wang et al 2018

  • ObjectNet

  • Pin yu chen et al 2017 - Adversarial, Xu et al 2019

  • CLEVR Dataset

  • Symbolic AI

  • Neural-Symbolic AI (reading list)

    • Concept and Reasoning are usually entangled and hard for neural networks
    • NS-VQA (Yi et al 2018)
    • Neurosymbolic concept learner, ICLR 2019
    • Neurosymbolic Metaconcept Learner Neurips 2019
    • Dynamic Scenes Counterfactual ICLR 2020
    • CLEVRER
    • Neurosymbolic generative models srivastava et al 2020
    • Neurosymbolic NLU Naacl 2019
    • Neurosymbolic Code optimization shi et al 2019

READING

Improving NLG with Spectrum Control - https://openreview.net/pdf?id=ByxY8CNtvr

Unlikelihood Training

  • Followup work - https://arxiv.org/pdf/1911.03860.pdf
  • Sparse Text Generation - https://arxiv.org/pdf/2004.02644.pdf
  • Summarization Evaluation
    • Correlation with human judgements for different types of summarizations. We found the WMT human judgements data to be extremely useful for this work. I don't know that an equivalent resource exists for summarization. There's a bit for caption generation, but we are largely going in blind. Another issue is the diversity of correct outputs. It's very likely that there are so many correct outputs for summarization, many very different from others. This makes token matching evaluation a bit inappropriate. I am not a summarization expert by a long shot, so I might be wrong here. (Yoav artzi comments)
  • https://arxiv.org/abs/1904.09751 Look at Citations

Tools list -> https://docs.google.com/spreadsheets/d/1jvJXusSCqqnDQD35RGEknVdZf4Ay7O3oZGafQgqPctA/edit#gid=0

Playing the lottery with rewards and multiple languages: Lottery tickets in RL and NLP - https://openreview.net/pdf?id=S1xnXRVFwH

  • Lottery ticket hypothesis suggests that current initialization has substantial room to improve.
  • This paper shows the existence of this phenomenon in RL and NLP
  • Rewinding and Iterative pruning are both important for finding lottery tickets in NLP

Drawing Early Bird Tickets: Towards more efficient training of Deep Networks - https://openreview.net/pdf?id=BJxsrgStvr

  • Progressive Pruning and Training (J.Frankle, ICLR 2019)
  • Early Bird training proposed
    • Existence of early bird tickets discovered
    • propose a detector to find these early birds
      • Hamming distance based method
    • Efficient training via EB tickets
      • Reduced training cost by up to 80%
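A rough sketch of the Hamming-distance detector idea: compute a pruning mask at each epoch and stop searching once consecutive masks barely change. The per-unit scores, the pruning ratio, and the stopping threshold `0.1` are assumptions made for illustration, and the helper names are hypothetical.

```python
def pruning_mask(scores, prune_ratio):
    """Binary mask that prunes the smallest-magnitude entries (1 = kept, 0 = pruned)."""
    n_prune = int(len(scores) * prune_ratio)
    order = sorted(range(len(scores)), key=lambda i: abs(scores[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(scores))]

def mask_distance(mask_a, mask_b):
    """Normalized Hamming distance between two pruning masks."""
    return sum(a != b for a, b in zip(mask_a, mask_b)) / len(mask_a)

# Toy importance scores for six prunable units at three consecutive epochs.
epoch_1 = [0.9, 0.4, 0.5, 0.05, 0.2, 0.3]
epoch_2 = [0.8, 0.2, 0.6, 0.04, 0.9, 0.1]
epoch_3 = [0.85, 0.15, 0.65, 0.03, 0.95, 0.12]

masks = [pruning_mask(s, prune_ratio=0.5) for s in (epoch_1, epoch_2, epoch_3)]
d12 = mask_distance(masks[0], masks[1])
d23 = mask_distance(masks[1], masks[2])
found_early_bird = d23 < 0.1   # hypothetical threshold: the mask has stabilized
```

Here the mask still changes between the first two epochs but is identical between the second and third, which is the signal used to draw the ticket early.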

Deep Learning for Symbolic Maths - https://openreview.net/pdf?id=S1eZYeHFDS

  • An interesting insight

What can Neural Networks reason about? - https://openreview.net/pdf?id=rJxbJeHFPS

  • Need AI which reason about the world
  • Better "Algorithmic alignment" implies better generalization
    • Inductive bias of architectures formally defined
  • GNN similar to bellman ford algorithm i.e they align with the structure of the algorithm
  • GNN aligns with Dynamic programming as well.
  • GNN has limitations specifically on The subset sum (NP-hard problem)
  • Neural Exhaustive Search - Based on algorithm alignment achieves top performance on the task.
  • If
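For reference, the Bellman-Ford algorithm that GNNs are said to align with can be written so that the similarity to message passing is visible: each round, every node takes the minimum over "messages" from its in-neighbours. This is a plain textbook sketch in a synchronous form, with a made-up toy graph.

```python
def bellman_ford(n_nodes, edges, source):
    """Shortest distances from source; edges are (u, v, weight) triples."""
    inf = float("inf")
    dist = [inf] * n_nodes
    dist[source] = 0.0
    for _ in range(n_nodes - 1):              # one round ~ one message-passing step
        new_dist = list(dist)
        for u, v, w in edges:                 # "message" from u to v
            new_dist[v] = min(new_dist[v], dist[u] + w)   # min-aggregation at v
        dist = new_dist
    return dist

edges = [(0, 1, 4.0), (0, 2, 1.0), (2, 1, 2.0), (1, 3, 1.0)]
distances = bellman_ford(4, edges, source=0)
```

A GNN layer has the same shape (aggregate over neighbours, then update), so the network only needs to learn the simple per-edge relaxation step instead of the whole loop.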

A Mutual Information Maximization Perspective of Language Representation Learning - https://openreview.net/pdf?id=Syx79eBKwr

  • Mutual Information Maximization
  • InfoNCE is Used (logeswaran et al 2018, van den oord et al 2019)
  • InfoWord; Deep Infomax(DIM; Hjelm et al 2019)
  • Performs better on squad wrt original metrics [CHECK]
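A minimal sketch of the InfoNCE objective mentioned above, assuming dot-product scores and in-batch negatives (row i of `positives` is the positive for row i of `anchors`, every other row is a negative). The toy vectors are hand-made one-hot style embeddings, not model outputs.

```python
import numpy as np

def info_nce(anchors, positives):
    """Mean InfoNCE loss with in-batch negatives and dot-product scores."""
    scores = anchors @ positives.T                        # (batch, batch) similarities
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

views = 5.0 * np.eye(4)                      # four well-separated toy representations
aligned = info_nce(views, views)             # each positive matches its anchor
mismatched = info_nce(views, views[::-1])    # positives paired with the wrong anchors
```

The loss is close to zero when each anchor scores its own positive highest, and large when the pairing is wrong. Minimizing it therefore pulls the two views of the same item together relative to the negatives.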

Reformer: The Efficient Transformer - https://openreview.net/pdf?id=rkgNKkHtvB

Augmenting Non-Collaborative Dialog Systems with Explicit Semantic and Strategic Dialog History - https://openreview.net/pdf?id=ryxQuANKPB

  • Non-Collaborative -> Where the agents have different interest.
  • Negotiation Dialogue
  • Persuader dialogue
  • FST-DA, FST-S (FST -> Finite state transducer)
  • Training With Greedy Algorithm
  • Automatically trained FST

On the Relationship between Self-Attention and Convolutional Layers - https://openreview.net/pdf?id=HJlnC1rKPB

  • Transformer has done very well on NLP tasks.
  • The transformer has also done very well on Vision Tasks.
    • Bello et al 2019, Ramachandran et al 2019
  • Why does transformer work so well in vision?
    • This paper shows that MHSA (Multi head self attention) can express any convolution
  • Proved by Reparameterization of MHSA into any CNN layer.
  • 3 questions tackled by experiments
    • Is the theoretical reparametrization learnable (Yes)
    • could positional encoding be learned from scratch (Yes)
    • Does Full self attention on image perform convolution (Yes)

Encoding Word Order in complex embeddings - https://openreview.net/pdf?id=Hke-WTVtwr

  • Position Embeddings in Transformers
  • PE maps from a position index to a n-dimensional vector
  • Transformer talks about TPE (Trigonometric PE), a fixed position embedding
    • Can not be trained
    • TPE
  • Desiderata for word function
    • PFRD(Position free relative distance)
    • Boundedness
  • Their method is more "general"
  • In text classification, better performance wrt PE/ TPE.
  • In machine translation, better performance.
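For comparison, the fixed trigonometric position embedding (TPE) from the original Transformer can be sketched as below. This is the standard sinusoidal formula, not the complex-valued method proposed in this paper; `dim` is assumed to be even.

```python
import numpy as np

def sinusoidal_position_embedding(n_positions, dim):
    """Fixed sinusoidal embeddings: sin on even dimensions, cos on odd ones."""
    positions = np.arange(n_positions)[:, None]              # (n_positions, 1)
    freqs = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))     # (dim / 2,)
    pe = np.zeros((n_positions, dim))
    pe[:, 0::2] = np.sin(positions * freqs)
    pe[:, 1::2] = np.cos(positions * freqs)
    return pe

pe = sinusoidal_position_embedding(n_positions=50, dim=8)
```

Every entry is bounded in [-1, 1], and the dot product between two position vectors of this encoding depends only on their offset. Nothing here is trainable, which is the limitation noted above.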

Machine Learning: Changing the future of healthcare

  • https://www.youtube.com/watch?v=EVl5iMpX1cg, https://www.youtube.com/watch?v=RfK3D5dJV2Q
  • In the case of medicine, it's not a well-posed problem, so ML doesn't do very well. Solutions are not well defined and are very hard to verify.
  • Problem formulation
    • Many ways to conceive of a problem
    • Many ways to formalize each conception
    • input is limited and output is human based
  • ML-AIM predictor, AutoPrognosis [ICML 2018]
    • It needs to be interpretable
    • Therefore INVASE [ICLR 2019] was developed
    • Attentive State-Space [Neurips 2019], State of the art time forecast models
    • Deep Sensing [ICLR 2018]
    • GANITE, NSGP, Counterfactual RNNs
    • All of these have been developed for one system
  • Clinical Predictive Analytics
    • Automating the design of clinical predictive analytics
      • AutoML, let machine learning craft models for the variety of diseases, the variables and needs [Doesn't work well]
        • Don't capture uncertainty
        • limited to classification
        • Not interpretable
      • AutoPrognosis [Alaa & vds, ICML 2018] (A tool for crafting clinical scores)
      • Use Ensembles
        • To have uncertainty estimates
        • and to know the information loss
      • Each pipeline is a path of algorithms
        • Hard Learning and optimization problem
        • Bayesian Optimization doesn't work in this case because of high dimensions
        • Bayesian Optimization with structured kernel learning (Some algorithms have similar structure/ are correlated)
          • One gaussian process per set of algorithms
    • Interpretability and Explainability
      • Need transparency, risk understanding, avoid implicit bias, discovery
      • Explainability -> Tailored interpretability
        • For the clinician, researcher and patient need different kinds of explanations
      • Current methods are tailored to one type of interpretation
      • Need model independent and post hoc manner way of interpretations
      • INVASE handles most of these issues
      • Understand what features are important for the model and the disease
      • Produce a transparent risk equation
      • Demystify black-box models using symbolic metamodels [Neurips 2019]
        • Current black box models using symbolic metamodeling, outputs some equation.
      • Interpretability using symbolic metamodeling in practice [Neurips 2019]
      • Metamodels can be used in two ways: forward use, where the input is features and the output is risk, and backward use, where the input is the desired risk reduction and the output is features
    • Dynamic Forecasting
      • The markov assumption is harmful in clinical cases, history matters.
      • Clinical actionable models needed for patient level trajectory
        • Learn from complex data, clinical annotations, history etc.
      • Attentive state space models [2018,2019]
        • Combination of probabilistic structure of HMMs but use RNNs to model state dynamics
      • Who, when, what to screen? and how to screen ?
        • Deep Sensing [ICLR 2018]
        • Diseases atlas
    • Individualized treatments effects causal inference
      • GANITE, NSGP, Counterfactual RNNs
      • Counterfactuals aren't observed, so the problem is hard.
      • Alaa, van der Schaar JSTSP 2017, ICML 2018
        • A first theory for causal inference.
  • Theory Guides Model Design
  • www.vanderschaar-lab.com

Model Based Reinforcement Learning for atari - https://openreview.net/pdf?id=S1xCPJHtDB

  • Model Free RL has good results but require a lot of interaction, Model based has better sample complexity.
  • Novel architecture for video based prediction to be used as model.
    • SimPLE
    • The model is predicting the next frame, from the previous 4 frames
      • And this essentially builds a simulator of the game
  • Policy training with Random Starts
  • Mastering Atari, Go, chess and shogi by planning with a learned model

Differentiable Learning of numerical rules in knowledge graphs - https://openreview.net/pdf?id=rJleKgrKwS

  • Knowledge graph = Multi-graph with typed edges
    • In practice we might have additional edges
  • Learn (numerical) rules from KG and complete missing edges
  • NeuralLP for implementing numerical rule matching
  • Contribution of this paper
    • Efficient matrix-vector mult for numerical operators
    • Assume values are sorted by the permutation matrices.

AI systems that can see and talk

  • important to explore the intersection of vision and language because

    • Pushes boundaries
    • Vision: Beyond "bucketed" recognition, long-tail, low-shot
    • Language: Grounding, Reasoning
    • Processing across continuous(vision) + discrete(language),
      • low level(vision) + high level(language)
    • Control biases from both modalities
  • VQA (Visual Question Answering)

    • The VQA v2
  • Typical Model Architecture

    •     Question Encoding -> Attention         -> Classifier
                                   +
      
      Visual Feature extraction -> Multimodal fusion ->
    • Bottom-up, Top-down Anderson et al 2019
    • vqa.cloudcv.org
  • Challenges

    • Strong Language Priors, insufficient grounding
      • Karpathy & fei fei et al 2015
      • Neural baby talk 2018 CVPR 2018
        • Allows robust captioning
        • Novel object captioning
        • The above are critical for real-world problems
        • nocaps.org
    • Integrating Vision Modules
      • VQA
        • VQA-CP CVPR 2018
        • GVQA CVPR 2018
        • Difficult "questions" from challenge
        • All of the top models failed in OCR (Trend in 2018, 2017, 2016)
        • Integration of these technologies needed
        • TextVQA, (Look, Read, Reason, Answer, Paper)
        • CVPR Competition deadlines (mid may)
    • Generic visio-linguistic representations
      • Earlier there were visuo-linguistic task specific models.
      • VilBERT (Neurips 2019)
        • Pretrain with masked region and masked sentence
        • 5 Downstream tasks
        • Got SOTA
      • VilBERT Multi-Task (CVPR 2020)
  • Challenges

    • Centered around COCO
    • Incorporating External Knowledge,
      • FVQA, OK-VQA
    • Evaluation for downstream tasks; human-AI loop.
    • Grounding in real applications
    • VizWiz Gurari et al 2018
    • Other Languages
    • Studying Biases in VQA
      • Hendricks et al 2018, Gender Bias
      • ZHao et al 2018,
  • Time management

  • AI + Creativity

  • Climate change

AI + Creativity

https://marthawhite.github.io/mlbasics/notes.pdf https://marthawhite.github.io/mlcourse/notes.pdf

Neural Module Networks for Reasoning over Text - https://openreview.net/pdf?id=SygWvAVFPr

  • Multi Step reasoning needed to answer certain kind of questions.
  • Challenges
    • Question Understanding
    • Context Understanding
    • Perform Reasoning (in a differentiable manner)
  • Question passed into executable logical program like semantic parsing
  • Sent into Program Executor
    • Consists of Neural (learnable) modules
  • DROP

Depth Adaptive Transformer - www.openreview.net/pdf?id=SJg7KhVKPH

  • Models need lot of compute
  • Key ideas
    • Enable anytime prediction in the decoder
    • Plug in a halting mechanism to control the amount of computation of a token/sequence.
  • Training transformer with multiple output classifiers
  • Aligned training outperforms training via fixed states
  • Halting Mechanisms -> Sequence specific, token specific
    • An oracle is devised, followed by supervised learning of the oracle.
  • End to end training
    • Aligned training
    • Joint training with halting mechanism
  • reduce inference cost by up to 58%
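A small sketch of token-specific halting with one output classifier per decoder layer. It assumes a simple confidence threshold as the halting rule, which is a stand-in for illustration and not the paper's trained halting mechanism; the probabilities are made up.

```python
def exit_layer(layer_probs, threshold=0.9):
    """Index of the first layer whose top probability clears the threshold.

    layer_probs holds one output distribution per layer's classifier.
    Falls back to the last layer if no layer is confident enough.
    """
    for depth, probs in enumerate(layer_probs):
        if max(probs) >= threshold:
            return depth
    return len(layer_probs) - 1

# Hypothetical per-layer output distributions for two tokens (3 layers, 3-word vocab).
easy_token = [[0.5, 0.3, 0.2], [0.95, 0.03, 0.02], [0.99, 0.005, 0.005]]
hard_token = [[0.4, 0.3, 0.3], [0.5, 0.3, 0.2], [0.6, 0.3, 0.1]]
easy_exit = exit_layer(easy_token)
hard_exit = exit_layer(hard_token)
```

The easy token exits after the second layer while the hard token uses the full depth, which is how anytime prediction saves computation on average.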

The Early Phase of neural network training - https://openreview.net/pdf?id=Hkl1iRNFwS

  • What happens in the early phase of learning?
  • Experiments
    • Descriptive telemetry
    • To what extent is the state of the network related to the distribution of the weights
    • To what extent is the state of the network dependent on the data
  • Early -> 20 or more epochs into training
  • Weight Distribution
    • Trained accuracy is lower after shuffling
    • Early phase is not about weight distribution
  • Data dependence
    • Pre-train with aspects of the data removed
    • Blurring or self-supervision suffice
    • Early phase is data dependent

Neural Symbolic Reader (Nerd) - https://openreview.net/pdf?id=ryxjnREFwH

  • DROP introduced new challenges, so address those
  • All you need are not only PLMs.

Residual Energy Based Models for Text Generation - https://openreview.net/pdf?id=B1l4SgHKDH

Revisiting Self Training For Neural Sequence Generation - https://openreview.net/pdf?id=SJgdnAVKDH

  • WORKS VERY WELL ON LOW RESOURCE DATA REGIMES
  • Self Training
    • Supervised training
    • Unlabeled data -> Prediction
    • Supervised + Synthetic data to train student model
      • pseudo training on synthetic data
      • Fine tune on labeled data
  • How does self training work in Sequence generation
    • A synthetic case study on machine translation
  • Two Hypothesis (for why self training works)
    • Beam Search Decoding
    • Dropout
  • Dropout is crucial for self training to work
  • Role of Noise
    • Dropout has a smoothing effect on outputs
  • Noisy Self training on machine translation
  • Added noise is important for its success.
  • Future work -> Maybe there's an optimal noise which helps the training more than the rest.

Mixout: Effective Training method for PLMs - https://openreview.net/pdf?id=HkgaETNtDB

  • Dropout is a special case of dropconnect
  • Finetuned

Data Dependent Gaussian Prior Objective for Language Generation - https://openreview.net/pdf?id=S1efxTVYDr

  • https://www.aclweb.org/anthology/P16-1162.pdf - Sennrich et al paper
  • Introduce a D2GPo approach
    • MLE fails to assign proper scores to different incorrect model outputs
    • All incorrect outputs are treated equally during training
    • Generations are dull, generic, repetitive, and short sighted
    • A ground truth token wise distribution is considered
    • L2 Regularization -> data independent gaussian prior
  • Problems in NLG
    • Exposure Bias
    • Loss mismatch during training
    • Generation diversity (Sordoni et al 2019, serban et al 2016, li et al 2015a)
    • Negative diversity ignorance (this work, Unlikelihood training)
  • Data independent gaussian prior
    • The Bayesian view is that it is not enough to just use the data; prior knowledge should be added
    • Add constraints to the model parameters to prevent over fitting
  • DDGPo
    • MLE
    • Straightforward and meets the principle of ERM (empirical risk minimization)
    • Noise in the training data, cannot reach good generalization
    • Introduced a general evaluation function

Ensemble Distribution Distillation

Pay Attention to Features, Transfer Learn Faster CNNs

  • GAN + Image Translation
  • Issues
    • The network structure or hyper parameter setting needs to be fine tuned for the specific dataset
    • The translated image often fail to keep the content features of the original image
    • The quality is poor
  • Proposed
    • Attention Module
    • Adaptive layer instance normalization (AdaLIN)
    • Fix the network structure regardless of the dataset
    • keep the content features
    • The quality is decent
  • Generator with an Attention Module then sent into AdaLIN
  • Decoder

Tree Structure Attention with Hierarchical Accumulation - https://openreview.net/pdf?id=HJxK5pEYvr

  • The underlying construction process of language is hierarchical
  • Transformers prefer linear form
    • Linear form allows easier and efficient computations
  • Dedicated models are recurrent, recursive which is inefficient
  • Introduce Tree Transformer
    • Hierarchical attention to encode parse tree structures into self attention at constant time

Conditional Learning of Fair Representations - https://openreview.net/pdf?id=Hkekl0NFPr

  • Why are the models inflated on the GPU?

Reflection from Turing Award Winners

Yann LeCun

  • The future of AI is self supervised
  • How do humans and animals learn?
  • 3 Challenges
    • Labeled data
    • Learning to reason, b
    • Learning to plan complex action sequences
  • Self Supervised Learning
    • Filling in the blanks
    • Predict the invisible from the visible
  • Inference and Multimodal predictions through constraint relaxation
  • Energy Based Models
    • Gradient based inference
    • Conditional and unconditional versions
    • MultiModal Output: latent variable EBM
    • Training
      • Contrastive
        • MLE (estimating energy is not exactly a good idea), So try
          • General Margin Loss, hinge pair loss, ranking loss
        • Use a pair of points, Noise contrastive methods[BERT/ ROBERTA, Denoising AE/Masked AE]
          • DeepFace, PIRL, MoCO, SimCLR - Contrastive embeddings [Siamese nets, metric learning]
        • Gans
      • Regularized/Architectural method
        • Regularize the volume of the low energy region
        • K-means, GMMs, PCA, Bottleneck AE, VQVAE
        • Temporal Regularization methods
          • Temporal invariance, minimal curvature o. henaff 2019 et al
        • VAE + Dropout, Yann LeCun et al
          • Used to train forward model of the world
  • Contrastive methods, regularized latent variable methods
  • Energy Based Self Supervised Learning
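The hinge pair loss listed under contrastive training can be sketched as follows. The squared-distance energy is a toy assumption (a real energy function would be a learned network), and the margin value is arbitrary.

```python
def energy(x, y):
    """Toy energy: squared distance, low when y is compatible with x."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def hinge_pair_loss(x, y_good, y_bad, margin=1.0):
    """Zero once the bad sample's energy exceeds the good one's by at least the margin."""
    return max(0.0, margin + energy(x, y_good) - energy(x, y_bad))

x = [1.0, 2.0]
loss_far = hinge_pair_loss(x, y_good=[1.0, 2.1], y_bad=[4.0, 6.0])
loss_near = hinge_pair_loss(x, y_good=[1.0, 2.1], y_bad=[1.0, 2.5])
```

A contrastive sample that is already far up the energy surface contributes nothing, while one that sits too close to the data point still produces a loss that pushes its energy up.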

Yoshua Bengio

  • System 1 vs System 2 Cognition
    • System 1(implicit knowledge) - Current DL
    • System 2(explicit knowledge) - Future DL
  • Inductive priors which could go in deep learning
    • Sparse factor graph in space of high level semantic variables
    • Semantic variables are causal
    • Distributional changes due to localized causal interventions
  • Systematic generalization
    • Dynamically recombine existing concepts
    • Lake et al 2015, Lake & baroni et al, CLOSURE on CLEVR
  • Thoughts, consciousness and language
  • Consciousness prior -> Sparse factor graph Bengio 2017
  • Changing Glasses
  • Recurrent Independent mechanisms Goyal et al 2019
  • DL should capture system 2 knowledge
    • Need to have consciousness priors

The decision making side of machine learning

  • Machine learning

    • First Generation - the backend
    • Second generation - the human side
    • third generation - pattern recognition
    • fourth generation(emerging) - markets
      • game theory, market design
  • Decisions

    • It's not a matter of a threshold
    • Real world decisions with consequences
    • set of decisions across a network
    • set of decisions across a network over time
    • decisions when there is scarcity and competition
  • Markets

    • Decentralized Algorithms
    • Complex Tasks
    • Adaptive, robust, scalable
  • Recommendation systems

    • is it ok to recommend the same movie, book to everyone
    • is it ok to recommend the same restaurant to everyone
    • is it ok to recommend the same street to every driver
    • is it ok to recommend the same stock to purchase to everyone
  • Create a market

    • A two way market between consumers and producers
    • the use of recsys via data analysis is key
    • Eg: Music in the data age
      • No economic value being exchanged between producers and consumers
    • Consumers and producers are linked
  • Social Consequences

    • By creating a market based on the data flows, new jobs are created
  • Intersection of ML and econ

    • Multi way markets in which the individual agents need to explore to learn their preference
    • Inferential methods
    • latent variable
    • data collection in competitive settings
  • Competing bandits in matching markets liu, mania et al

    • Upper confidence bound algorithm (UCB)
    • Matching Markets
      • Stable match
    • What if the participants in the market do not know their preferences, but observe utilities through noisy interactions
    • Bandit market
  • UCB meets reinforcement learning jin, zeyuan et al

    • is q learning provably efficient
    • Q learning with UCB
  • Anytime control of the false discovery rate ramdas et al

    • True nulls and non nulls
      • Do nothing with true nulls
    • Care about false discovery rate
      • FDR can be larger than per test error rate
  • Ray: Distributed platform for emerging decision focused ai application

    • github.com/ray-project/ray
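The upper confidence bound (UCB) rule that both bandit papers above build on can be sketched with the classic single-agent UCB1 algorithm. This is not the matching-market or Q-learning variant from the talk. The arm payoffs are hypothetical, and rewards are made deterministic here to keep the example reproducible, whereas a real bandit would return noisy rewards.

```python
import math

def ucb1(pull, n_arms, n_rounds):
    """Run UCB1 and return how often each arm was pulled.

    pull(arm) must return a reward in [0, 1].
    """
    counts = [0] * n_arms
    totals = [0.0] * n_arms
    for t in range(1, n_rounds + 1):
        if t <= n_arms:
            arm = t - 1                        # pull every arm once to initialize
        else:
            arm = max(
                range(n_arms),
                key=lambda a: totals[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        totals[arm] += pull(arm)
        counts[arm] += 1
    return counts

mean_reward = [0.2, 0.5, 0.9]                  # hypothetical arm payoffs
counts = ucb1(lambda arm: mean_reward[arm], n_arms=3, n_rounds=500)
```

The exploration bonus shrinks as an arm is pulled more often, so weaker arms are still sampled occasionally but most pulls go to the best arm.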

A Latent Morphology Model for Open-Vocabulary Neural Machine Translation - www.openreview.net/pdf?id=BJxSI1SKDH

  • Addresses vocabulary limitation in statistical NMT
  • Subword Segmentation: Byte Pair Encoding, Sennrich et al 2016
    • Arbitrary heuristics need to be tuned
  • Compositional word representations
    • use a bi-rnn (ling et al 2015, luong and manning 2016)
    • Morphological rules (Vania and lopez, gul sahin and steedman, 2018)
  • Model morphological inflection as sampling from a categorical distribution
  • Each word is represented by two latent variables
  • Inductive biases (ideally model should be unsupervised)
    • Learning the variables is modeled as a compression task
    • inflection of features should be conditioned on the lemma
  • https://openreview.net/pdf?id=Hyl7ygStwB
  • BERT has made great success on NLU tasks
  • Preliminary exploration
    • Encoder -> Decoder, either of them can be BERT/XLM
    • Use pretrained models as input to the NMT model
  • Directly using BERT to initialize the NMT model doesn't lead to good results
  • Leveraging the output of bert as embeddings works
  • BERT-fused NMT
    • Transformer
  • A drop-net trick (to prevent the network from overfitting to specific network module)

Neural Machine Translation with Universal Visual Representation - https://openreview.net/pdf?id=Byl8hhNYPS

  • Motivation: Annotation difficulty and limited diversity
  • Multi30k -> Transform sentence-image pairs into topic-image lookup table
  • Encoder (transformer + Resnet) -> Aggregation -> Decoder (Transformer)
  • Visual representations help
  • Modest number of pairs would be beneficial
  • Ablation of encoder done as well, doesn't really matter.
  • Why does it work?
    • Content connection of the sentences and images
    • Topic aware co-occurrence of similar images and sentences
    • Highlights: Universal and Diverse.

Reducing Transformer Depth on Demand with Structured Dropout - www.openreview.net/pdf?id=SylO2yStDr

  • Transformers are overparameterized and redundant
  • LayerDrop
    • No finetuning required
    • training speed
    • Layerdrop is an effective regularizer
    • Layerdrop for pruning
    • Robust to parameter setting
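A toy sketch of the LayerDrop idea: during training each residual layer is skipped with some probability, and at inference a shallower model is obtained by simply keeping a subset of layers. The "layers" here are constant functions, and keeping every other layer is just one possible pruning choice, assumed for illustration.

```python
import random

def forward(x, layers, drop_prob=0.0, rng=None):
    """Apply residual layers in order, skipping each with probability drop_prob."""
    rng = rng or random.Random(0)
    for layer in layers:
        if rng.random() < drop_prob:
            continue                   # dropped layer: the residual path passes x through
        x = x + layer(x)
    return x

def prune_every_other(layers):
    """Inference-time pruning: keep layers 0, 2, 4, ..."""
    return layers[::2]

layers = [lambda x, k=k: k for k in range(1, 7)]   # toy layers adding constants 1..6
full = forward(0, layers)                          # all six layers applied
pruned = forward(0, prune_every_other(layers))     # only the layers adding 1, 3, 5
all_dropped = forward(0, layers, drop_prob=1.0)    # every layer skipped
```

Because the network has already seen missing layers during training, the pruned sub-network can be used directly, which matches the "no finetuning required" note above.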

Evaluation

Explain Your Move: Understanding Agent Actions Using Focused Feature Saliency