Schedule

DISCLAIMER: The workshop papers haven't been reviewed for correctness.
* identifies the presenter.

videos of the workshop

Workshop room: Westin: Alpine DE

07:30-07:50 Opening address: themes of the workshop, terminology, open questions
Simon Lacoste-Julien, Percy Liang, Guillaume Bouchard
video
07:50-08:20 Invited talk: Generative and Discriminative Models in Statistical Parsing
Michael Collins (MIT)
video
08:20-08:40 Generative and Discriminative Latent Variable Grammars
Slav Petrov (Google Research)
video | paper
08:40-09:00 Discriminative and Generative Views of Binary Experiments
Mark D. Reid, Robert C. Williamson* (Australian National University)
video | paper
09:00-09:30 coffee break
09:30-10:00 Invited talk: Multi-Task Discriminative Estimation for Generative Models and Probabilities
Tony Jebara (Columbia University)
video
10:00- Poster Session
SKI / DISCUSSION BREAK
15:50-16:20 Invited talk: Generative and Discriminative Image Models
John Winn (Microsoft Research Cambridge)
video
16:20-16:40 Learning Feature Hierarchies by Learning Deep Generative Models
Ruslan Salakhutdinov (MIT)
video | paper
16:40-17:00 Why does Unsupervised Pre-training Help Deep Discriminant Learning?
Dumitru Erhan*, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent (Université de Montréal)
video | paper
17:00-17:30 coffee break
17:30-17:50 Unsupervised Learning by Discriminating Data from Artificial Noise
Michael Gutmann*, Aapo Hyvärinen (University of Helsinki)
video | paper
17:50-18:45 Panel Discussion
18:45-18:50 wrap-up

Invited Talks


Generative and Discriminative Models in Statistical Parsing
Michael Collins (MIT)
Since the earliest work on statistical parsing, a constant theme has been the development of discriminative and generative models with complementary strengths. In this talk I’ll give a brief history of discriminative and generative models in statistical parsing, focusing on strengths and weaknesses of the various models. I’ll start with early work on discriminative history-based models (in particular, the SPATTER parser), move through early discriminative and generative models based on lexicalized (dependency) representations, and then turn to recent work on conditional-random-field-based models. Finally, I’ll describe research on semi-supervised approaches that combine discriminative and generative models.


Multi-Task Discriminative Estimation for Generative Models and Probabilities
Tony Jebara (Columbia University)
Maximum entropy discrimination is a method for estimating distributions such that they meet classification constraints and perform accurate prediction. These distributions are over parameters of a classifier, for instance, log-linear prediction models or log-likelihood ratios of generative models. Many of the resulting optimization problems are convex programs and sometimes just simple quadratic programs. In multi-task settings, several discrimination constraints are available from many tasks, which potentially produces even better discrimination. This advantage manifests itself if some parameter tying is involved, for instance, via multi-task sparsity assumptions. Using new variational bounds, it is possible to implement the multi-task variants as (sequential) quadratic programs or as sequential versions of the independent discrimination problems. In these settings, it is possible to show that multi-task discrimination requires no more than a constant increase in computation over independent single-task discrimination.
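As a rough illustration of the underlying optimization (notation chosen here for exposition, not taken from the talk), the core maximum entropy discrimination problem seeks a distribution over classifier parameters Θ and margins γ that is as close as possible to a prior P_0 while satisfying the classification constraints:

\min_{P(\Theta,\gamma)} \; \mathrm{KL}\bigl(P(\Theta,\gamma) \,\|\, P_0(\Theta,\gamma)\bigr)
\quad \text{s.t.} \quad
\int P(\Theta,\gamma)\,\bigl[\, y_t\, F(x_t;\Theta) - \gamma_t \,\bigr]\, d\Theta\, d\gamma \;\ge\; 0 \quad \forall t,

where F(x; Θ) is the discriminant, e.g. a log-linear prediction model or a log-likelihood ratio of two generative models; the dual is a convex program, in simple cases a quadratic program.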


Generative and Discriminative Image Models
John Winn (Microsoft Research Cambridge)
Creating a good probabilistic model for images is a challenging task, due to the large variability in natural images. For general photographs, an ideal generative model would have to cope with scene layout, occlusion, variability in object appearance, variability in object position and 3D rotation, and illumination effects like shading and shadows. The formidable challenges in creating such a model have led many researchers to pursue discriminative models, which instead use image features that are largely invariant to many of these sources of variability. In this talk, I will compare the two approaches, describe some strengths and weaknesses of each, and suggest some directions in which the best aspects of both can be combined.

Contributed Talks


Generative and Discriminative Latent Variable Grammars
Slav Petrov (Google Research)
Latent variable grammars take an observed (coarse) treebank and induce more fine-grained grammar categories that are better suited for modeling the syntax of natural languages. Estimation can be done in a generative or a discriminative framework and results in the best published parsing accuracies over a wide range of syntactically divergent languages and domains. In this paper we highlight the commonalities and the differences between the two learning paradigms and speculate that a hybrid approach might outperform either one.
paper
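As a rough sketch of the two estimation objectives being compared (notation chosen here, not taken from the paper): with coarse treebank trees T_i over sentences w_i, and latent annotations t ranging over the refinements τ(T) of a coarse tree T,

\hat{\theta}_{\text{gen}} = \arg\max_{\theta} \sum_i \log \sum_{t \in \tau(T_i)} P_\theta(w_i, t),
\qquad
\hat{\theta}_{\text{disc}} = \arg\max_{\theta} \sum_i \log \frac{\sum_{t \in \tau(T_i)} P_\theta(w_i, t)}{\sum_{T'} \sum_{t' \in \tau(T')} P_\theta(w_i, t')},

where T' ranges over all parses of w_i. Both objectives marginalize over the latent annotations; the discriminative one additionally normalizes over all trees, which is what makes it more expensive to optimize.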


Discriminative and Generative Views of Binary Experiments
Mark D. Reid, Robert C. Williamson (Australian National University and NICTA)
We consider binary experiments (supervised learning problems where there are two different labels) and explore formal relationships between two views of them, which we call “generative” and “discriminative”. The discriminative perspective involves an expected loss. The generative perspective (in our sense) involves the distances between class-conditional distributions. We extend known results to the class of all proper losses (scoring rules) and all f-divergences as distances between distributions. We also sketch how one can derive the SVM and MMD algorithms from the generative perspective.
paper
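A concrete special case of the correspondence being generalized (stated informally here, not taken from the paper): with class-conditional distributions P and Q and equal class priors, the Bayes risk of the 0-1 loss is determined by the variational (total variation) divergence,

L^{*}_{0\text{-}1} \;=\; \tfrac{1}{2}\bigl(1 - V(P,Q)\bigr), \qquad V(P,Q) \;=\; \tfrac{1}{2}\int \lvert p(x) - q(x) \rvert \, dx,

and, more generally, the reduction in expected loss obtainable from the data (for any proper loss) can be written as an f-divergence I_f(P,Q) = \int q(x)\, f\!\bigl(p(x)/q(x)\bigr)\, dx between the class-conditional distributions.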


Learning Feature Hierarchies by Learning Deep Generative Models
Ruslan Salakhutdinov (MIT)
In this paper we present several ideas based on learning deep generative models from high-dimensional, richly structured sensory input. We will exploit the following two key properties: First, we show that deep generative models can be learned efficiently from large amounts of unlabeled data. Second, they can be discriminatively fine-tuned using the standard backpropagation algorithm. Our results reveal that the learned high-level feature representations capture a lot of structure in the unlabeled input data, which is useful for subsequent discriminative tasks, such as classification or regression, even though these tasks are unknown when the deep generative model is being trained.
paper
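A minimal sketch of the two-stage idea, using scikit-learn's BernoulliRBM as a stand-in for a single layer of generative pre-training (the paper stacks several such layers into a deep generative model and then fine-tunes all weights jointly with backpropagation, which this sketch omits):

```python
# Sketch: unsupervised feature learning with an RBM, followed by a
# discriminative classifier trained on the learned features.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import minmax_scale
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)
X = minmax_scale(X)  # the RBM expects inputs in [0, 1]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = Pipeline([
    # Generative stage: learn binary hidden features from the pixels alone.
    ("rbm", BernoulliRBM(n_components=100, learning_rate=0.06,
                         n_iter=20, random_state=0)),
    # Discriminative stage: classify using the learned feature representation.
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_tr, y_tr)
print("test accuracy:", model.score(X_te, y_te))
```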


Why does Unsupervised Pre-training Help Deep Discriminant Learning?
Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent (Université de Montréal)
Recent research has been devoted to learning algorithms for deep architectures such as Deep Belief Networks and stacks of auto-encoder variants, with impressive results obtained in several areas. The best results obtained on supervised learning tasks involve an unsupervised learning component, usually in an unsupervised pre-training phase, with a generative model. Even though these new algorithms have enabled training deep models fine-tuned with a discriminant criterion, many questions remain as to the nature of this difficult learning problem. The main question investigated here is the following: why does unsupervised pre-training work and why does it work so well? Answering these questions is important if learning in deep architectures is to be further improved. We propose several explanatory hypotheses and test them through extensive simulations. We empirically show the influence of unsupervised pre-training with respect to architecture depth, model capacity, and number of training examples. The experiments confirm and clarify the advantage of unsupervised pre-training. The results suggest that unsupervised pre-training guides the learning towards basins of attraction of minima that are better in terms of the underlying data distribution; the evidence from these results supports an unusual regularization explanation for the effect of pre-training.
paper


Unsupervised Learning by Discriminating Data from Artificial Noise
Michael Gutmann, Aapo Hyvärinen (University of Helsinki)
Noise-contrastive estimation is a new estimation principle that we have developed for parameterized statistical models. The idea is to train a classifier to discriminate between the observed data and some artificially generated noise, using the model's log-density function in a logistic regression function. It can be proven that this leads to a consistent (convergent) estimator of the parameters. The method is shown to work directly for models where the density function does not integrate to unity (unnormalized models). The normalization constant (partition function) can be estimated like any other parameter. We compare the method with other methods that can be used to estimate unnormalized models, including score matching, contrastive divergence, and maximum likelihood where the correct normalization is estimated with importance sampling. Simulations show that noise-contrastive estimation offers the best trade-off between computational and statistical efficiency. The method is then applied to the modeling of natural images.
paper
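A minimal sketch of the estimation principle on a toy problem (model and notation chosen here, not taken from the paper): fit an unnormalized 1-D Gaussian, treating the log-normalizer as a free parameter, by discriminating the data from Gaussian noise with a logistic-regression-style objective:

```python
# Sketch of noise-contrastive estimation for an unnormalized 1-D Gaussian.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=0.5, size=5000)                   # observed data
noise = rng.normal(loc=x.mean(), scale=x.std(), size=x.size)    # artificial noise

def log_model(u, theta):
    mu, log_prec, c = theta   # c plays the role of the (negative) log partition function
    return -0.5 * np.exp(log_prec) * (u - mu) ** 2 + c

def log_noise(u):
    return norm.logpdf(u, loc=x.mean(), scale=x.std())

def neg_objective(theta):
    # G(u) = log p_model(u) - log p_noise(u); classify data vs. noise via sigmoid(G).
    G_data = log_model(x, theta) - log_noise(x)
    G_noise = log_model(noise, theta) - log_noise(noise)
    # log sigmoid(G) and log(1 - sigmoid(G)), written in a numerically stable form
    ll = -np.logaddexp(0.0, -G_data).sum() - np.logaddexp(0.0, G_noise).sum()
    return -ll

mu_hat, log_prec_hat, c_hat = minimize(neg_objective, x0=np.zeros(3),
                                       method="L-BFGS-B").x
print("estimated mean:", mu_hat, "estimated std:", np.exp(-0.5 * log_prec_hat))
```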

Posters


Inferring Meta-covariates Via an Integrated Generative and Discriminative Model
Keith J. Harris, Lisa Hopcroft, Mark Girolami (University of Glasgow)
This paper develops an alternative method for analysing high-dimensional data sets that combines model-based clustering and multiclass classification. By averaging the covariates within the clusters obtained from model-based clustering, we define “meta-covariates” and use them to build a multinomial probit regression model, thereby selecting clusters of similarly behaving covariates and aiding interpretation. This simultaneous learning task is accomplished by a variational EM algorithm that optimises a joint distribution which rewards good performance at both classification and clustering. We explore the performance of our methodology on a well-known leukaemia dataset and use the Gene Ontology to interpret our results.
paper
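A rough sketch of the meta-covariate construction (the paper fits the clustering and a multinomial probit classifier jointly with a variational EM algorithm; here the two steps are simply run in sequence, with k-means and multinomial logistic regression as stand-ins):

```python
# Sketch: cluster the covariates, average within clusters to form meta-covariates,
# then train a multiclass classifier on the meta-covariates.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def make_meta_covariates(X, labels, n_clusters):
    # Meta-covariate k = average of the covariates assigned to cluster k.
    return np.column_stack([X[:, labels == k].mean(axis=1)
                            for k in range(n_clusters)])

# Synthetic stand-in data: 100 samples, 500 covariates, 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 500))
y = rng.integers(0, 3, size=100)

n_clusters = 10
labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X.T)
meta = make_meta_covariates(X, labels, n_clusters)      # shape (100, 10)
clf = LogisticRegression(max_iter=1000).fit(meta, y)    # multinomial classifier
print("training accuracy:", clf.score(meta, y))
```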


Naïve Bayes vs. Logistic Regression: An Assessment of the Impact of the Misclassification Cost
Vidit Jain (University of Massachusetts Amherst)
Recent advances in the asymptotic characterization of generative and discriminative learning have suggested several ways to develop more effective hybrid models. Applying these suggested approaches to a practical problem domain remains non-trivial, perhaps due to the violation of various underlying assumptions. One common assumption is that of equal misclassification costs, or at least the ability to estimate such costs. Here, we investigate the effect of the misclassification cost on the comparison between naïve Bayes and logistic regression. To assess the utility of this comparison for practical domains, we include a comparison of mean average precision values for our experiments. We present the empirical comparison patterns on the LETOR data set to solicit support from related theoretical results.
paper
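For reference, the way an unequal misclassification cost enters such a comparison (standard decision theory, not specific to the paper): with false-positive cost c_{FP} and false-negative cost c_{FN}, the cost-minimizing rule thresholds the posterior at c_{FP}/(c_{FP}+c_{FN}) rather than at 1/2,

\hat{y}(x) = 1 \;\iff\; P(y=1 \mid x) \;\ge\; \frac{c_{FP}}{c_{FP} + c_{FN}},

so naive Bayes and logistic regression, which estimate P(y=1|x) differently, can compare quite differently once the assumed costs (and hence the threshold) change.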


Hybrid Model of Conditional Random Field and Support Vector Machine
Qinfeng Shi, Mark Reid, Tiberio Caetano (Australian National University and NICTA)
It is known that probabilistic models often converge to the true distribution asymptotically (i.e., they are Fisher consistent). However, this consistency is often of little use in practice, since in the real world it is impossible to fit the models with infinitely many data points in finite time. The SVM is Fisher inconsistent in the multiclass and structured-label case; however, it does provide a PAC bound on the true error (known as a generalization bound). Is there a model that is Fisher consistent for classification and has a generalization bound? We use a naive combination of the two models, simply taking a weighted sum of their losses. This yields a surprising theoretical result: the hybrid loss can be Fisher consistent in some circumstances, and it has a PAC-Bayes bound on its true error.
paper
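One natural way to write the weighted combination described above (notation chosen here; the paper's exact weighting and notation may differ): for a score function f(x, y) over structured labels and a label cost Δ(y, y'),

\ell_{\text{hybrid}}(x,y;f) \;=\; \alpha\Bigl[\log \textstyle\sum_{y'} e^{f(x,y')} - f(x,y)\Bigr] \;+\; (1-\alpha)\Bigl[\max_{y'}\bigl(\Delta(y,y') + f(x,y')\bigr) - f(x,y)\Bigr],

where the first bracket is the CRF (log) loss, the second is the structured hinge loss of the SVM, and α ∈ [0, 1] trades off the Fisher-consistent term against the one that admits PAC-style bounds.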


Weighting Priors for Hybrid Learning Principles
Jens Keilwagen (IPK Gatersleben), Jan Grau, Stefan Posch, Marc Strickert, Ivo Grosse (University of Halle-Wittenberg)
The recognition of functional binding sites in genomic DNA remains one of the fundamental challenges of genome research. During the last decades, a plethora of different and well-adapted models has been developed, but only little attention has been paid to the development of different and similarly well-adapted learning principles. Only recently was it noticed that discriminative learning principles can be superior to generative ones in diverse bioinformatics applications, too. Here, we propose a generalization of generative and discriminative learning principles containing the maximum likelihood, maximum a-posteriori, maximum conditional likelihood, maximum supervised posterior, generative-discriminative trade-off, and penalized generative-discriminative trade-off learning principles as special cases. We present an interesting interpretation of this learning principle in the case of a special class of priors, and we illustrate its efficacy for the recognition of vertebrate transcription factor binding sites.
paper
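One common way of writing such a continuum between learning principles (the notation here is illustrative; the paper's formulation works by weighting the priors themselves):

\hat{\theta} = \arg\max_{\theta}\; \bigl[\, (1-\lambda)\,\log p(\mathbf{x}, \mathbf{y} \mid \theta) \;+\; \lambda\,\log p(\mathbf{y} \mid \mathbf{x}, \theta) \;+\; \kappa\,\log p(\theta) \,\bigr],

where λ = 0 with κ = 0 recovers maximum likelihood, λ = 0 with κ = 1 recovers maximum a-posteriori, λ = 1 recovers maximum conditional likelihood (κ = 0) or maximum supervised posterior (κ = 1), and intermediate values of λ give the (penalized) generative-discriminative trade-off.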


Parameter Estimation in a Hierarchical Model for Species Occupancy
Rebecca A. Hutchinson, Thomas G. Dietterich (Oregon State University)
In this paper, we describe a model for the relationship between the occupancy pattern of a species on a landscape and imperfect observations of its presence or absence. The structure in the observation process is incorporated generatively, and environmental inputs are incorporated discriminatively. Our experiments on synthetic data compare two methods for training this model under various regularization schemes. Our results suggest that maximizing the expected log-likelihood of the observations and the unknown true occupancy produces parameter estimates that are closer to the truth than maximizing the conditional likelihood of the observations alone.
paper
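A schematic form of this kind of hierarchical model (notation chosen here, following standard occupancy models rather than the paper): for site i with environmental covariates x_i and visit-level detection covariates w_{it},

z_i \sim \mathrm{Bernoulli}\bigl(\sigma(\alpha^{\top} x_i)\bigr), \qquad y_{it} \mid z_i \sim \mathrm{Bernoulli}\bigl(z_i\, \sigma(\beta^{\top} w_{it})\bigr),

so the true occupancy z_i is latent, the site-level term σ(α^⊤ x_i) maps environmental inputs to an occupancy probability (the discriminative part), and the detection layer generates the imperfect observations y_{it} (the generative part).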
