DISCLAIMER: The workshop papers haven't been reviewed for correctness.
* identifies the presenter.
Workshop room: Westin: Alpine DE
07:30-07:50  Opening address: themes of the workshop, terminology, open questions Simon Lacoste-Julien, Percy Liang, Guillaume Bouchard video
07:50-08:20  Invited talk: Generative and Discriminative Models in Statistical Parsing Michael Collins (MIT) video
08:20-08:40  Generative and Discriminative Latent Variable Grammars Slav Petrov (Google Research) video — paper
08:40-09:00  Discriminative and Generative Views of Binary Experiments Mark D. Reid, Robert C. Williamson* (Australian National University) video — paper
09:00-09:30  coffee break
09:30-10:00  Invited talk: Multi-Task Discriminative Estimation for Generative Models and Probabilities Tony Jebara (Columbia University) video
10:00  Poster Session

SKI / DISCUSSION BREAK 
15:50-16:20  Invited talk: Generative and Discriminative Image Models John Winn (Microsoft Research Cambridge) video
16:20-16:40  Learning Feature Hierarchies by Learning Deep Generative Models Ruslan Salakhutdinov (MIT) video — paper
16:40-17:00  Why does Unsupervised Pre-training Help Deep Discriminant Learning? Dumitru Erhan*, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent (Université de Montréal) video — paper
17:00-17:30  coffee break
17:30-17:50  Unsupervised Learning by Discriminating Data from Artificial Noise Michael Gutmann*, Aapo Hyvärinen (University of Helsinki) video — paper
17:50-18:45  Panel Discussion

18:45-18:50  wrap-up
Invited Talks
Generative and Discriminative Models in Statistical Parsing
Michael Collins (MIT)
Since the earliest work on statistical parsing, a constant theme has been the development of discriminative and generative models with complementary strengths. In this work I’ll give a brief history of discriminative and generative models in statistical parsing, focusing on strengths and weaknesses of the various models. I’ll start with early work on discriminative history-based models (in particular, the SPATTER parser), moving through early discriminative and generative models based on lexicalized (dependency) representations, through to recent work on conditional-random-field based models. Finally, I’ll describe research on semi-supervised approaches that combine discriminative and generative models.
Multi-Task Discriminative Estimation for Generative Models and Probabilities
Tony Jebara (Columbia University)
Maximum entropy discrimination is a method for estimating distributions such that they meet classification constraints and perform accurate prediction. These distributions are over parameters of a classifier, for instance, log-linear prediction models or log-likelihood ratios of generative models. Many of the resulting optimization problems are convex programs and sometimes just simple quadratic programs. In multi-task settings, several discrimination constraints are available from many tasks, which potentially produces even better discrimination. This advantage manifests itself if some parameter tying is involved, for instance, via multi-task sparsity assumptions. Using new variational bounds, it is possible to implement the multi-task variants as (sequential) quadratic programs or sequential versions of the independent discrimination problems. In these settings, it is possible to show that multi-task discrimination requires no more than a constant increase in computation over independent single-task discrimination.
Generative and Discriminative Image Models
John Winn (Microsoft Research Cambridge)
Creating a good probabilistic model for images is a challenging task, due to the large variability in natural images. For general photographs, an ideal generative model would have to cope with scene layout, occlusion, variability in object appearance and position, 3D rotation, and illumination effects like shading and shadows. The formidable challenges in creating such a model have led many researchers to pursue discriminative models, which instead use image features that are largely invariant to many of these sources of variability. In this talk, I will compare the two approaches, describe some strengths and weaknesses of each, and suggest some directions in which the best aspects of both can be combined.
Contributed Talks
Generative and Discriminative Latent Variable Grammars
Slav Petrov (Google Research)
Latent variable grammars take an observed (coarse) treebank and induce more fine-grained grammar categories that are better suited for modeling the syntax of natural languages. Estimation can be done in a generative or a discriminative framework, and results in the best published parsing accuracies over a wide range of syntactically divergent languages and domains. In this paper we highlight the commonalities and the differences between the two learning paradigms, and speculate that a hybrid approach might outperform either one on its own.
paper
Discriminative and Generative Views of Binary Experiments
Mark D. Reid, Robert C. Williamson (Australian National University and NICTA)
We consider binary experiments (supervised learning problems where there are two different labels) and explore formal relationships between two views of them, which we call “generative” and “discriminative”. The discriminative perspective involves an expected loss. The generative perspective (in our sense) involves the distances between class-conditional distributions. We extend known results to the class of all proper losses (scoring rules) and all f-divergences as distances between distributions. We also sketch how one can derive the SVM and MMD algorithms from the generative perspective.
paper
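One standard instance of the loss/divergence correspondence described in the abstract (a known identity, written here in generic notation rather than the paper's own): with the f-divergence

```latex
I_f(P, Q) = \int f\!\left(\frac{dP}{dQ}\right) dQ ,
\qquad
f(t) = \tfrac{1}{2}\,\lvert t - 1 \rvert
\;\Rightarrow\;
I_f(P, Q) = V(P, Q) ,
```

the variational divergence V between the class-conditional distributions P and Q, the Bayes risk of the 0-1 loss at equal class priors is L* = (1 - V(P, Q)) / 2. A small achievable expected loss on the discriminative side thus corresponds exactly to well-separated class-conditionals on the generative side.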
Learning Feature Hierarchies by Learning Deep Generative Models
Ruslan Salakhutdinov (MIT)
In this paper we present several ideas based on learning deep generative models from high-dimensional, richly structured sensory input. We will exploit the following two key properties: First, we show that deep generative models can be learned efficiently from large amounts of unlabeled data. Second, they can be discriminatively fine-tuned using the standard backpropagation algorithm. Our results reveal that the learned high-level feature representations capture a lot of structure in the unlabeled input data, which is useful for subsequent discriminative tasks, such as classification or regression, even though these tasks are unknown when the deep generative model is being trained.
paper
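The two-stage recipe in the abstract (unsupervised generative learning, then discriminative fine-tuning by backpropagation) can be illustrated with a deliberately tiny sketch, not the authors' implementation: a one-layer tied-weight autoencoder stands in for the deep generative model, and a logistic output layer is then fine-tuned jointly with the encoder. All data, dimensions, and learning rates are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy data: two classes offset along a fixed direction in 20 dimensions.
n, d, h = 200, 20, 5
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d)) + 3.0 * y[:, None] * (np.ones(d) / np.sqrt(d))

# Stage 1 -- unsupervised stage: a one-layer tied-weight autoencoder
# trained to reconstruct X (a stand-in for a deep generative model).
W = 0.1 * rng.normal(size=(d, h))
for _ in range(200):
    H = sigmoid(X @ W)                 # encode
    err = H @ W.T - X                  # linear decode, reconstruction error
    grad = X.T @ (err @ W * H * (1 - H)) + err.T @ H
    W -= 0.01 * grad / n

# Stage 2 -- discriminative fine-tuning: a logistic output layer is added
# and the whole network (including W) is trained by backpropagation.
v, b = np.zeros(h), 0.0
for _ in range(300):
    H = sigmoid(X @ W)
    p = sigmoid(H @ v + b)
    dout = p - y
    grad_W = X.T @ (np.outer(dout, v) * H * (1 - H)) / n
    v -= 0.5 * H.T @ dout / n
    b -= 0.5 * dout.mean()
    W -= 0.5 * grad_W

accuracy = np.mean((sigmoid(sigmoid(X @ W) @ v + b) > 0.5) == (y == 1))
print(accuracy)
```

The pretrained weights serve only as an initialization; the discriminative criterion is free to reshape them during fine-tuning, which is what the backpropagation stage does here.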
Why does Unsupervised Pre-training Help Deep Discriminant Learning?
Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent (Université de Montréal)
Recent research has been devoted to learning algorithms for deep architectures such as Deep Belief Networks and stacks of autoencoder variants, with impressive results obtained in several areas. The best results obtained on supervised learning tasks involve an unsupervised learning component, usually in an unsupervised pre-training phase, with a generative model. Even though these new algorithms have enabled training deep models fine-tuned with a discriminant criterion, many questions remain as to the nature of this difficult learning problem. The main question investigated here is the following: why does unsupervised pre-training work, and why does it work so well? Answering these questions is important if learning in deep architectures is to be further improved. We propose several explanatory hypotheses and test them through extensive simulations. We empirically show the influence of unsupervised pre-training with respect to architecture depth, model capacity, and number of training examples. The experiments confirm and clarify the advantage of unsupervised pre-training. The results suggest that unsupervised pre-training guides the learning towards basins of attraction of minima that are better in terms of the underlying data distribution; the evidence from these results supports an unusual regularization explanation for the effect of pre-training.
paper
Unsupervised Learning by Discriminating Data from Artificial Noise
Michael Gutmann, Aapo Hyvärinen (University of Helsinki)
Noise-contrastive estimation is a new estimation principle that we have developed for parameterized statistical models. The idea is to train a classifier to discriminate between the observed data and some artificially generated noise, using the model log-density function in a logistic regression function. It can be proven that this leads to a consistent (convergent) estimator of the parameters. The method is shown to work directly for models where the density function does not integrate to unity (unnormalized models): the normalization constant (partition function) can be estimated like any other parameter. We compare the method with other methods that can be used to estimate unnormalized models, including score matching, contrastive divergence, and maximum likelihood where the correct normalization is estimated with importance sampling. Simulations show that noise-contrastive estimation offers the best trade-off between computational and statistical efficiency. The method is then applied to the modeling of natural images.
paper
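As a minimal sketch of the principle (not the authors' code; the 1-D model, uniform noise distribution, and grid search are illustrative simplifications), one can estimate both the width of an unnormalized Gaussian and its normalization constant by logistic-regression-style discrimination of data against noise:

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed data from N(0, 1).  Unnormalized model:
#   log phi(x; sigma2, c) = -x^2 / (2 * sigma2) + c,
# where c stands in for the (unknown) negative log partition function
# and is estimated like any other parameter.
x_data = rng.normal(0.0, 1.0, size=5000)
x_noise = rng.uniform(-5.0, 5.0, size=5000)   # artificial noise: Uniform(-5, 5)
log_pn = np.log(1.0 / 10.0)                   # noise log-density (constant here)

def nce_loss(params, xd, xn):
    log_sigma2, c = params
    sigma2 = np.exp(log_sigma2)

    def log_pm(x):
        return -x**2 / (2.0 * sigma2) + c

    # G(x) = log p_model(x) - log p_noise(x); the logistic regression
    # classifies data vs. noise with probability sigmoid(G(x)).
    gd = log_pm(xd) - log_pn
    gn = log_pm(xn) - log_pn
    return np.mean(np.log1p(np.exp(-gd))) + np.mean(np.log1p(np.exp(gn)))

# A crude grid search keeps the sketch dependency-free; any optimizer works.
grid = [(ls, c) for ls in np.linspace(-1.0, 1.0, 41)
                for c in np.linspace(-2.0, 0.0, 41)]
best = min(grid, key=lambda prm: nce_loss(prm, x_data, x_noise))
sigma2_hat, c_hat = np.exp(best[0]), best[1]
print(sigma2_hat, c_hat)   # true values: 1.0 and -log sqrt(2*pi) ≈ -0.92
```

Because c is a free parameter, the fitted density comes out (approximately) normalized on its own, which is the property the abstract highlights for unnormalized models.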
Posters
Inferring Meta-covariates Via an Integrated Generative and Discriminative Model
Keith J. Harris, Lisa Hopcroft, Mark Girolami (University of Glasgow)
This paper develops an alternative method for analysing high-dimensional data sets that combines model-based clustering and multiclass classification. By averaging the covariates within the clusters obtained from model-based clustering, we define “meta-covariates” and use them to build a multinomial probit regression model, thereby selecting clusters of similarly behaving covariates, aiding interpretation. This simultaneous learning task is accomplished by a variational EM algorithm that optimises a joint distribution which rewards good performance at both classification and clustering. We explore the performance of our methodology on a well-known leukaemia dataset and use the Gene Ontology to interpret our results.
paper
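The cluster-then-average-then-classify pipeline can be sketched on toy data. This is a rough stand-in, not the paper's method: correlation-to-seed assignment replaces model-based clustering, logistic regression replaces the multinomial probit, and the seed columns are an illustrative shortcut.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: 100 samples, 30 covariates in 3 correlated groups of 10;
# the class label depends only on the first latent group.
n, g, k = 100, 10, 3
latent = rng.normal(size=(n, k))
X = np.repeat(latent, g, axis=1) + 0.3 * rng.normal(size=(n, k * g))
y = (latent[:, 0] > 0).astype(float)

# Step 1: cluster the covariates (columns) by correlation with seed columns.
corr = np.corrcoef(X.T)
seeds = [0, g, 2 * g]
labels = np.argmax(corr[:, seeds], axis=1)

# Step 2: average the covariates within each cluster -> meta-covariates.
meta = np.column_stack([X[:, labels == j].mean(axis=1) for j in range(k)])

# Step 3: fit a classifier on the meta-covariates, so the selected clusters
# (rather than individual covariates) carry the interpretation.
w = np.zeros(k)
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(meta @ w)))
    w -= meta.T @ (p - y) / n          # plain gradient step on the log-loss
accuracy = np.mean((meta @ w > 0) == (y == 1))
print(accuracy)
```

In the paper the clustering and the classifier are fit jointly by variational EM; here the two stages are run sequentially only to keep the sketch short.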
Naïve Bayes vs. Logistic Regression: An Assessment of the Impact of the Misclassification Cost
Vidit Jain (University of Massachusetts Amherst)
Recent advances in the asymptotic characterization of generative and discriminative learning have suggested several ways to develop more effective hybrid models. Applying these suggestions to a practical problem domain remains non-trivial, perhaps due to the violation of various underlying assumptions. One common assumption is the choice of an equal misclassification cost, or the ability to estimate such a cost. Here, we investigate the effect of this misclassification cost on the comparison between naïve Bayes and logistic regression. To assess the utility of this comparison for practical domains, we include a comparison of mean average precision values in our experiments. We present the empirical comparison patterns on the LETOR data set to solicit support from related theoretical results.
paper
Hybrid Model of Conditional Random Field and Support Vector Machine
Qinfeng Shi, Mark Reid, Tiberio Caetano (Australian National University and NICTA)
It is known that probabilistic models often converge to the true distribution asymptotically (i.e., they are Fisher consistent). In practice, however, this consistency is of limited use, since it is impossible to fit a model to infinitely many data points in finite time. The SVM is Fisher inconsistent in the multiclass and structured-label cases; however, it does provide a PAC bound on the true error (known as a generalization bound). Is there a model that is Fisher consistent for classification and also has a generalization bound? We use a naive combination of the two models, simply taking a weighted sum of their losses. This leads to a surprising theoretical result: the hybrid loss can be Fisher consistent in some circumstances, and it has a PAC-Bayes bound on its true error.
paper
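The weighted-sum construction can be illustrated on a binary toy problem. This is a minimal sketch blending the logistic (CRF-style) loss with the hinge (SVM) loss; the data, the weighting alpha = 0.5, and the optimizer are illustrative choices, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy binary problem with labels in {-1, +1}.
X = rng.normal(size=(200, 2))
y = np.sign(X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=200))

def hybrid_grad(w, alpha):
    m = y * (X @ w)                              # per-example margins
    # CRF-style log loss log(1 + e^{-m}) and SVM hinge loss max(0, 1 - m);
    # the hybrid loss is alpha * log_loss + (1 - alpha) * hinge_loss.
    g_log = -(y / (1.0 + np.exp(m)))[:, None] * X
    g_hinge = np.where(m < 1.0, -y, 0.0)[:, None] * X
    return (alpha * g_log + (1.0 - alpha) * g_hinge).mean(axis=0)

w = np.zeros(2)
for _ in range(500):                             # plain gradient descent
    w -= 0.5 * hybrid_grad(w, alpha=0.5)

accuracy = np.mean(np.sign(X @ w) == y)
print(accuracy)
```

Sliding alpha between 0 and 1 interpolates between the pure SVM and the pure CRF objective, which is the knob the abstract's consistency result is about.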
Weighting Priors for Hybrid Learning Principles
Jens Keilwagen (IPK Gatersleben), Jan Grau, Stefan Posch, Marc Strickert, Ivo Grosse (University of HalleWittenberg)
The recognition of functional binding sites in genomic DNA remains one of the fundamental challenges of genome research. During the last decades, a plethora of different and well-adapted models has been developed, but only little attention has been paid to the development of different and similarly well-adapted learning principles. Only recently was it noticed that discriminative learning principles can be superior to generative ones in diverse bioinformatics applications, too. Here, we propose a generalization of generative and discriminative learning principles containing the maximum likelihood, maximum a posteriori, maximum conditional likelihood, maximum supervised posterior, generative-discriminative trade-off, and penalized generative-discriminative trade-off learning principles as special cases. We present an interesting interpretation of this learning principle in the case of a special class of priors, and we illustrate its efficacy for the recognition of vertebrate transcription factor binding sites.
paper
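One common way to write a unified objective with the listed special cases (our notation, a sketch rather than the paper's exact formulation) is a weighted combination of generative likelihood, conditional likelihood, and prior:

```latex
\hat{\theta} = \arg\max_{\theta}\;
  \alpha \sum_{i} \log p(x_i, y_i \mid \theta)
  + \beta \sum_{i} \log p(y_i \mid x_i, \theta)
  + \log p(\theta)
```

Setting (alpha, beta) = (1, 0) recovers maximum a posteriori estimation (maximum likelihood under a flat prior), (0, 1) recovers maximum supervised posterior (maximum conditional likelihood under a flat prior), and intermediate weights give generative-discriminative trade-offs.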
Parameter Estimation in a Hierarchical Model for Species Occupancy
Rebecca A. Hutchinson, Thomas G. Dietterich (Oregon State University)
In this paper, we describe a model for the relationship between the occupancy pattern of a species on a landscape and imperfect observations of its presence or absence. The structure in the observation process is incorporated generatively, and environmental inputs are incorporated discriminatively. Our experiments on synthetic data compare two methods for training this model under various regularization schemes. Our results suggest that maximizing the expected log-likelihood of the observations and the unknown true occupancy produces parameter estimates that are closer to the truth than maximizing the conditional likelihood of the observations alone.
paper
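The generative-observation / discriminative-input split described above can be sketched as a standard site-occupancy likelihood. The covariates, sample sizes, and parameter values below are invented for illustration, and the detection probability is shared across visits for simplicity.

```python
import numpy as np

# Site-occupancy model: site i is occupied (z_i = 1) with probability
# psi_i = sigmoid(env_i . beta) -- the discriminative use of environmental
# inputs -- and, if occupied, each of T visits detects the species with
# probability p -- the generative observation process.  The latent z_i is
# marginalized out of the likelihood.
def occupancy_log_lik(beta, p, env, dets):
    psi = 1.0 / (1.0 + np.exp(-(env @ beta)))
    T = dets.shape[1]
    d = dets.sum(axis=1)                         # detections per site
    lik_occ = p**d * (1.0 - p)**(T - d)          # P(history | occupied)
    lik_unocc = (d == 0).astype(float)           # all-zero histories only
    return np.sum(np.log(psi * lik_occ + (1.0 - psi) * lik_unocc))

rng = np.random.default_rng(2)
env = np.column_stack([np.ones(300), rng.normal(size=300)])
beta_true, p_true = np.array([0.0, 1.5]), 0.7
z = rng.random(300) < 1.0 / (1.0 + np.exp(-(env @ beta_true)))
dets = (rng.random((300, 4)) < p_true) & z[:, None]

# The marginal likelihood distinguishes the true detection probability
# from a badly misspecified one.
ll_true = occupancy_log_lik(beta_true, p_true, env, dets)
ll_wrong = occupancy_log_lik(beta_true, 0.2, env, dets)
print(ll_true > ll_wrong)
```

Note the key ambiguity the model captures: an all-zero detection history can come either from an unoccupied site or from an occupied site that was never detected, which is why the two branches are mixed inside the log.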