Dataset_Dev.jsonl
{"id":"085e6abb-48c7-4388-9d70-901c4b173369","text":"The paper describes an extension of word embedding methods to also provide\r\nrepresentations for phrases and concepts that correspond to words. The method\r\nworks by fixing an identifier for groups of phrases, words and the concept that\r\nall denote this concept, replace the occurrences of the phrases and words by\r\nthis identifier in the training corpus, creating a \"tagged\" corpus, and then\r\nappending the tagged corpus to the original corpus for training. The\r\nconcept\/phrase\/word sets are taken from an ontology. Since the domain of\r\napplication is biomedical, the related corpora and ontologies are used. The\r\nresearchers also report that they achieve competitive performance on concept similarity and relatedness tasks, indicating the effectiveness of their approach. Moreover, the method showcases its advantage by requiring no human annotation of the corpus, which reduces the time and cost associated with manual efforts. The authors also highlight the superior vocabulary coverage of their embeddings, with more than 3x coverage in terms of vocabulary size compared to existing methods. This aspect is crucial in ensuring a comprehensive representation of concepts, phrases, and words in the embedding space. By jointly embedding these different linguistic units, the proposed method paves the way for a more holistic understanding and analysis of text data in the biomedical domain. Overall, the paper contributes to the field of embedding ontology concepts by presenting a novel weakly-supervised method that captures the relationships between concepts, phrases, and words in an unsupervised manner. The results achieved in terms of performance, coverage, and cost-effectiveness make this approach a valuable addition to the existing literature. However, further experiments and evaluations across different domains and corpora could enhance the generalizability of the proposed method and provide a more comprehensive assessment of its capabilities. Nonetheless, the work presented in this paper is a promising step towards jointly embedding concepts, phrases, and words, and it opens up avenues for future research in the field.","label":94}
{"id":"f8f8e2cb-494d-4a67-be25-d10e07f2c622","text":"Summary: This paper presents a model for embedding words, phrases and concepts\r\ninto vector spaces. To do so, it uses an ontology of concepts, each of which is\r\nmapped to phrases. These phrases are found in text corpora and treated as\r\natomic symbols. Using this, the paper uses what is essentially the skip-gram\r\nmethod to train embeddings for words, the now atomic phrases and also the\r\nconcepts associated with them. The proposed work is evaluated on the task of\r\nconcept similarity and relatedness using UMLS and Yago to act as the backing\r\nontologies.\r\n\r\nStrengths:\r\n\r\nThe The proposed work is evaluated on the task of concept similarity and relatedness using UMLS and Yago to act as the backing ontologies.\r\n\r\nStrengths:\r\n\r\nThe paper addresses an important problem of embedding concepts, phrases, and words into vector spaces, which has important applications in various natural language processing tasks. The use of representative phrases for ontology concepts as distant supervision is a novel approach that allows for the learning of embeddings without expensive manual annotation or reliance on automated concept tagging methods. The paper also provides a comprehensive evaluation of the proposed method on concept similarity and relatedness tasks, comparing the performance of the embeddings with existing methods and demonstrating competitive results. The achieved performance is particularly impressive considering the absence of human corpus annotation and the significant coverage in the vocabulary size.\r\n\r\nWeaknesses:\r\n\r\nOne potential weakness of the proposed method is the reliance on the representative phrases for ontology concepts. While using these phrases as distant supervision allows for the learning of embeddings, it may also introduce noise or biases into the embedding space. It would be beneficial for the authors to further investigate the impact of using different representative phrases and examine whether the embeddings are sensitive to the choice of phrases. Additionally, the paper lacks a detailed discussion of the limitations and potential future directions of the proposed method.\r\n\r\nOverall, this paper presents a novel weakly-supervised method for jointly embedding concepts, phrases, and words using representative phrases as distant supervision. The experimental evaluation demonstrates competitive performance on concept similarity and relatedness tasks, highlighting the effectiveness of the proposed method in capturing semantic relationships. The paper can be further improved by addressing the potential weaknesses mentioned above and providing a more thorough discussion on the limitations and future work of the proposed method.","label":86}
{"id":"2dbf1cd7-ae9d-4f43-8e07-d981e1b97c14","text":"The authors presents a method to jointly embed words, phrases and concepts,\r\nbased on plain text corpora and a manually-constructed ontology, in which\r\nconcepts are represented by one or more phrases. They apply their method in the\r\nmedical domain using the UMLS ontology, and in the general domain using the\r\nYAGO ontology. To evaluate their approach, the authors compare it to simpler\r\nbaselines and prior work, mostly on intrinsic similarity and relatedness\r\nbenchmarks. They use existing benchmarks in the medical domain, and use\r\nmechanical turkers to generate a new general-domain concept similarity and\r\nrelatedness dataset, which they also intend to release. They report results\r\nthat are comparable to prior work.\r\n\r\nStrengths:\r\n\r\n- The proposed joint embedding model is straightforward and makes reasonable\r\nsense to me. Its main value in my mind is in reaching a (configurable) middle\r\nground between treating phrases as atomic units on one hand to considering\r\ntheir\r\ncompositionallity on the other. The same approach is applied to concepts being\r\n\u2018composed\u2019 of several representative phrases.\r\n\r\n- The paper describes a decent volume of work, including model development,\r\nan additional contribution in the form of a new evaluation dataset, and several\r\nevaluations and analyses performed.\r\n\r\nWeaknesses:\r\n\r\n- The evaluation reported in this paper includes only intrinsic tasks, mainly\r\non similarity\/relatedness datasets. As the authors note, such evaluations are\r\nknown to have very limited power in predicting the utility of embeddings in\r\nextrinsic tasks. Accordingly, it has become recently much more common to\r\ninclude at least one or two extrinsic tasks as part of the evaluation of\r\nembedding models.\r\n\r\n- The similarity\/relatedness evaluation datasets used in the paper are\r\npresented as datasets recording human judgements of similarity between\r\nconcepts. However, if I understand correctly, the actual judgements were made\r\nbased on presenting phrases to the human annotators, and therefore they should\r\nbe considered as phrase similarity datasets, and analyzed as such.\r\n\r\n- The medical concept evaluation dataset, \u2018mini MayoSRS\u2019 is extremely small\r\n(29 pairs), and its larger superset \u2018MayoSRS\u2019 is only a little larger (101\r\npairs) and was reported to have a relatively low human annotator agreement. The\r\nother medical concept evaluation dataset, \u2018UMNSRS\u2019, is more reasonable in\r\nsize, but is based only on concepts that can be represented as single words,\r\nand were represented as such to the human annotators. This should be mentioned\r\nin the paper and makes the relevance of this dataset questionable with respect\r\nto representations of phrases and general concepts. \r\n\r\n- As the authors themselves note, they (quite extensively) fine tune their\r\nhyperparameters on the very same datasets for which they report their results\r\nand compare them with prior work. This makes all the reported results and\r\nanalyses questionable.\r\n\r\n- - The authors could have provided more explanation and justification for their choice of the joint embedding model. While it is described as a 'middle ground' approach, it is still important to discuss why this particular approach was chosen and how it compares to other existing models.\r\n\r\n- The lack of extrinsic task evaluations is a major limitation of the paper. 
Including such evaluations would have provided a more comprehensive assessment of the proposed method's usefulness in real-world applications.\r\n\r\n- The use of phrase similarity datasets as proxies for concept similarity is a valid concern. The authors should carefully analyze and discuss the implications of using phrase-based datasets instead of concept-based datasets.\r\n\r\n- The small size and low human annotator agreement of the medical concept evaluation datasets are limiting factors. It would be beneficial to include larger and more diverse medical concept datasets to provide a more robust evaluation.\r\n\r\n- The fine-tuning of hyperparameters on the same datasets used for evaluation raises questions about result reliability. The authors should consider conducting robustness experiments by using separate datasets for hyperparameter tuning.\r\n\r\n- Overall, while the paper presents a novel approach to jointly embedding concepts, phrases, and words, it lacks in-depth evaluation and critical analysis of its limitations. I would suggest addressing these weaknesses and conducting further experiments to strengthen the paper's conclusions.","label":403}
{"id":"4de9ac55-2090-4da6-9431-40137992d68b","text":"Dear Authors\r\n\r\nthanks for replying to our review comments, which clarifies some detail\r\nquestions..After carefully reviewing the paper titled 'Context-Dependent Sentiment Analysis in User-Generated Videos', I commend the authors for their insightful contribution to the field of multimodal sentiment analysis. The paper addresses an important research gap by tackling the issue of interdependencies and relations among utterances in videos. The proposed LSTM-based model introduces a novel approach of capturing contextual information from the surrounding utterances within the same video, resulting in improved classification performance compared to the state of the art. One significant strength of this paper is its emphasis on the multimodal aspect of sentiment analysis in videos. By considering not only the textual content but also other modalities such as visual and acoustic cues, the proposed model demonstrates a comprehensive approach to sentiment analysis. This not only enhances the accuracy of sentiment classification but also enables a more nuanced understanding of the sentiment expressed in user-generated videos.The reported performance improvement of 5-10% over the state of the art is impressive and signifies the effectiveness of the proposed model. Additionally, the authors' mention of high robustness to generalizability highlights the practicality and applicability of their approach across various video datasets. However, there are a few aspects that I believe could be further addressed in the paper. Firstly, although the abstract highlights the LSTM-based model, it would be beneficial to provide more technical details about the proposed architecture in the main body of the paper. This would enable readers to better understand the workings of the model and potentially replicate the experiments in their own research.Furthermore, it would be valuable to discuss the limitations of the proposed approach and potential avenues for future research. While the paper focuses on utterances within a video, it would be interesting to explore how the proposed model performs in the presence of longer videos or when capturing temporal dependencies across multiple videos. In conclusion, 'Context-Dependent Sentiment Analysis in User-Generated Videos' presents a significant contribution to the field of multimodal sentiment analysis. The proposed LSTM-based model effectively captures contextual information from utterances within a video, leading to improved sentiment classification performance. I appreciate the authors' efforts and believe that addressing the aforementioned suggestions would enhance the clarity and impact of the paper. Kudos to the authors for their valuable contribution!","label":12}
{"id":"a4a87fad-acac-4b54-bc6d-9ca998ccf34e","text":"- Strengths:\r\n\r\n1) This paper proposed a semi-automated framework (human generation -> auto\r\nexpansion -> human post-editing) to construct a compositional\r\nsemantic similarity evaluation data set.\r\n\r\n2) The proposed framework is used to create a Polish compositional semantic\r\nsimilarity evaluation data set which is useful for future work in developing\r\nPolish compositional semantic models.\r\n\r\n- Weaknesses:\r\n\r\n1) The proposed framework has only been tested on one language. It is not clear\r\nwhether the framework is portable to other languages. For example, the proposed\r\nframework relies on specific language-specific transformation rules, which may not be available for all languages. Further exploration and experimentation is needed to determine the applicability and effectiveness of the proposed framework for languages other than Polish.\r\n\r\n2) The size of the created Polish evaluation dataset is relatively small, consisting of only 10K sentence pairs. This limited size may restrict the generalizability and robustness of the dataset. It would be beneficial to expand the dataset by including more sentence pairs to cover a wider range of linguistic phenomena and improve the statistical significance of the results.\r\n\r\n3) While the paper mentions that the dataset is human-annotated for semantic relatedness and entailment, it does not provide detailed information about the annotation process and inter-annotator agreement. Transparency in the annotation process and reporting of inter-annotator agreement scores would enhance the reliability and reproducibility of the dataset.\r\n\r\nOverall, this paper presents a promising framework for constructing evaluation datasets for compositional distributional semantics models for languages other than English. However, further work is required to address the limitations mentioned above and validate the effectiveness and generalizability of the proposed framework.","label":75}
{"id":"192ef797-fef0-4f95-9750-979c8af4fea2","text":"# Summary\r\n\r\nThis paper presents an empirical study to identify a latent dimension of\r\nsentiment in word embeddings.\r\n\r\n# Strengths\r\n\r\n S1) Tackles a challenging problem of unsupervised sentiment analysis.\r\n\r\n S2) Figure 2, in particular, is a nice visualisation.\r\n\r\n# Weaknesses\r\n\r\n W1) The experiments, in particular, are W1) The experiments, in particular, are not adequately described and lack detailed information about the dataset used, sample size, and evaluation metrics. This makes it difficult to assess the robustness and generalizability of the proposed approach. Additionally, the comparison with the existing unsupervised method is limited to Turney (2002), which may not provide a comprehensive evaluation. It would be beneficial to include other state-of-the-art sentiment analysis models for a more thorough comparison. Moreover, the paper lacks a discussion on the limitations and potential future directions of the proposed method, which is important for the research community to understand the scope and potential improvements of the approach. Overall, while the paper presents an interesting approach for sentiment analysis, further improvements are needed in terms of experiment design, comparisons, and discussion of limitations.","label":42}
{"id":"6d22a695-1b8c-47c3-8e54-0dedd4fa3977","text":"- Strengths\r\nThis paper deals with the issue of finding word polarity orientation in an\r\nunsupervised manner, using word embeddings.\r\n\r\n- Weaknesses\r\nThe paper presents an interesting and useful idea, however, at this moment, it\r\nis not applied to any test case. The ideas on which it is based are explained\r\nin an \"intuitive\" manner and not thoroughly justified. Additionally, the paper lacks a comprehensive evaluation of the proposed approach. It would be beneficial to compare the performance of the vector space model against existing supervised methods to validate its efficacy. Furthermore, the paper could provide more details on the construction of the representative vectors and the process of determining the sentiment orientation value per word. Addressing these weaknesses will strengthen the paper and allow readers to better understand and critically assess the proposed approach.","label":53}
{"id":"37123fb4-dfad-46a5-88f2-18e79d6963fd","text":"Strengths:\r\n\r\n- Innovative idea: sentiment through regularization\r\n- Experiments appear to be done well from a technical point of view\r\n- Useful in-depth analysis of the model\r\n\r\nWeaknesses:\r\n\r\n- Very close to distant supervision\r\n- Mostly poorly informed baselines\r\n\r\nGeneral Discussion:\r\n\r\nThis paper presents an extension of the vanilla LSTM model that\r\nincorporates sentiment information through regularization. The\r\nintroduction presents the key claims of the paper: Previous CNN\r\napproaches are bad when no phrase-level supervision is present.\r\nPhrase-level annotation is expensive. The contribution of this paper is\r\ninstead a \"simple model\" using other linguistic resources.\r\n\r\nThe related work section provides a good review of sentiment\r\nliterature. However, there is no mention of previous attempts at\r\nlinguistic regularization (e.g., [YOG14]).\r\n\r\nThe explanation of the regularizers in section 4 is rather lengthy and\r\nrepetitive. The listing on p. 3 could very well be merged with the\r\nrespective subsection 4.1-4.4. Notation in this section is inconsistent\r\nand generally hard to follow. Most notably, p is sometimes used with a\r\nsubscript and sometimes with a superscript. The parameter \\beta is\r\nnever explicitly mentioned in the text. It is not entirely clear to me\r\nwhat constitutes a \"position\" t in the terminology of the paper. t is a\r\nparameter to the LSTM output, so it seems to be the index of a\r\nsentence. Thus, t-1 is the preceding sentence, and p_t is the prediction\r\nfor this sentence. However, the description of the regularizers talks\r\nabout preceding words, not sentences, but still uses. My assumption here\r\nis that p_t is actually overloaded and may either mean the sentiment of\r\na sentence or a word. However, this should be made clearer in the text.\r\n\r\nOne dangerous issue in this paper is that the authors tread a fine line between regularization and distant supervision. While the use of linguistic resources is mentioned as a key contribution, it would have been helpful to see a more detailed explanation of how these resources were incorporated into the model. The experiments are described well and the results show that the proposed models are able to capture the linguistic role of sentiment and intensity words. However, the baselines used in the experiments appear to be mostly poorly informed, which raises concerns about the robustness of the proposed models. Additionally, it would have been beneficial to see a more comprehensive comparison with existing state-of-the-art models for sentiment classification. The related work section provides a good overview of sentiment literature, but it lacks discussion on previous attempts at linguistic regularization, such as [YOG14]. The explanation of the regularizers in Section 4 could be improved by merging the listing on page 3 with the respective subsections. Furthermore, the inconsistency in notation and lack of clarity around what constitutes a \"position\" in the terminology of the paper make it difficult to follow the details of the model. Overall, the paper presents an innovative idea and provides useful analysis, but there are some weaknesses that should be addressed in order to strengthen the paper's contribution to the field of sentiment classification.","label":261}
{"id":"80f54650-21a7-4afd-a32e-ac40d54ad132","text":"- Strengths:\r\nThis paper proposes a nice way to combine the neural model (LSTM) with\r\nlinguistic knowledge (sentiment lexicon, negation and intensity). The method is\r\nsimple yet effective. It achieves the state-of-the-art performance on Movie\r\nReview dataset and is competitive against the best models on SST dataset. \r\n\r\n- Weaknesses:\r\nSimilar idea has also been used in (Teng et al., 2016). Though this work is \r\nmore elegant in exploring the linguistic aspects. Furthermore, the experimental results demonstrate the effectiveness of the proposed models. The models achieve state-of-the-art performance on the Movie Review dataset and remain competitive with the best models on the SST dataset. This indicates that the combination of the neural model with linguistic resources greatly enhances the sentiment classification task. However, it is important to note that similar ideas have been used in a previous work by Teng et al. (2016). Although this current work appears to present a more elegant approach in incorporating linguistic knowledge, it would be beneficial to provide a more detailed comparison or analysis to highlight the improvements or differences between the two approaches. Additionally, it would be interesting to explore the generalizability of the proposed models by evaluating them on other sentiment classification datasets. Overall, this paper presents a promising contribution to sentence-level sentiment classification by leveraging both linguistic resources and neural network models. With some further analysis and comparisons, it has the potential to make a significant impact in the field.","label":65}
{"id":"fb1150e1-258a-4bcf-8b4e-0b0e986c5a56","text":"- Strengths:\r\ni. Well organized and easy to understand\r\nii. Provides detailed comparisons under various experimental settings and shows\r\nthe state-of-the-art performances\r\n\r\n- Weaknesses:\r\ni. In experiments, this paper compares previous supervised approaches, but the\r\nproposed method is the semi-supervised approach ii. Provides detailed comparisons under various experimental settings and shows the state-of-the-art performances. \r\n\r\n- Weaknesses: \r\n\r\ni. In experiments, this paper compares previous supervised approaches, but the proposed method is the semi-supervised approach. However, although the paper acknowledges this limitation, it does not explain why a semi-supervised approach was chosen over a fully supervised one. Providing some justification or rationale for this decision would enhance the clarity of the paper. Additionally, while the paper investigates the effectiveness of external training sources for neural word segmentation, it does not thoroughly discuss the limitations and potential drawbacks of relying heavily on such external sources. It would be valuable to include a discussion on the potential issues that could arise when utilizing external data, such as inconsistencies or biases present in the external sources. This discussion would provide a more balanced view of the approach and help readers understand the trade-offs involved. Furthermore, the paper claims that the pretraining significantly improves the model's performance, but it would be helpful to have a more detailed analysis of the specific improvements and how they contribute to the overall accuracy. Providing more insights into the pretraining process and its impact would strengthen the paper's contribution. Overall, addressing these weaknesses would enhance the clarity and thoroughness of the paper.","label":35}
{"id":"4773a3f8-f2a8-4ecf-baf9-1a9519d51269","text":"- Strengths:\r\n\r\nImproves over the state-of-the-art. Method might be applicable for other\r\ndomains.\r\n\r\n- Weaknesses:\r\n\r\nNot much novelty in method. Not quite clear if data set is general enough for\r\nother domains.\r\n\r\n- General Discussion:\r\n\r\nThis paper describes a rule-based method for generating additional\r\nweakly labeled data for event extraction. The method has three main\r\nstages. First, it uses Freebase to find important slot fillers\r\nfor matching sentences in Wikipedia (using all slot fillers is too\r\nstringent resulting in too few matches). Next, it uses FrameNet to\r\nto improve reliability of labeling trigger verbs and to find nominal\r\ntriggers. Lastly, it uses a multi-instance learning to deal with\r\nthe noisily generated training data.\r\n\r\nWhat I like about this paper is What I like about this paper is that it tackles the challenging problem of automatically labeling data for large-scale event extraction. The authors propose a rule-based method that utilizes world knowledge and linguistic knowledge to detect key arguments and trigger words for each event type, which is then used to label events in texts automatically. This approach is beneficial because hand-labeled training data is expensive to produce and limited in coverage and size. The experimental results presented in the paper demonstrate that the quality of the automatically labeled data is competitive with human-labeled data, which is impressive. Additionally, the authors show that combining the automatically labeled data with human-labeled data improves the performance of models trained on these combined datasets. This suggests that the automatically labeled data can be effectively incorporated into existing event extraction systems. The paper also discusses the possibility of applying this method to other domains, which indicates potential for generalization. However, one weakness of the paper is that the novelty of the method is not clearly conveyed. It would be helpful to provide more context and explanation around the unique contributions of this approach. Additionally, it is not entirely clear if the dataset generated by this method is general enough for other domains. Overall, this paper presents an interesting approach to address the data labeling problem in event extraction and achieves promising results.","label":110}
{"id":"af7fe908-4176-4c02-86d4-44163faffc13","text":"- Strengths:\r\n\r\nThe paper addresses a long standing problem concerning automatic evaluation of\r\nthe output of generation\/translation systems.\r\n\r\nThe analysis of all the available metrics provides valuable insights into the limitations of existing automatic metrics in reflecting human judgements of NLG system outputs. The authors conduct a detailed error analysis, highlighting the challenges metrics face in distinguishing medium and good quality outputs. The observation that metrics and human judgements are given on different scales further reinforces the need for new evaluation measures. The introduction of RAINBOW, a novel metric that combines the strengths of different automatic scores, is a significant contribution to the field. The achieved correlation of up to rho=.81 on the sentence level outperforms existing metrics and demonstrates the effectiveness and stability of RAINBOW across different systems and datasets. Overall, the paper addresses an important issue and proposes a promising solution.","label":22}
{"id":"73ec3110-f0db-45fb-95df-fe3609faccc9","text":"This paper addresses the network embedding problem by introducing a neural\r\nnetwork model which uses both the network structure and associated text on the\r\nnodes, with an attention model to vary the textual representation based on the\r\ntext of the neighboring nodes.\r\n\r\n- Strengths:\r\n\r\nThe model leverages both the network and the text to construct the latent\r\nrepresentations, and the mutual attention approach seems sensible.\r\n\r\nA relatively thorough evaluation is provided, with multiple datasets,\r\nbaselines, and evaluation tasks.\r\n\r\n- Weaknesses:\r\n\r\nLike many other papers in the \"network embedding\" literature, which use neural\r\nnetwork techniques inspired by word embeddings to construct latent\r\nrepresentations of nodes in a network, the previous line of work on\r\nstatistical\/probabilistic modeling of networks is ignored..Like many other papers in the 'network embedding' literature, which use neural network techniques inspired by word embeddings to construct latent representations of nodes in a network, the previous line of work on statistical\/probabilistic modeling of networks is ignored. This is a significant weakness of the paper as it limits the comparison and evaluation of the proposed model against more traditional approaches. While the neural network-based approach has shown promising results in various domains, it is important to consider how it compares to other methods, especially those that have been extensively studied in the network analysis field. It would have been beneficial to include a comparison with baseline statistical models, such as stochastic block models or latent space models, to provide a more comprehensive understanding of the performance and limitations of CANE. Additionally, the evaluation focuses on link prediction and vertex classification tasks, which are commonly used in the network embedding literature. However, it would have been useful to include other relevant tasks, such as community detection or network visualization, to assess the capabilities of CANE in a broader context. Expanding the evaluation to cover a wider range of tasks and including a comparison with statistical models would strengthen the paper and improve its overall contribution to the field of network analysis.","label":106}
{"id":"72ef2907-dd91-452c-bf44-db402589fcf2","text":"The paper introduces a simple and effective method for morphological paradigm\r\ncompletion in low-resource settings. The method uses a character-based seq2seq\r\nmodel trained on a mix of examples in two languages: a resource-poor language\r\nand a closely-related resource-rich language; each training example is\r\nannotated with a paradigm properties and a language ID. Thus, the model enables\r\ntransfer learning across languages The paper presents an innovative approach to address the problem of paradigm completion in low-resource settings, specifically focusing on cross-lingual transfer. The proposed method leverages the power of a character-based seq2seq model that has been trained on a combination of annotated examples from a resource-poor language and a closely-related resource-rich language. By incorporating paradigm properties and language ID during training, the model enables effective transfer learning across different languages.\r\n\r\nThe authors demonstrate the efficacy of their approach through comprehensive experiments involving 21 language pairs from diverse language families. The results indicate that the proposed method outperforms existing techniques, achieving up to 58% higher accuracy with the aid of cross-lingual transfer. Moreover, the findings unveil the intriguing possibility of zero-shot and one-shot learning in paradigm completion tasks, which is a remarkable contribution to the field.\r\n\r\nAn intriguing aspect of the study is the investigation into the influence of language relatedness on the transferability of morphological knowledge. The authors observe a strong correlation between language relatedness and the success of cross-lingual transfer, shedding light on the limitations and challenges in transferring morphological information across distantly-related languages. This analysis adds valuable insights to the broader understanding of multilingual NLP tasks.\r\n\r\nOverall, the paper offers a concise yet comprehensive overview of the proposed method, its effectiveness in low-resource scenarios, and the potential for cross-lingual transfer in enhancing paradigm completion tasks. The methodology is clearly described, allowing for reproducibility, and the experimental results are well-presented and thoroughly analyzed. However, as the reviewers, we would appreciate additional information regarding the limitations and future directions of the proposed approach. Additionally, it would be beneficial to compare the proposed method with other state-of-the-art multilingual models to provide a more comprehensive evaluation.","label":55}
{"id":"bfd31524-cb6d-4ae0-b2fe-5ba49ec225a3","text":"- Strengths:\r\nThe approach described in the manuscript outperformed the previous approaches\r\nand achieved the state-of-the-art result.\r\n\r\nRegarding data, the method used the combination of market and text data.\r\n\r\nThe approach used word embeddings to define the weight of each lexicon term by\r\nextending it to the similar terms in the document.\r\n\r\n- Weaknesses:\r\nDeep-learning based methods were known to be able to achieve relatively good\r\nperformances without much feature engineering. Therefore, it would be interesting to explore the potential of deep learning models in this context and compare their performance with the word embedding-based approach. Additionally, the manuscript could benefit from a more in-depth analysis of the characteristics of the reports in different financial sectors to identify any sector-specific trends or patterns. Overall, the paper presents a promising approach to volatility prediction using financial disclosures sentiments, but further investigation is needed to fully understand the potential of different modeling techniques and their applicability to different sectors.","label":62}
{"id":"146b28cb-7b7d-48df-b01d-0230ccc73c5f","text":"- Strengths:\r\n\r\n- Weaknesses:\r\n\r\n- General Discussion:\r\n\r\nThis paper investigates sentiment signals in companies\u2019 annual 10-K filing\r\nreports to forecast volatility. \r\n\r\nThe authors evaluate information retrieval term weighting models which are\r\nseeded with a finance-oriented sentiment lexicon and expanded with word\r\nembeddings..The authors evaluate information retrieval term weighting models which are seeded with a finance-oriented sentiment lexicon and expanded with word embeddings. They also study different fusion methods to combine text and market data resources. The word embedding-based approach proposed in this paper outperforms state-of-the-art methods in volatility prediction using financial disclosures sentiments. Furthermore, the authors investigate the characteristics of the reports of companies in different financial sectors. Overall, this research provides valuable insights into the use of sentiment analysis in forecasting market volatility and contributes to the existing literature in this field.","label":37}
{"id":"056d6e7f-6f30-44c3-ad27-0f888f771d3e","text":"This paper presents a dialogue agent where the belief tracker and the dialogue\r\nmanager are jointly optimised using the reinforce algorithm. It learns from\r\ninteraction with a user simulator. There are two training phases. The first is\r\nan imitation learning phase where the system is initialised using supervising\r\nlearning from a rule-based model. Then there is a reinforcement learning phase\r\nwhere the system has jointly been optimised using the RL objective.\r\n\r\n- Strengths: This paper presents a framework where This paper presents a framework where the belief tracker and the dialogue manager are jointly optimized using the reinforce algorithm. It learns from interaction with a user simulator, making it practical and scalable. The two training phases, imitation learning and reinforcement learning, allow the system to be initialized using supervising learning from a rule-based model and then further refined using RL objectives. This approach addresses the limitation of previous dialogue agents by replacing symbolic queries with a soft retrieval process, enabling end-to-end training of neural dialogue agents. The soft retrieval process, integrated with a reinforcement learner, not only improves task success rate and reward in simulations but also against real users. The paper also highlights the potential of the proposed fully neural end-to-end agent, which is trained entirely from user feedback, for building personalized dialogue agents. Overall, this paper makes a significant contribution to the field of dialogue agents for information access by introducing a novel approach that overcomes the differentiability issue and achieves improved performance through end-to-end reinforcement learning.","label":73}
{"id":"2237fc39-bf0e-4e9c-9748-237f0af6ac50","text":"- Strengths: this paper addresses (in part) the problem of interpreting Long\r\nShort-Term Memory (LSTM) neural network models trained to categorize written\r\njustifications in values-affirmation essays. This is definitely an interesting\r\nresearch question. To do so, the authors want to rely on approaches that have\r\nare standard in experimental psychology. Furthermore, the authors also aim at\r\nvalidating sociological assumptions via this study.\r\n\r\n- Weaknesses: one of the main weaknesses of the paper lies in the fact that the\r\ngoals are not clear enough. One overall, ambitious goal put forward by the\r\nauthors is to use approaches from experimental psychology to interpret LSTMs.\r\nHowever, no clear methodology to do so is presented in the paper. On the other\r\nhand, if the goal is to validate sociological assumptions, then one should do\r\nso by studying the relationships between gender markers and the written\r\njustifications, independently on any model. The claim that \"expected gender\r\ndifferences (are) a function of theories of gendered self-construal\" is not\r\nproven in the study.\r\n\r\n- General Discussion: if the study is interesting, it suffers from several weak\r\narguments. First of all, the fact that First of all, the fact that the methodology to interpret LSTMs is not clear enough raises questions about the validity of the results. Without a well-defined and transparent methodology, it is difficult to draw meaningful conclusions from the study. Additionally, the claim that expected gender differences are a result of theories of gendered self-construal is not adequately supported by the study's findings. While the authors mention studying the relationships between gender markers and written justifications, it remains unclear how this is related to the interpretation of LSTMs. Furthermore, the paper lacks a comprehensive discussion on the limitations of the study. It is crucial to address potential confounding variables and alternative explanations that could impact the observed results. Without a thorough consideration of these limitations, the conclusions drawn from the study may be premature or misguided. Overall, while the research question is interesting and has the potential to contribute to the field, the paper needs significant revisions in terms of methodology, clarity of goals, and discussion of limitations in order to provide robust and reliable findings.","label":170}
{"id":"f4e923b3-8246-4875-9458-9ac151b2a34a","text":"- Strengths:\r\nThe paper is thoroughly written and discusses its approach compared to\r\nother approaches. The authors are aware that their findings are limited to a specific type of essay, values-affirmation essays, and acknowledge the need for future research to explore the generalizability of their findings. The methodology of using neural networks to interpret written justifications is innovative and can provide valuable insights into understanding the underlying thought processes of individuals when they write such essays. The paper also highlights the significance of values-affirmation interventions in influencing positive attitudes and behaviors. The implications of this research could extend beyond the academic realm and have practical applications in fields such as education, psychology, and marketing. Overall, this paper contributes to the growing body of research on interpreting neural networks and provides a solid foundation for further investigation in this area.","label":20}
{"id":"866ad70b-334c-4882-aaed-1955f7bc2f3c","text":"- Strengths:\r\n\r\n1. The presentation of the paper, up until the final few sections, is excellent\r\nand the paper reads very well at the start. The paper has a clear structure and\r\nthe argumentation is, for the most part, good.\r\n2. The paper addresses an important problem by attempting to incorporate word\r\norder information into word (and sense) embeddings and the proposed solution is\r\ninteresting.\r\n\r\n- Weaknesses:\r\n\r\n 1. Unfortunately, the results are rather inconsistent and one is not left\r\nentirely convinced that the proposed models are better than the alternatives,\r\nespecially given the added complexity. Negative results are fine, but there is\r\ninsufficient analysis to learn from them. Moreover, no results are reported on\r\nthe word analogy task, besides being told that the proposed models were not\r\ncompetitive - this could have been interesting and analyzed further.\r\n2. Some aspects of the experimental setup were unclear or poorly motivated, such as the choice of hyperparameters and the selection of the semantic resources used. The justification for using a bidirectional LSTM is not fully explained, and it would have been helpful to see a comparison with a unidirectional LSTM to assess the impact of bidirectionality. Additionally, while the paper mentions that the proposed model outperforms most popular algorithms for learning embeddings, the specific metrics and evaluations used are not clearly described. This makes it difficult to fully understand the extent of the model's superiority and compare it with other approaches. Furthermore, the lack of results on the word analogy task is a missed opportunity for deeper analysis and interpretation. Understanding the model's performance on this task could provide valuable insights into its semantic understanding capabilities. Lastly, the paper could benefit from a more detailed discussion on the limitations and future directions of the proposed approach. Addressing these weaknesses and providing more comprehensive and precise experimental results would significantly enhance the impact and clarity of the paper. Overall, while the paper tackles an important problem and presents an interesting solution, further refinement and thorough analysis are needed to fully convince the readers of the effectiveness and superiority of the LSTMEmbed model.","label":136}
{"id":"df2b4dd3-8da1-4545-8107-b637913f4cc1","text":".The paper presents an interesting approach to text similarity measurement called TextFlow. The authors address the limitation of existing measures by leveraging the sequential nature of language. They propose representing input text pairs as continuous curves and using both the actual position of the words and sequence matching to calculate the similarity value. This approach is inspired by DNA sequence alignment algorithms, which is a novel and promising idea. The experiments conducted on 8 different datasets demonstrate promising results in various tasks such as paraphrase detection, textual entailment recognition, and ranking relevance. It would be beneficial if the authors provide more details about the datasets used, including their characteristics and sizes. Furthermore, thorough comparisons with existing text similarity measures would strengthen the credibility of TextFlow. The paper should also discuss the computational efficiency of TextFlow compared to other measures.Additionally, it would be helpful to explain the process of representing text pairs as continuous curves in more detail. How are the word positions encoded into curves? Are there any parameters or hyperparameters involved in this representation? Providing such information would enhance the reproducibility of the work.The paper introduces an important concept in the field of text similarity measures, but some aspects require further clarification. Despite these minor concerns, the presented results are encouraging, and TextFlow has the potential to be a valuable addition to the existing literature on text similarity. Overall, this paper presents a novel and promising approach that could contribute significantly to various natural language processing tasks.","label":0}
{"id":"177007f0-a7bc-4ea7-8c36-525ac029baf4","text":"- Strengths:\r\nNicely written and understandable.\r\nClearly organized. Targeted answering of research questions, based on \r\ndifferent experiments.\r\n\r\n- Weaknesses:\r\nMinimal novelty. The \"first sentence\" heuristic has been in the summarization\r\nliterature for many years. This work essentially applies this heuristic\r\n(evolved) in the keyword extraction setting. This is NOT to say that the work\r\nis trivial: it is just not really novel.\r\n\r\nLack of empirical evidence or comparison with existing approaches to validate the effectiveness of the proposed method. While the authors claim remarkable improvements in performance over PageRank models and strong baselines, it would be beneficial to provide more detailed analysis and discussion on the limitations of the proposed approach. Additionally, although the abstract mentions the incorporation of word positions, there is limited information on how this aspect is actually integrated into the biased PageRank algorithm. A more thorough explanation of the PositionRank model, including the specific steps and calculations involved, would help readers better understand the methodology. Overall, the paper has clear organization and presents an interesting approach, but it would greatly benefit from further exploration of its novelty and the inclusion of more in-depth evaluation results.","label":56}
{"id":"4bf6de2f-a5ab-4d76-af9e-eb74d734e4de","text":"- Strengths:\r\n\r\nThe paper makes several novel contributions to (transition-based) dependency\r\nparsing by extending the notion of non-monotonic transition systems and dynamic\r\noracles to unrestricted non-projective dependency parsing. The theoretical and\r\nalgorithmic analysis is clear and insightful, and the paper is admirably clear.\r\n\r\n- Weaknesses:\r\n\r\nGiven that the main motivation for using Covington's algorithm is to be able to\r\nrecover non-projective arcs, an empirical error analysis focusing on\r\nnon-projective structures would have further shed light on the benefits and limitations of the proposed approach. Additionally, the paper could have provided a more detailed comparison with other state-of-the-art parsing systems to provide a clearer understanding of the improvements achieved. Moreover, while the experiments on the CoNLL-X and CoNLL-XI datasets provide valuable insights into the performance of the proposed non-monotonic dynamic oracle, it would have been beneficial to include experiments on additional datasets to further validate the effectiveness of the approach across different domains and languages. The paper mentions that the non-monotonic dynamic oracle outperforms the monotonic version in the majority of languages, but it would have been interesting to see a breakdown of the results for each language individually. Additionally, the paper briefly mentions the development of several non-monotonic variants of the dynamic oracle, but does not provide a thorough exploration of these variants or a comparative analysis of their performance. Providing more details on these variants and their impact on parsing accuracy would have strengthened the paper. Finally, the paper could have included a discussion on the computational complexity of the proposed non-monotonic transition system and any potential trade-offs between accuracy and efficiency. Overall, the paper presents a valuable contribution to the field of non-projective parsing by introducing a fully non-monotonic transition system based on the Covington algorithm. The theoretical and algorithmic analysis is clear, and the experiments demonstrate promising results. However, additional empirical evaluations, such as error analysis, comparison with other parsing systems, and experimentation on different datasets, would have further enriched the paper and helped establish a stronger understanding of its contributions and limitations.","label":65}
{"id":"3b805154-e06d-4d91-abea-886d4e514599","text":"- Strengths:\r\n\r\n- this article puts two fields together: text readability for humans and\r\nmachine comprehension of texts\r\n\r\n- Weaknesses:\r\n\r\n- The goal of your paper is not entirely clear. I had to read the paper 4 times\r\nand I still do not understand what you are talking about!\r\n- The article is highly ambiguous what it talks about - machine comprehension\r\nor text comprehension. - The paper clearly states that it aims to evaluate the quality of reading comprehension datasets using two classes of metrics: prerequisite skills and readability. This is an important research goal as it helps in the development of natural language understanding systems. By analyzing six existing datasets, including MCTest and SQuAD, the authors were able to highlight the characteristics of these datasets based on each metric and also explore the correlation between prerequisite skills and readability. This analysis provided valuable insights on the impact of readability on question difficulty. - The strength of the paper lies in its unique approach of combining the fields of text readability for humans and machine comprehension. This interdisciplinary perspective adds depth to the evaluation process as it considers both human and machine factors. By focusing on the correlation between readability and question difficulty, the authors have brought attention to the possibility of creating an RC dataset that is easy to read but difficult to answer. This finding challenges the traditional assumption that difficult text automatically leads to difficult comprehension. - However, the weakness mentioned in the review is valid. The goal and content of the paper could be presented more clearly. The abstract should have provided a clearer overview of the research problem, objectives, and methodology. Additionally, the paper needs to clearly differentiate between the concepts of machine comprehension and text readability and how they are related to each other. This clarification will help readers understand the main focus of the study and avoid confusion. Overall, the paper presents an interesting approach to evaluating RC datasets and provides valuable insights into the relationship between prerequisite skills, readability, and question difficulty. With some improvements in terms of clarity and explanation, this research can make a valuable contribution to the field of natural language understanding and machine comprehension.","label":57}
{"id":"401a0132-6991-4b7d-9510-24f1893d69fb","text":"The paper explores the use of probabilistic models (gaussian processes) to\r\nregress on the target variable of post-editing time\/rates for quality\r\nestimation of MT output.\r\nThe paper is well structured with a clear introduction that highlights the\r\nproblem of QE point estimates in real-world applications. I especially liked\r\nthe description of the different asymmetric risk scenarios and how they entail\r\ndifferent estimators.\r\nFor readers familiar with GPs the paper spends quite some space to reflect\r\nthem, but I think it is worth the effort to introduce these concepts to the\r\nreader.\r\nThe GP approach and the choices for kernels and using warping are explained\r\nvery clearly and are easy to follow. In general the research questions that are\r\nto be answered by this paper are interesting and well phrased.\r\n\r\nHowever, I do have some questions\/suggestions about the Results and Discussion\r\nsections for Intrinsic Uncertainty Evaluation:\r\n- Why were post-editing rates chosen over prediction (H)TER? TER is a common\r\nvalue to predict in QE research and it would have been nice to justify the\r\nchoice made in the paper.\r\n- Section 3.2: I don't understand the first paragraph at all: What exactly is\r\nthe trend you see for fr-en & Section 3.2: I don't understand the first paragraph at all: What exactly is the trend you see for fr-en & English?\r\n\r\nIn the Results and Discussion sections for Intrinsic Uncertainty Evaluation, it would have been helpful to provide a more detailed explanation of the choice to use post-editing rates as the target variable instead of prediction HTER. While HTER is commonly used in QE research, the justification for selecting post-editing rates is not clear in the paper. Perhaps providing some insights into how post-editing rates relate to quality estimation would clarify this decision.\r\n\r\nAdditionally, I noticed that the paper presents the results for fr-en and English, but it does not fully explain the observed trends in the first paragraph of Section 3.2. It would be beneficial to elaborate more on the trends observed for these language pairs and provide more context to help readers better understand the findings.\r\n\r\nOverall, the paper offers a well-structured introduction and presents interesting research questions related to probabilistic methods for Quality Estimation. The description of Gaussian processes and the choices made regarding kernels and warping are explained clearly and are easy to follow. However, further clarification and justification would enhance the Results and Discussion sections, particularly regarding the choice of post-editing rates as the target variable and the trends observed in fr-en and English language pairs.","label":180}
{"id":"ab986a21-cfcd-4588-9ac5-eb979004dd8b","text":"- Strengths:\r\nThe idea of hard monotonic attention is new and substantially different from\r\nothers.\r\n\r\n- Weaknesses:\r\nThe experiment results on morphological inflection generation is The experiment results on morphological inflection generation is highly impressive. The authors evaluate their model on three previously studied morphological inflection generation datasets and demonstrate that it outperforms previous neural and non-neural approaches. This indicates the effectiveness of the proposed hard attention mechanism in capturing the nearly-monotonic alignment between characters in a word and its inflection. By achieving state-of-the-art results in various setups, the model showcases its generalization capability and robustness. Furthermore, the authors provide a comprehensive analysis of the continuous representations learned by both the hard and soft attention models, which offers valuable insights into the features extracted by such models. Overall, this paper not only presents a novel approach to morphological inflection generation, but also contributes to our understanding of attention mechanisms in neural models. It is a significant contribution to the field, and I highly recommend its acceptance for publication.","label":21}
{"id":"74e9e074-1975-413e-8b88-63d17ae74784","text":"- Strengths: A new encoder-decoder model is proposed that explicitly takes \r\ninto account monotonicity.\r\n\r\n- Weaknesses: Maybe the model is just an ordinary BiRNN with alignments\r\nde-coupled.\r\nOnly evaluated on morphology, no other monotone Seq2Seq tasks.\r\n\r\n- General Discussion:\r\n\r\nThe authors propose a novel encoder-decoder neural network architecture with\r\n\"hard monotonic attention\". They evaluate it on three morphology datasets.\r\n\r\nThis paper is a tough one. One the one hand it is well-written, mostly very\r\nclear and also presents a novel idea, namely including monotonicity in\r\nmorphology tasks. \r\n\r\nThe reason for including such monotonicity is pretty obvious: Unlike machine\r\ntranslation, many seq2seq tasks are monotone, and therefore general\r\nencoder-decoder models should not be used in the first place. That they still\r\nperform reasonably well should be considered a strong argument for neural\r\ntechniques, in general. The idea of this paper is now to explicity enforce a\r\nmonotonic output character generation. They do this by decoupling alignment and\r\ntransduction and first aligning input-output sequences monotonically and\r\nthen training to generate outputs in agreement with the monotone alignments.\r\nHowever, the authors are unclear on this point. I have a few questions:\r\n\r\n1) How do your alignments look like? On the one hand, the alignments seem to\r\nbe of the kind 1-to-many (as in the running example, Fig.1), that is, 1 input\r\ncharacter can be aligned with zero, 1, or several output characters. However,\r\nthis seems to contrast with the description given in lines 311-312 where the\r\nauthors speak of several input characters aligned to 1 output character. That\r\nis, do you use 1-to-many, many-to-1 or many-to-many alignments?\r\n\r\n2) Actually, there is a quite simple approach to monotone Seq2Seq. In a first\r\nstage, align input and output characters monotonically with a 1-to-many\r\nconstraint (one can use any monotone aligner, such as the toolkit of\r\nJiampojamarn and Kondrak). Then one trains a standard sequence tagger(!) to\r\npredict exactly these 1-to-many alignments. For example, flog->fliege (your\r\nexample on l.613): First align as in \"f-l-o-g \/ f-l-ie-ge\". Now use any tagger\r\n(could use an LSTM, if you like) to predict \"f-l-ie-ge\" (sequence of length 4)\r\nfrom \"f-l-o-g\" (sequence of length 4)..Based on the partial review, the paper: 'Morphological Inflection Generation with Hard Monotonic Attention' presents a novel idea of incorporating monotonicity in morphological tasks using a new encoder-decoder model with 'hard monotonic attention'. The authors address the limitation of general encoder-decoder models in handling monotone sequence-to-sequence tasks and propose an approach that enforces explicit monotonic output character generation.\r\n\r\nThe paper is well-written and effectively argues for the importance of considering monotonicity in neural techniques for sequence generation. The experimental evaluation of the proposed model on three morphology datasets showcases its state-of-the-art performance compared to both neural and non-neural approaches. However, a few concerns and questions arise from the partial review:\r\n\r\n1) The clarity regarding the alignments could be improved. While the running example suggests 1-to-many alignments, the authors mention several input characters aligned to 1 output character in some instances. 
It would be beneficial for the authors to provide further explanation and clarification regarding the types of alignments employed (1-to-many, many-to-1, or many-to-many).\r\n\r\n2) The reviewer suggests a simpler approach for monotone Seq2Seq tasks using a two-step process involving monotonic alignment and training a standard sequence tagger. It would be valuable for the authors to address this suggestion and discuss how their proposed model differs and potentially outperforms this alternate approach.\r\n\r\nOverall, the paper presents a promising contribution to the field of morphological inflection generation by incorporating explicit monotonicity in sequence generation models. Addressing the concerns and questions raised in the review would further strengthen the paper and provide additional insights into the proposed model's advantages over alternative approaches. ","label":323}
{"id":"83e64935-2aee-4316-853e-bcf27a01ef20","text":"- Strengths: The paper addresses a relevant topic: learning the mapping between\r\nnatural language and KB relations, in the context of QA (where we have only\r\npartial information for one of the arguments), and in the case of having a very\r\nlarge number of possible target relations.\r\n\r\nThe proposal consists in a new method to combine two different representations\r\nof the input text: a word level representation (i.e. with segmentation of words) and a relation level representation (i.e. with abstraction and comparison of relation names). The use of deep residual bidirectional LSTMs adds depth and context to the model, enhancing its ability to detect and classify relations accurately. The integration of entity linking and the proposed relation detector in the KBQA system is another strength of this paper, as it leverages the synergy between the two components. Experimental results demonstrate the effectiveness of the proposed approach, achieving outstanding performance in relation detection and state-of-the-art accuracy in single-relation and multi-relation question answering benchmarks (SimpleQuestions and WebQSP). Overall, this paper presents a valuable contribution to the field of knowledge base question answering by improving neural relation detection and achieving significant performance gains in QA tasks.","label":66}
{"id":"6ddfb805-3232-42b4-9efd-2aa1e3627921","text":"- Strengths:\r\nThis paper proposes a novel approach for dialogue state tracking that benefits\r\nfrom representing slot values with pre-trained embeddings and learns to compose\r\nthem into distributed representations of user utterances and dialogue context.\r\nExperiments performed on two datasets show consistent and significant\r\nimprovements over the baseline of previous delexicalization based approach.\r\nAlternative approaches (i.e., XAVIER, GloVe, Program-SL999) for pre-training\r\nword embeddings have been investigated.\r\n\r\n- Weaknesses:\r\nAlthough one of the main motivations for using embeddings is to generalize to\r\nmore complex dialogue domains where delexicalization may not scale for, the\r\ndatasets used seem limited. I although one of the main motivations for using embeddings is to generalize to more complex dialogue domains where delexicalization may not scale for, the datasets used seem limited. Additionally, the paper lacks a thorough analysis of the computational complexity of the proposed Neural Belief Tracking (NBT) framework, which could be important considering the increasing complexity of dialogue systems. Moreover, it would be helpful to include a comparison to other state-of-the-art models in terms of runtime and memory requirements. Nevertheless, the experimental results on two datasets demonstrate consistent and significant improvements over the baseline, indicating the effectiveness of the NBT framework in dialogue state tracking. Overall, this paper presents a promising approach that addresses existing challenges in dialogue domain scaling and provides valuable insights for future research in this area.","label":89}
{"id":"6c5322f9-59cc-4cfb-a244-66f5594867de","text":"This paper presents a neural network-based framework for dialogue state\r\ntracking.\r\nThe main contribution of this work is on learning representations of user\r\nutterances, system outputs, and also ontology entries, all of which are based\r\non pre-trained word vectors.\r\nParticularly for the utterance representation, the authors compared two\r\ndifferent neural network models: NBT-DNN and NBT-CNN.\r\nThe learned representations are combined with each other and finally used in\r\nthe downstream network to make binary decision for a given slot value pair.\r\nThe experiment shows that the proposed framework achieved significant\r\nperformance improvements compared to the baseline with the delexicalized\r\napproach.\r\n\r\nIt's generally a quality work with clear goal, reasonable idea, and improved\r\nresults from previous studies.\r\nBut the paper itself doesn't seem to be very well organized to effectively\r\ndeliver the details especially to readers who are not familiar with this area.\r\n\r\nFirst of all, more formal definition of DST needs to be given at the beginning\r\nof this paper.\r\nIt is not clear enough and could be more confusing after coupling with SLU.\r\nMy suggestion is to provide a general architecture of dialogue system described\r\nin Section 1 rather than Section 2, followed by the problem definition of DST\r\nfocusing on its relationships to other components of the dialogue system. Additionally, the authors should provide a more detailed explanation of how the NBT framework works, including the specific algorithms and techniques used for learning and composing the pre-trained word vectors. This would help readers understand the technical aspects of the proposed approach. The evaluation results presented in the paper are impressive and demonstrate the effectiveness of the NBT framework in comparison to existing models. However, more discussion and analysis of the results is needed to gain deeper insights into the strengths and limitations of the proposed approach. Furthermore, it would be beneficial to include comparisons with other state-of-the-art models in order to provide a more comprehensive evaluation. Overall, this paper makes a valuable contribution to the field of dialogue state tracking by introducing the NBT framework. With some improvements in organization and more in-depth analysis, this work has the potential to have a significant impact in the development of dialogue systems.","label":185}
{"id":"566e49ab-0658-4bdc-a184-8145815d8ae9","text":"This paper presents evaluation metrics for lyrics generation exploring the need\r\nfor the lyrics to be original,but in a similar style to an artist whilst being\r\nfluent and co-herent. The paper is well written and the motivation for the\r\nmetrics are well explained. \r\n\r\nThe authors describe both hand annotated metrics (fluency, co-herence and\r\nmatch) and an automatic metric for \u2018Similarity'. Whilst the metric for\r\nSimilarity is unique and innovative, the hand annotated metrics provide a more comprehensive evaluation. The authors recognize that evaluating creative language generation, especially in the context of rap lyrics, is a challenging task that requires considering various aspects of language use and style. The inclusion of annotations for stylistic similarity in the corpus of lyrics for 13 rap artists is a valuable contribution to the field. However, it would have been beneficial if the authors discussed the limitations and potential biases associated with manual evaluation. Additionally, the paper could benefit from a more detailed explanation of the automatic similarity metric and how it complements the manual metrics. Overall, this paper offers a promising approach to evaluating creative language generation and provides a solid foundation for future research in rap lyric ghostwriting.","label":63}
{"id":"a114bbfc-ec3d-48f5-904a-899283728037","text":".The paper 'Evaluating Creative Language Generation: The Case of Rap Lyric Ghostwriting' addresses a crucial issue in evaluating language generation tasks that aim to mimic human creativity. The authors highlight the challenges faced in evaluating such tasks, particularly in considering elements like creativity, style, and other non-trivial aspects of the generated text. The specific task examined in this study is ghostwriting of rap lyrics, which requires producing content that is similar in style to the emulated artist while being distinct in content.\r\n\r\nOne of the strengths of this paper is the development of a novel evaluation methodology that encompasses several complementary aspects of the ghostwriting task. By creating a corpus of lyrics for 13 rap artists, annotated for stylistic similarity, the authors demonstrate the feasibility of manual evaluation for generated verse. This contribution is significant as it lays the foundation for future research in this area and provides a quantifiable basis for evaluating the performance of systems engaged in rap lyric ghostwriting. \r\n\r\nFurthermore, the authors demonstrate the practical usefulness of their evaluation methodology by analyzing system performance, which adds credibility to their approach. By utilizing the annotated corpus, they can measure stylistic similarity between generated verses and the target artists, providing valuable insights into the effectiveness of their system. This empirical validation adds substantial value to the paper, as it establishes a benchmark for evaluating rap lyric ghostwriting systems.\r\n\r\nAnother notable aspect of this work is the attention given to the broader implications and future directions of the task. By outlining the goals for rap lyric ghostwriting and its potential applications, the authors promote further research in this area. This forward-thinking approach enhances the significance of the paper by showcasing the potential impact of evaluating and advancing creative language generation systems.\r\n\r\nIn conclusion, the paper 'Evaluating Creative Language Generation: The Case of Rap Lyric Ghostwriting' presents a valuable contribution to the field. The authors' development of a novel evaluation methodology and the creation of an annotated corpus facilitate the assessment and analysis of rap lyric ghostwriting systems. The practical application of their methodology and the consideration of future directions further strengthen the paper's importance. Overall, this work opens up new avenues for evaluating and improving language generation systems in creative domains, making significant contributions to the field of natural language processing.","label":0}
{"id":"861ec1d9-ea3f-426e-85ac-f46da8b23825","text":"This paper proposes to present a more comprehensive evaluation methodology for\r\nthe assessment of automatically generated rap lyrics (as being similar to a\r\ntarget artist). While the assessment of the generation of creative work is\r\nvery challenging and of great interest to the community, this effort falls\r\nshort of its claims of a comprehensive solution to this problem.\r\n\r\nAll assessment of this nature ultimately falls to a subjective measure -- can\r\nthe generated sample convince an expert that the generated sample was produced\r\nby the true artist rather than an automated preocess? This is essentially a\r\nmore specific version of a Turing Test. The effort to automate some parts of\r\nthe evaluation to aid in optimization and to understand how humans assess\r\nartistic similarity is valuable. However, the specific findings reported in\r\nthis paper do not provide a comprehensive solution to the problem of evaluating automatically generated rap lyrics. The authors highlight the difficulty in evaluating creative language generation tasks and acknowledge that the assessment ultimately relies on subjective measures. They argue that the generation of rap lyrics can be evaluated by whether the generated sample can convince an expert that it was produced by the true artist instead of an automated process. While this is an interesting perspective, it still leaves room for improvement in terms of a more objective evaluation framework.\r\n\r\nThe authors do acknowledge the value of automating parts of the evaluation process to aid in optimization and gaining insights into how humans assess artistic similarity. They present a novel evaluation methodology that considers several complementary aspects of the ghostwriting task. The use of a corpus of lyrics from 13 rap artists, annotated for stylistic similarity, is a significant contribution as it allows for the assessment of the feasibility of manual evaluation for generated verse.\r\n\r\nHowever, the paper could benefit from a more in-depth discussion on the limitations of their evaluation methodology. For instance, it would be valuable to explore how the process of selecting the rap artists and annotating their lyrics for stylistic similarity may introduce biases into the evaluation. Additionally, the authors mention the need for explicit, quantifiable measures, but there is limited discussion on the specific metrics used in their evaluation.\r\n\r\nFurthermore, the paper could provide more insights into the broader implications and applications of their evaluation methods. While the focus is on ghostwriting of rap lyrics, it would be interesting to explore how their evaluation methodology can be adapted and extended to other creative language generation tasks.\r\n\r\nIn conclusion, the paper presents a valuable contribution in addressing the challenging task of evaluating creative language generation, specifically in the context of rap lyric ghostwriting. While the proposed evaluation methodology and the provided corpus of annotated lyrics are commendable, there are areas for improvement in terms of addressing limitations and providing more detailed insights. Future directions for this research could involve refining the evaluation metrics and considering the impact of biases in the selection and annotation processes.","label":127}
{"id":"1bb4991a-b778-4085-a3e1-b27a9f383483","text":"- Strengths:\r\ni. Motivation is well described.\r\nii. Provides detailed comparisons with various models across diverse languages\r\n\r\n- Weaknesses:\r\ni. The conclusion is biased by the selected languages. \r\nii. The experiments do not cover the claim of this paper completely.\r\n\r\n- General Discussion:\r\nThis paper issues a simple but fundamental question about word representation:\r\nwhat subunit of a word is suitable to represent morphologies and how to compose\r\nthe units. To answer this question, this paper applied word representations\r\nwith various subunits (characters, character-trigram, and morphs) iii. The experiments are well-designed and provide valuable insights into the effectiveness of different subunits and their compositions.\r\n\r\niv. The paper acknowledges the limitations of character-level models and highlights the potential for improvement.\r\n\r\n- Strengths:\r\ni. Motivation is well described.\r\nii. Provides detailed comparisons with various models across diverse languages\r\niii. The experiments are well-designed and provide valuable insights into the effectiveness of different subunits and their compositions.\r\n\r\n- Weaknesses:\r\ni. The conclusion is biased by the selected languages.\r\nii. The experiments do not cover the claim of this paper completely.\r\n\r\n- General Discussion:\r\nThis paper issues a simple but fundamental question about word representation: what subunit of a word is suitable to represent morphologies and how to compose the units. To answer this question, this paper applied word representations with various subunits (characters, character-trigram, and morphs). The experiments conducted systematically vary the basic unit of representation, the composition of these representations, and the morphological typology of the language modeled. By doing so, the paper offers valuable insights into the performance and effectiveness of different subunits.\r\n\r\nThe paper's motivation is well-described, and the importance of studying the representation of morphological regularities in words is clearly established. The comparisons with various models across diverse languages help to strengthen the findings and provide a comprehensive analysis. The paper also acknowledges the limitations of character-level models and recognizes the potential for improvement to match the predictive accuracy of models with access to true morphological analyses.\r\n\r\nHowever, one weakness of the paper is the potential bias introduced by the selected languages. It would have been beneficial to include a more diverse set of languages to ensure the generalizability of the findings across different linguistic typologies. Additionally, while the experiments conducted provide valuable insights, it is noted that they do not fully cover the claims of the paper. Further experiments or analysis might be necessary to address this limitation.\r\n\r\nOverall, this paper makes a significant contribution to the field by addressing the fundamental question of word representation and its relationship to morphology. The thorough experimental design and detailed comparisons enhance the credibility of the findings. The weaknesses identified can be addressed in future work to strengthen the conclusions and broaden the applicability of the research.\r\n}","label":96}
{"id":"8e0e1717-7f9c-4bf9-b26d-5385bb120170","text":"- Strengths:\r\n\r\nThis is the first neural network-based approach to argumentation\r\nmining. The proposed method used a Pointer Network (PN) model with\r\nmulti-task learning and outperformed previous methods in the\r\nexperiments on two datasets.\r\n\r\n- Weaknesses:\r\n\r\nThis is basically an application of PN to argumentation\r\nmining. Although the combination of PN and multi-task learning for\r\nthis task is novel, one weakness of the paper is that it lacks a detailed discussion on the limitations of the proposed approach. While the authors show that their joint model achieves state-of-the-art results on two evaluation corpora and outperforms the regular Pointer Network model, they do not thoroughly analyze cases in which the proposed model might fail or produce incorrect results. It is important for future research to investigate and understand the limitations and failure modes of the model in order to identify scenarios where alternative approaches or improvements are necessary. Additionally, the paper could benefit from a more in-depth explanation of the decision behind using a Pointer Network architecture for argumentation mining. While the novelty lies in applying this architecture to the task, it would be useful for readers to understand the underlying reasons for this choice and how it effectively addresses the challenges of link extraction and argument component classification. Further, the paper could provide more insights into the interpretability of the model's predictions. While achieving high performance is important, understanding the reasons behind the model's predictions can help users trust and make better use of the system. Overall, while the paper presents a promising approach to argumentation mining with novel contributions, it would greatly enhance its impact by addressing the mentioned weaknesses and providing more detailed discussions and insights.","label":51}
{"id":"1c735f31-8937-4009-ac5c-d2f56e2862fe","text":"The paper presents an application of Pointer Networks, a recurrent neural\r\nnetwork model original used for solving algorithmic tasks, to two subtasks of\r\nArgumentation Mining: determining the types of Argument Components, and finding\r\nthe links between them. The model achieves state-of-the-art results.\r\n\r\nStrengths:\r\n\r\n- Thorough review of prior art in the specific formulation of argument mining\r\nhandled in this paper.\r\n- Simple and effective modification of an existing model to make it suitable\r\nfor\r\nthe task. The model is mostly explained clearly.\r\n- Strong results as compared to prior art in this task.\r\n\r\nWeaknesses:\r\n\r\n- 071: This formulation of argumentation mining is just one of several proposed\r\nsubtask divisions, and this should be mentioned. For example, in [1], claims\r\nare detected and classified before any supporting evidence is detected.\r\nFurthermore, [2] applied neural networks to this task, so it is inaccurate to\r\nsay (as is claimed in the abstract of this paper) that this work is the first\r\nNN-based approach to argumentation mining.\r\n- Two things must be improved in the presentation of the model: (1) What is the\r\npooling method used for embedding features (line 397)? and (2) Equation (7) in\r\nline 472 is not clear enough: is E_i the random variable representing the\r\n*type* of AC i, or its *identity*?.Overall, this paper provides an interesting and novel approach to argumentation mining, utilizing a Pointer Network architecture to tackle the tasks of extracting links between argument components and classifying the types of argument components. The authors did a good job of reviewing the relevant prior art in this specific formulation of argument mining and highlighting the strengths and weaknesses of their proposed approach. The modification of the existing model to suit the argumentation mining tasks is both simple and effective, and the explanations of the model are generally clear. The achieved results are impressive, outperforming previous methods in this domain.\r\n\r\nHowever, there are a few weaknesses in the paper that need to be addressed. Firstly, it should be noted that this formulation of argumentation mining is just one of several proposed subtask divisions, and the authors should mention this and provide some contextualization. For example, in [1], a different approach is taken where claims are detected and classified before identifying supporting evidence. Additionally, [2] also applied neural networks to this task, so it is not accurate to claim that this work is the first NN-based approach to argumentation mining.\r\n\r\nFurthermore, there are a couple of areas in the presentation of the model that could be improved. Firstly, it is not clear what pooling method is used for embedding features, as mentioned in line 397. The authors should provide more details on this to ensure reproducibility. Secondly, Equation (7) in line 472 is not clear enough. It is unclear whether E_i represents the random variable for the type or the identity of argument component i. 
Clarifying this equation would help in better understanding the model.\r\n\r\nIn conclusion, while this paper makes a strong contribution to the field of argumentation mining by proposing a novel approach and achieving state-of-the-art results, further improvements can be made in terms of contextualizing the work, providing more information on specific aspects of the model, and clarifying equations for better understanding. These suggestions would enhance the overall quality of the paper.","label":191}
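For readers unfamiliar with the pointer mechanism discussed in the two reviews above, here is a minimal sketch of the attention step that yields a distribution over input positions (candidate link targets) instead of over an output vocabulary. It follows the generic Pointer Network formulation, not the reviewed paper's exact joint model; layer sizes and names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointerAttention(nn.Module):
    """Minimal pointer step: at each decoding position the model scores every
    encoder state (argument component) and normalises the scores into a
    distribution over candidate link targets."""
    def __init__(self, hidden=128):
        super().__init__()
        self.w_enc = nn.Linear(hidden, hidden, bias=False)
        self.w_dec = nn.Linear(hidden, hidden, bias=False)
        self.v = nn.Linear(hidden, 1, bias=False)

    def forward(self, enc_states, dec_state):
        # enc_states: (n_components, hidden); dec_state: (hidden,)
        scores = self.v(torch.tanh(self.w_enc(enc_states)
                                   + self.w_dec(dec_state))).squeeze(-1)
        return F.softmax(scores, dim=-1)   # distribution over link targets
```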
{"id":"eb943ff5-d753-4368-bdb3-67a4baa0dbbe","text":".In this paper, the authors address the problem of relation extraction using an end-to-end neural model with global optimization. They begin by highlighting the recent success of neural networks in relation extraction and mention that state-of-the-art models typically utilize a local classifier for solving the task incrementally. However, the authors note that previous research using statistical models has shown that global optimization can lead to better performance in relation extraction. To address this, the authors propose a novel LSTM-based approach that incorporates global optimization for end-to-end relation extraction. They introduce new LSTM features to enhance representation learning and conduct experiments on two standard benchmarks to evaluate the effectiveness of their proposed model. The experimental results demonstrate that the proposed model achieves the best performance compared to existing methods. The authors effectively establish the significance of global optimization in relation extraction and introduce valuable contributions to the field. Overall, this paper provides a thorough investigation of end-to-end relation extraction using a globally optimized neural model, and the results indicate the potential for further advancements in this area.","label":0}
{"id":"023cc065-86c5-42bf-975c-ec2abcd91b67","text":"- Strengths:\r\nZero-shot relation extraction is an interesting problem. The authors have\r\ncreated a large dataset for relation extraction as question answering which\r\nwould likely be useful to the community.\r\n\r\n- Weaknesses:\r\nComparison and credit to existing work is severely lacking. Contributions of\r\nthe paper don't seen particularly novel.\r\n\r\n- General Discussion:\r\n\r\nThe authors perform relation extraction as reading comprehension. In order to\r\ntrain reading comprehension models to perform relation extraction, they create\r\na large dataset of 30m \u201cquerified\u201d (converted to natural language)\r\nrelations by asking mechanical turk annotators to write natural language\r\nqueries for relations from a schema. They use the reading comprehension model\r\nof Seo et al. 2016, adding the ability to return \u201cno relation,\u201d as the\r\noriginal model must always return an answer. The main motivation\/result of the\r\npaper appears to be that the authors can perform zero-shot relation extraction,\r\nextracting relations only seen at test time.\r\n\r\nThis paper is well-written and the idea is interesting. However, there are\r\ninsufficient experiments and comparison to previous work to convince me that\r\nthe paper\u2019s contributions are novel and impactful.\r\n\r\nFirst, the authors are missing a great deal of related work: Neelakantan at al.\r\n2015 (https:\/\/arxiv.org\/abs\/1504.06662) perform zero-shot relation extraction\r\nusing RNNs over KB paths. Verga et al. 2017 (https:\/\/arxiv.org\/abs\/1606.05804)\r\nperform relation extraction on unseen entities..Overall, while the paper presents an interesting approach to zero-shot relation extraction via reading comprehension, there are several areas that need to be addressed in order to strengthen the contributions and novelty of the work. Firstly, the authors need to provide a more comprehensive review of related work, particularly in the area of zero-shot relation extraction. Neelakantan et al. (2015) and Verga et al. (2017) have explored similar problems and it is important to compare and contrast their approaches with the proposed method. Secondly, the paper lacks in-depth experimental analysis and comparison with existing methods. It would be beneficial to conduct experiments on multiple benchmark datasets and compare the performance of the proposed method with state-of-the-art approaches. Additionally, providing a detailed discussion on the limitations and potential future directions of the work would enhance the overall impact of the paper. Despite these weaknesses, the authors should be commended for creating a large dataset for relation extraction as question answering, which can contribute to advancing research in this area. In conclusion, further improvements and evaluation are necessary to establish the novelty and effectiveness of the proposed zero-shot relation extraction approach.","label":193}
{"id":"57cd0dce-0dde-4aff-86e6-8067b026367c","text":"The paper presents a method for relation extraction based on converting the\r\ntask into a question answering task. The main hypothesis of the paper is that\r\nquestions are a more generic vehicle for carrying content than particular\r\nexamples of relations, and this hypothesis is supported by the experiments conducted in the paper. The authors propose a novel approach where relation extraction is transformed into a reading comprehension task, where natural language questions are associated with each relation slot. This conversion allows for the utilization of recent neural reading-comprehension techniques, enabling the development of high-quality relation-extraction models. Additionally, the authors demonstrate the ability to build large training sets by combining relation-specific crowd-sourced questions with distant supervision. The experiments conducted on a Wikipedia slot-filling task exhibit impressive results, showcasing the model's capacity to generalize to new questions for known relations with high accuracy. Moreover, the zero-shot learning capabilities of the proposed approach are highlighted, as the model can extract new relations that are only specified at test-time, even without labeled training examples. Although the zero-shot generalization to unseen relations achieves lower accuracy levels, it paves the way for future advancements in this field. Overall, the paper offers a promising solution to the challenging task of relation extraction, presenting innovative ideas and achieving notable results.","label":37}
{"id":"29945f4f-b71d-49fd-90ca-ccf4b35cf114","text":"The paper models the relation extraction problem as reading comprehension and\r\nextends a previously proposed reading comprehension (RC) model to extract\r\nunseen relations. The approach has two main components:\r\n\r\n1. Queryfication: Converting a relation into natural question. Authors use\r\ncrowdsourcing for this part.\r\n\r\n2. Applying RC model on the generated questions and sentences to get the answer\r\nspans. Authors extend a previously proposed approach to accommodate situations\r\nwhere there is no correct answer in the sentence.\r\n\r\nMy comments:\r\n\r\n1. The paper reads very well and the approach is clearly explained.\r\n\r\n2. In my opinion, though the idea of using RC for relation extraction is\r\ninteresting and novel, the approach is not novel. A part of the approach is\r\ncrowdsourced and the other part is taken directly from a previous work, as I\r\nmention above.\r\n\r\n3. Relation extraction is a well studied problem and there are plenty of\r\nrecently published works on the problem. However, authors do not compare their\r\nmethods against any of the previous works. This raises suspicion on the\r\neffectiveness of the approach. As seen from the abstract, the authors claim that their approach can learn high-quality relation-extraction models by extending recent neural reading-comprehension techniques and build large training sets using crowdsourced questions and distant supervision. They also mention the possibility of zero-shot learning for new relations that are only specified at test-time. While the approach presented in the paper is interesting, there are some concerns that need to be addressed. Firstly, the approach is not entirely novel, as it combines crowdsourcing and a previously proposed RC model. Secondly, the lack of comparison with other recent works on relation extraction raises questions about the effectiveness of the proposed approach. It would be beneficial for the authors to include a thorough evaluation and comparison with existing methods to provide a clearer understanding of the strengths and limitations of their approach. Overall, the paper is well-written and the idea of using RC for relation extraction is promising, but further experimentation and comparison are needed to establish its significance in the field.","label":161}
{"id":"48c915ac-33d6-4a36-8f95-dceab15b87a7","text":"- Strengths: The idea to investigate the types of relations between lexical\r\nitems is very interesting and challenging..The idea to investigate the types of relations between lexical items is very interesting and challenging. This exploration of pre-trained word embeddings to identify generic types of semantic relations in an unsupervised experiment has the potential to significantly contribute to the field of natural language processing. By proposing a new relational similarity measure based on the combination of word2vec's CBOW input and output vectors, the authors are able to outperform concurrent vector representations, as demonstrated by their results on the SemEval 2010 Relation Classification data. This suggests that the proposed approach holds promise for uncovering meaningful semantic relationships in a variety of contexts.","label":17}
{"id":"0f051234-318a-4176-9b77-6375e1ccd971","text":"This paper investigates the application of distributional vectors of meaning in\r\ntasks that involve the identification of semantic relations, similar to the\r\nanalogical reasoning task of Mikolov et al. (2013): Given an expression of the\r\nform \u201cX is for France what London is for the UK\u201d, X can be approximated by\r\nthe simple vector arithmetic operation London-UK+France. The authors argue that\r\nthis simple method can only capture very specific forms of analogies, and they\r\npresent a measure that aims at identifying a wider range of relations in a more\r\neffective way.\r\n\r\nI n this paper, the authors present a novel approach to exploring vector spaces for identifying semantic relations. They start by discussing the use of word embeddings, which have been successful in tasks involving lexical semantic similarities between individual words. The authors specifically mention the use of unsupervised methods and cosine similarity for obtaining encouraging results in analogical similarities.\r\n\r\nHowever, the authors argue that this simple method can only capture very specific forms of analogies and that there is a need for a more effective measure to identify a wider range of relations. To address this, they propose a new relational similarity measure that combines word2vec's CBOW input and output vectors. This measure outperforms concurrent vector representations when applied to unsupervised clustering on SemEval 2010 Relation Classification data.\r\n\r\nThis paper makes a valuable contribution to the field by expanding the scope of semantic relation identification beyond specific forms of analogies. The approach presented by the authors is both innovative and practical. By leveraging pre-trained word embeddings and introducing a new relational similarity measure, they offer a promising technique for identifying generic types of semantic relations.\r\n\r\nThe experimental evaluation conducted on the SemEval 2010 Relation Classification data provides empirical evidence to support the effectiveness of the proposed approach. The results show that the combination of word2vec's CBOW input and output vectors leads to improved performance in unsupervised clustering.\r\n\r\nOne potential limitation of this work is the reliance on pre-trained word embeddings. While pre-trained embeddings provide a convenient starting point for many natural language processing tasks, they may not always capture domain-specific nuances. Therefore, it would be interesting to explore the performance of the proposed approach with domain-specific embeddings.\r\n\r\nIn conclusion, this paper presents an important exploration of vector spaces for semantic relations. The authors' dedication to investigating different approaches and their in-depth experimental evaluation make this work rigorous and reliable. The proposed measure shows promise in identifying a wider range of semantic relations, and further research can build upon this work to continue advancing the field.","label":84}
{"id":"bb3cbc9b-3f59-4428-9f07-fc696a8548a5","text":"This paper presents a comparison of several vector combination techniques on\r\nthe task of relation classification.\r\n\r\n- Strengths:\r\n\r\nThe paper is clearly written and easy to understand.\r\n\r\n- Weaknesses:\r\n\r\nMy main complaint about the paper is the significance of its contributions. I\r\nbelieve it might be suitable as a preliminary exploration of vector spaces for semantic relations, but it lacks a more comprehensive evaluation and comparison with existing methods. The authors should consider conducting further experiments to validate their findings and provide more insights into the performance of their proposed relational similarity measure. Additionally, the paper would benefit from a more detailed discussion on the limitations and potential future directions of their work. Overall, while the paper presents an interesting approach and shows promising results, it falls short in fully addressing the research problem and providing a solid contribution to the field. Further revisions and improvements are needed to strengthen the paper and make it more impactful.","label":42}
{"id":"59b42c18-c119-4955-8942-c5bfdee001aa","text":"- Strengths:\r\n\r\nThe authors propose a kernel-based method that captures high-order patterns\r\ndifferentiting different types of rumors by evaluating the similarities between\r\ntheir propagation tree structures.\r\n\r\n- Weaknesses:\r\n\r\nmaybe the maths is not always clear in Sect. 4. \r\n\r\n- General Discussion:\r\n\r\nThe authors propose a propagation tree kernel, a The authors propose a propagation tree kernel, a powerful tool for detecting rumors in microblog posts. By modeling the diffusion of microblog posts using propagation trees, they are able to analyze how an original message is transmitted and developed over time, which is crucial in identifying rumors. The kernel-based approach allows for the evaluation of similarities between propagation tree structures, capturing high-order patterns that differentiate different types of rumors. This method shows promising results in terms of both speed and accuracy, outperforming state-of-the-art rumor detection models. However, it would be helpful if the authors could provide more clarity in the mathematical aspects of their approach in Section 4. Overall, this paper contributes significantly to the understanding and detection of rumors in the context of microblogging platforms.","label":43}
{"id":"798c7f72-4c38-4d59-9cd6-8895059f76c9","text":"This paper introduces new configurations and training objectives for neural\r\nsequence models in a multi-task setting. As the authors describe well, the\r\nmulti-task setting is important because some tasks have shared information\r\nand in some scenarios learning many tasks can improve overall performance.\r\n\r\nThe methods section is relatively clear and logical, and I like where it ended\r\nup, though it could be slightly better organized. The organization that I\r\nrealized after reading is that there are two problems: 1) shared features end\r\nup in the private feature space, and 2) private features end up in the \r\nshared space. There is one novel method for each problem. That organization up\r\nfront would make the methods more cohesive. In any case, they introduce one \r\nmethod that keeps task-specific features out of shared representation\r\n(adversarial\r\nloss) and another to keep shared features out of task-specific representations\r\n(orthogonality constraints). My only point of confusion is the adversarial\r\nsystem.\r\nAfter LSTM output there is another layer, D(s^k_T, \\theta_D), relying on\r\nparameters\r\nU and b. This output is considered a probability distribution which is compared\r\nagainst the actual. This means it is possible it will just learn U and b that\r\neffectively mask task-specific information from the LSTM outputs, and doesn't \r\nseem like it can guarantee task-specific information is removed.\r\n\r\nBefore I read the evaluation section I wrote down what I hoped the experiments\r\nwould look like and it did most of it. This is an interesting idea and there\r\nare \r\na lot more experiments one can imagine but I think here they have done a good job of exploring a variety of text classification tasks. They conducted experiments on a total of 16 different tasks, which demonstrates the robustness of their proposed approach. It is worth noting that the datasets used in these tasks are publicly available, further contributing to the reproducibility and transparency of their work. Additionally, the authors highlight the transferability of the shared knowledge learned by their proposed model. This is an important aspect in multi-task learning, as it allows the learned features to be easily applied to new tasks without the need for extensive retraining. Overall, the paper presents a novel framework for adversarial multi-task learning in text classification and provides extensive experimental evidence to support the effectiveness of their approach. However, there are a few areas that could be further addressed to strengthen the paper. Firstly, the evaluation section could benefit from a more detailed analysis of the experimental results. While the authors demonstrate the benefits of their approach over baselines, it would be insightful to understand the reasons behind these improvements. Additionally, the limitations of their approach should be discussed, along with possible avenues for future research. This would provide readers with a better understanding of the potential risks and challenges associated with implementing such a framework. Lastly, the paper could also benefit from a more extensive discussion of related work. Although the authors briefly mention some existing approaches, a more comprehensive review of prior art in adversarial multi-task learning for text classification would further position their work in the broader research landscape. 
Overall, the paper presents a strong contribution to the field and opens up new avenues for adversarial multi-task learning in text classification. With some additional improvements and clarifications, this work has the potential to make a significant impact in the field of natural language processing.","label":239}
{"id":"e623e8c6-3ac6-4627-bcc3-2bc02d00819b","text":"# Paper summary\r\n\r\nThis paper presents a method for learning well-partitioned shared and\r\ntask-specific feature spaces for LSTM text classifiers. Multiclass adversarial\r\ntraining encourages shared space representations from which a discriminative\r\nclassifier cannot identify the task source (and are thus generic). The models\r\nevaluates are a fully-shared, shared-private and adversarial shared-private --\r\nthe lattermost ASP model is one of the main contributions. They also use\r\northogonality constraints to help reward shared and private spaces that are\r\ndistinct. The ASP model has lower error rate than single-task and other\r\nmulti-task neural models. They also experiment with a task-level cross\r\nvalidation to explore whether the shared representation can transfer across\r\ntasks, and it seems to favourably..The paper provides a comprehensive overview of the proposed adversarial multi-task learning framework for text classification. The authors address the limitations of existing approaches by introducing a novel approach that separates shared and private latent feature spaces to avoid contamination. The experiments conducted on 16 different text classification tasks demonstrate the benefits of the proposed approach. It is noteworthy that the shared knowledge learned by the proposed model can be easily transferred to new tasks, indicating its potential as off-the-shelf knowledge. The availability of the datasets used in the experiments adds to the reproducibility of the results. The use of multiclass adversarial training and orthogonality constraints further enhances the effectiveness of the shared and private spaces. The authors discuss the significance of the ASP model and its superior performance compared to single-task and other multi-task neural models. However, it would be beneficial to include more details on the experimental setup, such as the specific datasets used, the size of the datasets, and the evaluation metrics employed. Additionally, providing statistical significance tests for the observed improvements would strengthen the validity of the results. It would also be interesting to explore the computational efficiency of the proposed approach and compare it to other methods. Overall, the paper presents a valuable contribution to the field of text classification and multi-task learning, and the results obtained justify further investigation and potential applications of the proposed framework.","label":103}
{"id":"269692f3-67c5-46c5-a93c-c9be9e511da2","text":"The paper describes an idea to learn phrasal representation and facilitate them\r\nin RNN-based language models and neural machine translation\r\n\r\n-Strengths:\r\n\r\nThe idea to incorporate phrasal information into the task is interesting.\r\n\r\n- Weaknesses:\r\n\r\n- The description is hard to follow. Proof-reading by an English native speaker\r\nwould benefit the understanding\r\n- The evaluation of the approach has several weaknesses\r\n\r\n- General discussion\r\n\r\n- In Equation 1 and 2 the In Equation 1 and 2, the authors introduce their pRNN framework for phrase representation. However, the equations lack explanatory context. It would be helpful to provide more detailed explanations of the variables and their roles in the model. Additionally, the paper does not provide a clear description of the dataset used for evaluation. It is important to specify the size and characteristics of the dataset to understand the reliability and generalizability of the results. Furthermore, the evaluation metrics used are not clearly described. It would be beneficial to include specific details about the metrics and how they were calculated. In terms of strengths, the idea of incorporating phrasal information into RNN-based models is intriguing, as it has the potential to improve performance on language modeling and machine translation tasks. However, a more thorough analysis of the effectiveness of the proposed pRNNs, such as comparison to other state-of-the-art models, would strengthen the paper. Overall, the paper presents an interesting concept but requires further clarification and a more robust evaluation.","label":62}
{"id":"f338c2a2-1d25-4e0e-b33d-0f92bed4b175","text":"This paper proposed a new phrasal RNN architecture for sequence to sequence\r\ngeneration. They have evaluated their architecture based on (i) the language\r\nmodelling test evaluated on PTB and FBIS and (ii) Chinese-English machine\r\ntranslation task on NIST MT02-08 evaluation sets. The phrasal RNN (pRNN)\r\narchitecture is achieved by generating subnetworks of phrases. \r\n\r\nStrengths\r\n====\r\n\r\nA new phrasal architecture. \r\n\r\nWeaknesses\r\n====\r\n\r\n**Technical**: \r\n\r\nIt's unclear whether there is a limit set on the phrase length of the pRNN.\r\nMaybe I've missed this in the paper, if there is, please be more explicit about\r\nit because it affects the model quite drastically if for every sentence the\r\nlargest phrase length is the sentence length. \r\n\r\n - It's because if the largest phrase length is the sentence length, then model\r\ncan be simplified into a some sort of convolution RNN where the each state of\r\nthe RNN goes through some convolution layer before a final softmax and\r\nattention. \r\n\r\n - If there is a limit set on the phrase length of pRNN, then it makes the\r\nsystem more tractable. But that would also mean that the phrases are determined\r\nby token ngrams which produces a sliding window of the \"pyramid encoders\" for\r\neach sentence where there are instance where the parameter for these phrases\r\nwill be set close to zero to disable the phrases and these phrases would be a\r\ngood intrinsic evaluation of the pRNN in addition to evaluating it purely on\r\nperplexity and BLEU extrinsically. \r\n\r\nThe usage of attention mechanism without some sort of pruning might be\r\nproblematic at the phrasal level. The author have opted for some sort of greedy\r\npruning as described in the caption of figure 4. But I support given a fixed\r\nset of phrase pairs at train time, the attention mechanism at the phrasal level\r\ncan be pre-computed but at inference (apply the attention on new data at test\r\ntime), this might be kind of problematic when the architecture is scaled to a\r\nlarger dataset. \r\n\r\n**Empirical**: \r\n\r\nOne issue with the language modelling experiment is the choice of evaluation\r\nand train set..One issue with the language modelling experiment is the choice of evaluation and train set. The paper mentions using the standard Penn Treebank (PTB) and FBIS English dataset, but it would have been beneficial to provide more information about these datasets. Specifically, it would be helpful to know the size of the datasets and how they were chosen. Additionally, it would be valuable to compare the performance of the proposed phrasal RNN (pRNN) model on other benchmark datasets to demonstrate its generalizability.\r\n\r\nFurthermore, the paper lacks a comprehensive comparison with other state-of-the-art models. While it mentions outperforming a strong sequence-to-sequence baseline in the Chinese-English machine translation task, it would be useful to compare the pRNN model against other popular neural network architectures such as LSTMs and transformers. This would provide a better understanding of the strengths and weaknesses of the pRNN model and its relative performance.\r\n\r\nIn terms of technical weaknesses, the paper could benefit from providing more details about the specific implementation of the pRNN architecture. Although the authors propose the use of the attention mechanism, it is not clear how it is integrated into the model. 
Additionally, the paper could provide more information about the training process, such as the optimization algorithm used and the hyperparameter settings.\r\n\r\nOverall, the phrasal recurrent neural network (pRNN) architecture proposed in this paper has the potential to be a valuable addition to the field of sequence to sequence generation. However, further experiments and comparisons are needed to validate its effectiveness and understand its limitations.","label":319}
{"id":"5484c13e-5b4b-4043-9cf0-07dd226466da","text":"This paper develops an LSTM-based model for classifying connective uses for\r\nwhether they indicate that a causal relation was intended. The guiding idea is\r\nthat the expression of causal relations is extremely diverse and thus not\r\namenable to syntactic treatment, and that the more abstract representations\r\ndelivered by neural models are therefore more suitable as the basis for making\r\nthese decisions.\r\n\r\nThe experiments are on the AltLex corpus developed by Hidley and McKeown. The\r\nresults offer modest but consistent support for the general idea, and they\r\nprovide some initial insights into how best to translate this idea into a\r\nmodel. The paper distribution includes the TensorFlow-based models used for the\r\nexperiments.\r\n\r\nSome critical comments and questions:\r\n\r\n* The introduction is unusual in that it is more like a literature review than\r\na full overview of what the paper contains. This leads to some redundancy with\r\nthe related work section that follows it. I guess I am open to a non-standard\r\nsort of intro, but this one really doesn't work: despite reviewing a lot of\r\nideas, it doesn't take a stand on what causation is or how it is expressed, but\r\nrather only makes a negative point (it's not reducible to syntax). We aren't\r\nreally told what the positive contribution will be except for the very general\r\nfinal paragraph of the section.\r\n\r\n* Extending the above, I found it disappointing that the paper isn't really\r\nclear about the theory of causation being assumed. The authors seem to default\r\nto a counterfactual view that is broadly like that of David Lewis, where\r\ncausation is a modal sufficiency claim with some other counterfactual\r\nconditions added to it. See line 238 The authors should provide a clearer explanation of the theory of causation being assumed in their study. While they briefly mention a counterfactual view similar to that of David Lewis, it would be useful for readers who are not familiar with this theory to have a more detailed description. Additionally, it would be valuable to discuss alternative theories of causation and explain why the chosen perspective is relevant to their research.One concern I have is with the methodology used in the experiments. It would be helpful to have more information about the AltLex corpus and how it was developed. Is it a widely accepted dataset in the field of causality classification? How was the corpus annotated and what steps were taken to ensure its reliability? Without this information, it is difficult to fully evaluate the validity and generalizability of the results.Furthermore, while the results of the experiments demonstrate that the proposed neural network architecture outperforms the state-of-the-art, it would be beneficial to have a more in-depth analysis and interpretation of these findings. For example, are there certain types of sentences or lexical markers that the model struggles to classify correctly? Are there any limitations or potential biases in the dataset that could affect the results? Providing more detailed insights and discussing potential avenues for future research would enhance the paper's contribution to the field.Lastly, the paper's writing style could be improved to increase clarity and readability. There are instances where the language is overly technical and difficult to understand without prior knowledge in the field. 
Simplifying complex concepts and explaining them in a more accessible manner would broaden the paper's audience and make it more engaging for readers.\r\n\r\nIn conclusion, while this paper presents an interesting approach to causal lexical marker disambiguation based on context, there are several aspects that could benefit from further development and clarification. Providing a clearer explanation of the assumed theory of causation, addressing concerns about the methodology, conducting a more detailed analysis of the results, and improving the writing style would strengthen the overall contribution of the paper.","label":253}
{"id":"59d2e826-0e46-437e-ac2b-2d7ab05c6c89","text":"This paper proposes a method for detecting causal relations between clauses,\r\nusing neural networks (\"deep learning\", although, as in many studies, the\r\nnetworks are not particularly deep). Indeed, while certain discourse\r\nconnectives are unambiguous regarding the relation they signal (e.g. 'because'\r\nis causal) the paper takes advantage of a recent dataset (called AltLex, by\r\nHidey and McKeown, 2016) to solve the task of identifying causal vs. non-causal\r\nrelations when the relation is not explicitly marked. Arguing that\r\nconvolutional networks are not as adept as representing the relevant features\r\nof clauses as LSTMs, the authors propose a classification architecture which\r\nuses a long short-term memory (LSTM) network. They first preprocess the data by tokenizing the sentences, converting them into word vectors, and padding them to a fixed length. Then, they train the LSTM model using the AltLex dataset, which contains sentences with causal and non-causal relations. The authors compare the performance of their model with a convolutional neural network (CNN) and a support vector machine (SVM) baseline. They find that their LSTM model achieves higher accuracy and outperforms both the CNN and SVM baselines. One of the strengths of this paper is the use of the AltLex dataset, which allows the authors to tackle the challenging task of identifying causal relations in sentences without explicit markers. The dataset provides a valuable resource for training and evaluating causality classification models. The authors also make a compelling argument for the superiority of LSTMs over CNNs in capturing the semantic meaning of sentences. They suggest that LSTMs are better at modeling long-range dependencies and capturing contextual information, which is crucial for disambiguating the causal meaning of sentences. The experimental results support their claim, showing that the LSTM model consistently outperforms the CNN model. Additionally, the paper is well-organized and clearly presents the methodology, experimental setup, and results. The authors provide sufficient details about the data preprocessing steps, the architecture of the LSTM model, and the hyperparameters used in the experiments. They also conduct thorough experiments and report the results with appropriate statistical analysis. However, there are a few aspects that could be further improved in this paper. Firstly, the authors could provide more insights into the limitations and potential challenges of their approach. While the LSTM model shows promising results on the AltLex dataset, it would be helpful to discuss the generalizability of the model to different domains and languages. Furthermore, the authors could explore the interpretability of their model by analyzing the attention weights of the LSTM. Understanding which parts of the sentence contribute the most to the classification decision would provide valuable insights into the causal reasoning process. Lastly, the authors could consider comparing their LSTM model with other state-of-the-art models for causality classification. This would help establish the effectiveness of their proposed architecture against existing approaches. In conclusion, this paper presents a neural network architecture based on LSTMs for the task of detecting causal relations in sentences. The authors successfully demonstrate the superiority of the LSTM model over CNNs and SVMs on the AltLex dataset. 
The paper is well-written and provides valuable contributions to the field of causality classification. With some additional improvements and further investigations, this work has the potential to advance our understanding of how humans comprehend causal relations in natural language.","label":94}
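The two records above describe the causality-classification pipeline (tokenize the sentences, map them to word vectors, pad to a fixed length, then train an LSTM to label causal vs. non-causal) only in prose. Purely as a hedged illustration of that kind of pipeline (not the reviewed authors' code; the vocabulary size, layer sizes, and stand-in data are invented here), the steps map onto a few lines of Keras:

```python
# Illustrative sketch only: a toy causal/non-causal sentence classifier in the spirit
# of the pipeline the reviews describe (tokenize, index, pad, LSTM, binary output).
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = ["the storm caused widespread flooding",
             "the report was published on monday"]       # toy stand-in for AltLex text
labels = np.array([1, 0])                                 # 1 = causal, 0 = non-causal

tokenizer = Tokenizer(num_words=10000, oov_token="<unk>")
tokenizer.fit_on_texts(sentences)
x = pad_sequences(tokenizer.texts_to_sequences(sentences), maxlen=50)  # fixed length

model = models.Sequential([
    layers.Embedding(input_dim=10000, output_dim=100),    # word vectors learned jointly
    layers.LSTM(64),                                       # clause/sentence representation
    layers.Dense(1, activation="sigmoid"),                 # causal vs. non-causal decision
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x, labels, epochs=2, verbose=0)
```

A pretrained embedding matrix could be substituted into the Embedding layer if fixed word vectors, as described in the review, are preferred over jointly learned ones.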
{"id":"90e8e70d-4e8c-4723-a082-6b502ab01bf9","text":"The authors present a new version of the coreference task tailored to\r\nWikipedia. The task is to identify the coreference chain specifically\r\ncorresponding to the entity that the Wikipedia article is about. The authors\r\nannotate 30 documents with all coreference chains, of which roughly 25% of the\r\nmentions refer to the \"main concept\" of the article. They then describe some\r\nsimple baselines and a basic classifier which outperforms these. Moreover, they\r\nintegrate their classifier into the Stanford (rule-based) coreference system\r\nand see substantial benefit over all state-of-the-art systems on Wikipedia.\r\n\r\nI think this paper proposes an interesting twist on coreference that makes good\r\nsense from an information extraction perspective, has the potential to somewhat\r\nrevitalize and shake up coreference research, and might bridge the gap in an\r\ninteresting way between coreference literature and entity linking literature. \r\nI am sometimes unimpressed by papers that dredge up a new task that standard\r\nsystems perform poorly on and then propose a tweak so that their system does\r\nbetter. However, in this case, the actual task itself is quite motivating to me\r\nand rather than the authors fishing for a new domain to run things in, it\r\nreally does feel like \"hey, wait, these standard systems perform poorly in a\r\nsetting that's actually pretty important.\"\r\n\r\nTHE TASK: Main concept resolution is an intriguing task from an IE perspective.\r\n I can imagine many times where documents revolve primarily around a particular\r\nentity (biographical documents, dossiers or briefings about a person or event,\r\nclinical records, etc.) and where the information we care about extracting is\r\nspecific to that.entity. The authors demonstrate the value of adapting coreference resolution to Wikipedia by presenting a classifier that outperforms fair baselines built on top of state-of-the-art coreference resolution systems. This classifier takes advantage of the Wikipedia markup and external knowledge bases like Freebase to acquire useful information on entities, which helps in classifying mentions as coreferent or not. The evaluation of the classifier shows significant improvements over the baselines, indicating its effectiveness in the context of Wikipedia articles.\r\n\r\nOne notable aspect of this paper is the integration of the classifier into the Stanford coreference system. By incorporating their classifier into an existing rule-based system, the authors demonstrate the benefits of their approach in a full coreference resolution pipeline applied to Wikipedia texts. The results show substantial improvements over all state-of-the-art systems on Wikipedia, highlighting the potential of this approach in addressing coreference resolution challenges in the context of Wikipedia articles.\r\n\r\nOverall, I find this paper to be an interesting and valuable contribution to the field of coreference resolution. The authors address a unique and important task of main concept resolution in Wikipedia articles, which has not been extensively studied before. The combination of leveraging Wikipedia markup and external knowledge bases to improve coreference resolution performance is a novel and promising approach. The experimental results presented in the paper provide strong evidence of the effectiveness of the proposed classifier.\r\n\r\nHowever, there are a few areas that could be further improved. 
Firstly, the authors only evaluate their classifier on a limited set of 30 annotated documents. It would be beneficial to assess its performance on a larger and more diverse dataset to establish its generalizability. Secondly, the paper lacks a detailed analysis of the errors made by the classifier, which could provide insights into the limitations and potential areas for improvement. Additionally, it would be interesting to explore the impact of different external knowledge bases and their accuracies on coreference resolution performance.\r\n\r\nIn conclusion, I believe that this paper makes a valuable contribution to the field and opens up new possibilities for improving coreference resolution in the specific context of Wikipedia. The proposed classifier shows promising results and the integration into an existing coreference resolution system demonstrates its practical applicability. I look forward to further developments and investigations in this area.","label":243}
{"id":"0d9de9f1-fc40-47bd-8cf5-26e6087c6f12","text":"General comments\r\n=============================\r\nThe paper reports experiments on predicting the level of compositionality of\r\ncompounds in English. \r\nThe dataset used is a previously existing set of 90 compounds, whose\r\ncompositionality was ranked from 1 to 5\r\n(by a non specified number of judges).\r\nThe general form of each experiment is to compute a cosine similarity between\r\nthe vector of the compound (treated as one token) and a composition of the\r\nvectors of the components.\r\nEvaluation is performed using a Spearman correlation between the cosine\r\nsimilarity and the human judgments.\r\n\r\nThe experiments vary\r\n- for the vectors used: neural embeddings versus syntactic-context count\r\nvectors\r\n- and for the latter case, whether plain or \"aligned\" vectors should be used,\r\nfor the dependent component of the compound. The alignment tries to capture a\r\nshift from the dependent to the head. Alignment were proposed in a previous\r\nsuppressed reference.\r\n\r\nThe results indicate that syntactic-context count vectors outperform\r\nembeddings, and the use of aligned alone performs less well than non-modified\r\nvectors, and a highly-tuned combination of aligned and unaligned vectors\r\nprovides a slight improvement.\r\n\r\nRegarding the form of the paper, I found the introduction quite well written,\r\nbut other parts (like section 5.1) are difficult to read, although the\r\nunderlying notions are not very complicated. Rephrasing with running examples\r\ncould help.\r\n\r\nRegarding the substance, I have several concerns:\r\n\r\n- the innovation with respect to Reddy et al. seems to be the use of the\r\naligned vectors\r\nbut they have been published in a previous \"suppressed reference\" by the\r\nauthors.\r\n\r\n- the dataset is small, and not enough described. In particular, ranges of\r\nfrequences are quite likely to impact the results. \r\nSince the improvements using aligned vectors are marginal, over a small\r\ndataset, in which it is unclear how the choice of the compounds was performed,\r\nI find that the findings in the paper are quite fragile.\r\n\r\nMore detailed comments\/questions\r\n================================\r\n\r\nSection 3\r\n\r\nI don't understand the need for the The paper presents experiments on predicting the level of compositionality of compounds in English. The dataset used consists of 90 compounds that were ranked for their compositionality from 1 to 5 by a group of judges. The experiments involve computing cosine similarity between the vector of the compound and a composition of the vectors of its components. The evaluation is done using Spearman correlation between the cosine similarity and the human judgments. The paper compares the performance of neural embeddings and syntactic-context count vectors and also explores the use of aligned vectors for the dependent component of the compound.\r\n\r\nOne concern I have is that the paper does not clearly state the criteria or process followed for selecting the compounds in the dataset. Additionally, the paper does not provide information about the frequency ranges of the compounds, which could potentially impact the results. The small size of the dataset and the lack of detailed description make the findings in the paper quite fragile.\r\n\r\nIn terms of readability, I found the introduction to be well-written, but other sections, such as Section 5.1, were difficult to understand. 
The underlying notions are not overly complicated, but rephrasing with running examples could greatly improve clarity.\r\n\r\nFurthermore, it appears that the use of aligned vectors is the main innovation compared to previous work by Reddy et al., but the authors mention that these aligned vectors were published in a previous 'suppressed reference.' It would be helpful if the authors could provide more information about this reference and explain how their work builds upon it.\r\n\r\nOverall, while the paper presents some interesting findings, the limited size and description of the dataset, as well as the marginal improvements using aligned vectors, raise concerns about the robustness of the results. It would be beneficial for the authors to address these issues and provide more clarity in their paper.","label":286}
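Both records above summarize the same evaluation protocol: compare a cosine similarity between the compound's vector and a composed vector of its parts against human compositionality judgments using Spearman correlation. A minimal sketch of that protocol, with made-up three-dimensional vectors and invented judgment scores rather than the paper's 90-compound dataset, could look like this:

```python
# Minimal sketch of the protocol described above: cosine similarity between a
# compound's own vector and a composition (here the average) of its components'
# vectors, correlated with human compositionality judgments via Spearman's rho.
# The vectors and judgment scores below are invented placeholders, not the paper's data.
import numpy as np
from scipy.spatial.distance import cosine
from scipy.stats import spearmanr

vectors = {                      # hypothetical embedding lookup
    "swimming_pool": np.array([0.9, 0.1, 0.3]),
    "swimming":      np.array([0.8, 0.2, 0.1]),
    "pool":          np.array([0.7, 0.1, 0.5]),
    "couch_potato":  np.array([0.1, 0.9, 0.2]),
    "couch":         np.array([0.2, 0.1, 0.8]),
    "potato":        np.array([0.1, 0.2, 0.7]),
}
compounds = [("swimming_pool", "swimming", "pool"),     # (compound, modifier, head)
             ("couch_potato", "couch", "potato")]
human_scores = [4.8, 1.2]        # 1 = opaque .. 5 = fully compositional (invented)

predicted = []
for compound, modifier, head in compounds:
    composed = (vectors[modifier] + vectors[head]) / 2.0          # additive composition
    predicted.append(1.0 - cosine(vectors[compound], composed))   # cosine similarity

rho, _ = spearmanr(predicted, human_scores)
print(f"Spearman correlation with human judgments: {rho:.2f}")
```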
{"id":"7335e168-dfb0-44a7-ba51-9ad0406f1267","text":"I am buying some of the motivation: the proposed method is much faster to train\r\nthan it is to train a neural network. Also, it keeps some properties of the\r\ndistribution when going to lower dimensionality. \r\n\r\nHowever, I am not convinced why it is so important for vectors to be\r\ntransformable with PPMI.\r\n\r\nMost importantly, there is no direct comparison to related work.\r\n\r\nDetailed comments:\r\n\r\n- p.3: Overall, this paper presents a novel technique for constructing semantic spaces using positive-only projections (PoP). The motivation behind this method is well supported, as it offers advantages in terms of training speed and preservation of distribution properties. However, the paper could benefit from providing a more detailed explanation of why it is important for vectors to be transformable with PPMI. Additionally, it is crucial to include a direct comparison to related work to showcase the strengths and weaknesses of the proposed method. The lack of such comparison limits the assessment of the paper's contribution in the context of existing literature. Addressing these points would greatly enhance the overall clarity and impact of the paper.","label":61}
{"id":"24d73851-bc1a-4a31-93b1-614a20bfabe5","text":"The paper proposes a neural approach to learning an image compression-decompression scheme as an auto-encoder. While the idea is certainly interesting and well-motivated, in practice, it turns out to achieve effectively identical rates to JPEG-2000.\r\n\r\nNow, as the authors argue, there is some value to the fact that this scheme was learned automatically rather than by expert design---which means it has benefits beyond the compression of natural images (e.g., it could be used to automatically learning a compression scheme for signals for which we don't have as much domain knowledge). However, I still believe that this makes the paper unsuitable for publication in its current form because of the following reasons---\r\n\r\n1. Firstly, the fact that the learned encoder is competitive---and not clearly better---than JPEG 2000 means that the focus of the paper should more be about the aspects in which the encoder is similar to, and the aspects in which it differs, from JPEG 2000. Is it learning from the same set of basis functions as JPEG 2000? Or is it utilizing a completely different set of basis functions? This information is crucial in understanding the novelty and potential advantages of the proposed approach. Moreover, it would be helpful if the authors could provide a thorough comparison of the computational efficiency of their method compared to JPEG 2000. While the paper mentions that their network is computationally efficient, it would be beneficial to have quantitative results or benchmarks to support this claim. Additionally, the paper claims that their approach outperforms recently proposed methods based on RNNs, but does not provide any details or results to support this claim. It would greatly strengthen the paper if the authors conducted comprehensive experiments to compare their method against these existing approaches. Overall, while the idea of using autoencoders for lossy image compression is promising, the current version of the paper lacks sufficient experimental evidence and comparisons to fully understand the novelty and effectiveness of the proposed approach.","label":157}
{"id":"b203b696-a5ee-43e8-bfe4-5295dca8220d","text":"This paper optimizes autoencoders for lossy image compression. Minimal adaptation of the loss makes autoencoders competitive with JPEG2000 and computationally efficient, while outperforming recently proposed approaches based on RNNs. Additionally, the use of a sub-pixel architecture allows for efficient compression of high-resolution images, which sets this work apart from previous studies that focused on smaller images or employed computationally expensive methods. Overall, this paper presents a novel and effective approach to optimizing autoencoders for lossy image compression, demonstrating its competitiveness with existing codecs and its potential to address the need for more flexible compression algorithms in the face of evolving media formats and hardware technology.","label":22}
{"id":"b6e380fe-b97d-46b5-a02a-f1b7dab1015a","text":"Dear reviewers, we made the following changes to our paper:\r\n\r\n\u2013 added direct comparison with VAE\/quantization approximation of Balle et al. (Figure 10)\r\n\u2013\u00a0added another control to Figure 3 (incremental training vs fixed small learning rate)\r\n\u2013\u00a0added a motivation and reference for MOS tests\r\n\u2013\u00a0added more detail to appendix to make reimplementation easier\r\n\u2013 improved caption of Figure 3A\r\n\r\nMinor:\r\n\u2013\u00a0added number of scales used in Gaussian scale mixtures\r\n\u2013\u00a0fixed reference to Figure 4 (which pointed to Figure 9 before)\r\n\r\nToderici et al. kindly provided us with their results which include entropy encoding*..Toderici et al. kindly provided us with their results which include entropy encoding, which greatly enhanced the credibility of our findings.","label":84}
{"id":"4b138fee-634b-4cdb-8bc3-5c4c97af3dcb","text":"Thank you for the important work. I'm trying to reproduce the results.\r\n\r\nThe loss function in eq 2. scales the distortion by a beta hyper parameter. In your experiment you mention training three variants, high bit-rate: beta=0.01, medium: 0.05 and low: 0.2\r\n\r\nSo, the high bit-rate variant weights the distortion the least? That does not make sense to me. Did you mean 1\/0.01, 1\/0.05 Yes, you are correct. The high bit-rate variant should indeed weight the distortion the most, rather than the least. In the paper, it seems that the values mentioned for beta were mistakenly inverted. Instead of beta=0.01 for high bit-rate, beta=100 would align with the other variants. Similarly, beta=20 for medium and beta=5 for low would be more appropriate. I would suggest clarifying this point in the paper to avoid any confusion for readers and to ensure accurate replication of your results. Overall, this minor issue does not diminish the significance of your work in proposing a new approach to optimizing autoencoders for lossy image compression. Your findings are still valuable and have the potential to provide a more flexible compression algorithm. Thank you for your contribution!","label":62}
{"id":"309a192c-a80b-4354-b0a2-0658743e6377","text":"This work proposes a new approach for image compression using auto encoders. The results are impressive, demonstrating that the proposed method can achieve competitive results compared to JPEG 2000 and surpassing recently proposed approaches based on RNNs. The use of a sub-pixel architecture in our network contributes to its computational efficiency, making it suitable for high-resolution images. This advantage distinguishes our work from previous studies that focused on coarser approximations, shallower architectures, computationally expensive methods, or small images. Moreover, the authors address the challenge of optimizing autoencoders directly by introducing minimal changes to the loss function. This clever approach overcomes the non-differentiability of compression loss and enables effective training of deep autoencoders. The paper presents these findings clearly, and provides thorough experimental evaluations to support their claims. Overall, this work makes a valuable contribution to the field of lossy image compression, proposing a flexible algorithm that performs well on a variety of media formats and content types.","label":16}
{"id":"bdb0fd84-93c3-47e3-8d3c-445c02110b8c","text":"This paper proposes an autoencoder approach to lossy image compression by minimizing the weighted sum of reconstruction error and code length. The architecture consists of a convolutional encoder and a sub-pixel convolutional decoder. Experiments compare PSNR, SSIM, and MS-SSIM performance against JPEG, JPEG-2000, and a recent RNN-based compression approach. A mean opinion score test was also conducted.\r\n\r\nPros:\r\n+ The paper is clear and well-written.\r\n+ The decoder architecture takes advantage of recent advances in convolutional approaches to image super-resolution.\r\n+ The proposed approaches to quantization and rate estimation are sensible and well-justified.\r\n\r\nCons:\r\n- The experimental baselines do not appear to be entirely complete.\r\n\r\nThe task of using autoencoders to perform compression is important and has a large practical impact. Though directly optimizing the rate-distortion tradeoff is not an easy task, the authors propose a novel approach that minimizes the weighted sum of reconstruction error and code length. The architecture of the autoencoder is based on convolutional neural networks (CNNs), which have shown great success in image processing tasks. The use of sub-pixel convolutional layers in the decoder allows for high-resolution image compression, making this approach suitable for various applications. The experimental results demonstrate the effectiveness of the proposed method, outperforming traditional codecs like JPEG and JPEG-2000, as well as other recent approaches based on RNNs. The inclusion of mean opinion score tests adds additional value to the evaluation. However, it would be beneficial if the authors could provide a more comprehensive comparison with other state-of-the-art compression methods. Overall, this paper presents a well-written and insightful contribution to the field of lossy image compression using autoencoders.","label":122}
{"id":"011f5696-2775-493c-819a-68d9b9bf076f","text":"The paper proposes a neural approach to learning an image compression-decompression scheme as an auto-encoder. While the idea is certainly interesting and well-motivated, in practice, it turns out to achieve effectively identical rates to JPEG-2000.\r\n\r\nNow, as the authors argue, there is some value to the fact that this scheme was learned automatically rather than by expert design---which means it has benefits beyond the compression of natural images (e.g., it could be used to automatically learning a compression scheme for signals for which we don't have as much domain knowledge). However, I still believe that this makes the paper unsuitable for publication in its current form because of the following reasons---\r\n\r\n1. Firstly, the fact that the learned encoder Firstly, the fact that the learned encoder achieves similar rates to JPEG-2000 raises concerns about the novelty and contribution of this work. While the automatic learning aspect is interesting, it does not seem to offer significant improvements in compression performance. In order to make the paper suitable for publication, the authors should provide a more thorough comparison with existing compression algorithms, including other neural network-based approaches and state-of-the-art codecs. This would help to establish the superiority of their proposed method. Additionally, the authors should conduct experiments on a wider range of image datasets to assess the generalizability of their approach. The evaluation section of the paper seems to lack depth and could be expanded to demonstrate the robustness and effectiveness of the compressive autoencoder on various types of images. Furthermore, it would be beneficial to include a discussion on the trade-off between compression rate and image quality in the proposed framework. It is important to understand how the compression performance scales with different levels of desired image quality. Overall, while the initial concept of using compressive autoencoders for lossy image compression is promising, further work is needed to strengthen and validate the findings presented in this paper.","label":117}
{"id":"c8347722-e699-494c-b373-1ede55ab056c","text":"Thank you for this important and inspiring work.\r\n\r\nI have some questions regarding the bound in eq. 8.\r\n\r\nMy understanding:\r\n\r\nThe bound in eq. 8 is used in the loss function as a proxy for the non-differentiable entropy of the codes. If the bound is minimized, a bound on the non-differentiable entropy of the codes can also be minimized. However, there is a concern about the tightness of this bound. Have you performed any experiments or sensitivity analysis to evaluate the tightness of the bound? It would be interesting to see how the training and compression performance varies with different values of the bound. Additionally, have you compared your approach with other state-of-the-art compression algorithms on different image datasets? Evaluating the performance on diverse content types could provide further insights into the effectiveness of your approach. Overall, this paper presents a promising technique for lossy image compression using compressive autoencoders and provides valuable contributions to the field. I look forward to seeing more experimental results and analysis in future work.","label":47}
{"id":"2d7eeb6c-a03d-43e1-bae2-b815876be2ac","text":"Great and very interesting work. I have some questions.\r\n\r\n1. It seems that the introduced scale parameters for different bit-rates are also trainable. How do you train and update the scale parameters for different bit-rates? Could you explain it in more details? Also, could you elaborate more about Figure 3A? What does the x-axis of the graph means? \r\n\r\n2. What was the reason behind choosing the sub-pixel architecture for the network? Did you conduct any experiments to compare its performance with other architectures? Overall, the paper presents a novel approach to lossy image compression using compressive autoencoders. The results are impressive, outperforming similar approaches and achieving competitive performance with JPEG 2000. The inclusion of a sub-pixel architecture makes the network computationally efficient for high-resolution images. However, I would suggest providing more details on the training and updating of the scale parameters for different bit-rates, as well as explaining the x-axis of Figure 3A. Additionally, it would be valuable to discuss any experiments conducted to compare the sub-pixel architecture with other architectures. Overall, this is an important contribution to the field of image compression and has the potential to address the need for more flexible compression algorithms.","label":61}
{"id":"2e0c4202-31e5-45bd-bb28-691442ec5e03","text":"The paper proposes a neural approach to learning an image compression-decompression scheme as an auto-encoder. While the idea is certainly interesting and well-motivated, in practice, it turns out to achieve effectively identical rates to JPEG-2000.\r\n\r\nNow, as the authors argue, there is some value to the fact that this scheme was learned automatically rather than by expert design---which means it has benefits beyond the compression of natural images (e.g., it could be used to automatically learning a compression scheme for signals for which we don't have as much domain knowledge). However, I still believe that this makes the paper unsuitable for publication in its current form because of the following reasons---\r\n\r\n1. Firstly, the fact that the learned encoder is competitive---and not clearly better---than JPEG-2000 raises questions about the practical applicability of the proposed approach. The authors mention that their scheme has benefits beyond the compression of natural images, but they do not provide any empirical evidence or experiments to support this claim. It would be valuable to see how the proposed method performs on other types of signals or data, such as text, audio, or video. Additionally, the paper lacks a thorough comparison to other state-of-the-art compression algorithms. While the authors briefly mention outperforming recently proposed approaches based on RNNs, they do not provide quantitative results or a comprehensive analysis. It would be beneficial to include a detailed comparison to other popular compression techniques like BPG and WebP, which are widely used in industry. Finally, the paper could benefit from a more detailed discussion on the limitations and challenges of the proposed approach. For example, the authors mention the sub-pixel architecture as a method to improve computational efficiency, but they do not provide any insights into the trade-offs or potential drawbacks of this technique. Overall, while the idea of using autoencoders for lossy image compression is interesting, the paper lacks sufficient experimental evidence, comprehensive comparisons, and detailed discussion to be considered suitable for publication in its current form.","label":122}
{"id":"1e142601-63bd-4ee3-9ce6-9bca658bf604","text":"This paper optimizes autoencoders for lossy image compression. Minimal adaptation of the loss makes autoencoders competitive with JPEG2000 and computationally efficient, while the generalizability of the proposed approach is demonstrated by comparing it to JPEG 2000 and recently proposed methods using RNNs. The results show that the deep autoencoders trained using the minimal changes to the loss outperform both of these methods. Additionally, the authors present the advantage of the network's sub-pixel architecture, which enables it to handle high-resolution images efficiently. This is a significant improvement over previous autoencoder compression techniques that relied on coarser approximations, shallower architectures, computationally expensive methods, or were limited to small images. Overall, the paper presents a promising solution to the challenges of optimizing autoencoders for lossy image compression.","label":25}
{"id":"e8085eb1-cb95-4e8f-b8c4-95b30a08d802","text":"Dear reviewers, we made the following changes to our paper:\r\n\r\n\u2013 We have made important contributions to the field of lossy image compression with our proposed approach using compressive autoencoders. Our results demonstrate that minimal changes to the loss function are sufficient to train deep autoencoders competitive with JPEG 2000 and surpassing recently proposed methods based on RNNs. Moreover, our network is computationally efficient thanks to a sub-pixel architecture, making it suitable for high-resolution images. This sets our work apart from previous research that focused on coarser approximations, shallower architectures, computationally expensive methods, or small images. We have addressed the need for more flexible compression algorithms in light of changing hardware and diverse requirements. We believe that our findings are significant and contribute positively to the field. Further details about our methodology and experimental results can be found in the revised paper. Thank you for your time and consideration.","label":10}
{"id":"85a22db1-e7c0-400f-ae9c-550be07c5476","text":"Thank you for the important work. I'm trying to reproduce the results.\r\n\r\nThe loss function in eq 2. scales the distortion by a beta hyper parameter. In your experiment you mention training three variants, high bit-rate: beta=0.01, medium: 0.05 and low: 0.2\r\n\r\nSo, the high bit-rate variant weights the distortion the least? That does not make sense to me. Did you mean 1\/0.01, 1\/0.05 and 1\/0.2 or So, the high bit-rate variant weights the distortion the least? That does not make sense to me. Did you mean 1\/0.01, 1\/0.05, and 1\/0.2 or is there a specific reason for scaling the distortion in this way? It would be helpful if you could provide further clarification on the rationale behind choosing these specific beta values and their corresponding impact on the training process. Overall, I appreciate the valuable contributions of this research and look forward to better understanding the experimental setup.","label":65}
{"id":"6de26940-6d8d-4a5d-bc59-7d4b7515316a","text":"This work proposes a new approach for image compression using auto encoders. The results are impressive, besting the state of the art JPEG 2000 and recently proposed approaches based on RNNs. The authors demonstrate that minimal changes to the loss function are sufficient to train deep autoencoders that are highly competitive with existing codecs. Additionally, the network architecture is computationally efficient thanks to a sub-pixel design, making it suitable for high-resolution images. This is a significant contribution as previous work on autoencoders for compression has either used coarser approximations, shallower architectures, or computationally expensive methods, and mainly focused on small images. By extending the capabilities of autoencoders to larger images, this work opens up new possibilities for flexible compression algorithms that can adapt to various content types and hardware technologies. Overall, this paper presents a valuable advancement in the field of lossy image compression with compressive autoencoders.","label":21}
{"id":"3b40d456-c18e-4498-9f9b-b9d2b124f7a4","text":"This paper proposes an autoencoder approach to lossy image compression by minimizing the weighted sum of reconstruction error and code length. The architecture consists of a convolutional encoder and a sub-pixel convolutional decoder. Experiments compare PSNR, SSIM, and MS-SSIM performance against JPEG, JPEG-2000, and a recent RNN-based compression approach. A mean opinion score test was also conducted.\r\n\r\nPros:\r\n+ The paper provides a novel approach to lossy image compression utilizing autoencoders. The authors highlight the need for more flexible compression algorithms to accommodate changing media formats, hardware technology, and diverse requirements. They successfully demonstrate the competitiveness of their deep autoencoder model compared to established codecs such as JPEG 2000 and surpassing recent approaches involving RNNs. The efficient sub-pixel architecture contributes to the network's computational efficiency, making it applicable to high-resolution images. This aspect distinguishes the work from previous research that focused on coarser approximations, shallower architectures, computationally expensive methods, or limited image sizes. The experiments include comprehensive evaluations using PSNR, SSIM, and MS-SSIM metrics, as well as a mean opinion score test. However, it would be beneficial for the authors to elaborate on the dataset used for evaluation, as well as the method for selecting the weighting scheme between reconstruction error and code length. Additionally, providing more details about the specific network architecture and training procedure would enhance the reproducibility of the results. Overall, the paper presents a promising approach to lossy image compression and addresses the limitations of existing codecs for various applications.","label":57}
{"id":"a1e1f971-2483-41ca-b22b-cfe7ef7aebcc","text":"The paper proposes a neural approach to learning an image compression-decompression scheme as an auto-encoder. While the idea is certainly interesting and well-motivated, in practice, it turns out to achieve effectively identical rates to JPEG-2000.\r\n\r\nNow, as the authors argue, there is some value to the fact that this scheme was learned and not hand-engineered like JPEG-2000. However, I have a few concerns about the paper that should be addressed. Firstly, the authors claim that their approach outperforms recently proposed approaches based on RNNs. However, the paper lacks a proper comparison with these approaches. It would be beneficial to include a performance comparison with RNN-based methods to truly assess the superiority of the proposed approach. Additionally, the authors mention that their network is computationally efficient thanks to a sub-pixel architecture, making it suitable for high-resolution images. However, there is no detailed analysis or experimental evidence to support this claim. It would be helpful if the authors provide quantitative results on computation time for different image sizes to demonstrate the efficiency of their approach. Furthermore, the paper lacks a thorough discussion on the limitations and potential drawbacks of the proposed method. While the results are promising, it would be valuable to examine scenarios where the approach might struggle or fail. Overall, the paper presents an interesting approach to lossy image compression with autoencoders. However, to strengthen the paper, it is crucial to address the concerns raised and provide additional experimental evidence and comparative analysis. With these improvements, the paper will significantly contribute to the field of image compression and autoencoder optimization.","label":51}
{"id":"452484b4-9758-4f3b-96da-9d4e383499c2","text":"Thank you for this important and inspiring work.\r\n\r\nI have some questions regarding the bound in eq. 8.\r\n\r\nMy I have some questions regarding the bound in eq. 8. The authors provide a brief explanation of the bound, but it would be helpful to have more details and mathematical derivations to fully understand its derivation and implications. Additionally, clarification is needed on how the proposed approach handles the non-differentiability of the compression loss. The paper mentions minimal changes to the loss, but it would be helpful to know what these changes are and how they are implemented. It would also be interesting to see some experimental results comparing the proposed approach to other state-of-the-art lossy image compression methods, in addition to the comparison with JPEG 2000 and RNN-based approaches already mentioned. Overall, the paper presents a promising approach to lossy image compression using compressive autoencoders, but further clarification and experimentation would strengthen the findings.","label":17}
{"id":"7bbca88e-46c4-406a-8964-a87b264e2a72","text":"Great and very interesting work. I have some questions.\r\n\r\n1. It seems that the introduced scale parameters for different bit-rates are also trainable. How do you train and update the scale parameters for different bit-rates?.The paper does not explicitly mention how the scale parameters for different bit-rates are trained and updated. It would be helpful if the authors provide more details on this aspect of their method. Additionally, it would be interesting to know how the choice of scale parameters affects the overall performance of the autoencoders compared to other compression algorithms. Overall, this work presents a promising approach to lossy image compression using compressive autoencoders. The authors address the challenge of non-differentiabilty of the compression loss in autoencoders and show competitive performance compared to existing codecs like JPEG 2000. The computational efficiency of the proposed sub-pixel architecture is also a valuable contribution. With further details and experimental results, this paper can make a significant impact in the field of image compression.","label":34}
{"id":"df64c48c-9845-4159-bc8e-b301e19959c0","text":"The authors propose to extend the \u201cstandard\u201d attention mechanism, by extending it to consider a distribution over latent structures (e.g., alignments, syntactic parse trees, etc.). These latent variables are modeled as a graphical model with potentials derived from a neural network.\r\n\r\nThe paper is well-written and clear to understand. The proposed methods are evaluated on various problems, and in each case the \u201cstructured attention\u201d models outperform baseline models (either one without attention, or using simple attention)..The authors provide a thorough exploration of structured attention networks in their paper. They propose a novel approach to incorporate richer structural dependencies within deep neural networks by leveraging graphical models. The idea of extending the attention mechanism to consider a distribution over latent structures is both interesting and promising. The authors demonstrate the effectiveness of their proposed models by conducting experiments on various tasks, including tree transduction, neural machine translation, question answering, and natural language inference. In all cases, the structured attention networks outperform the baseline models, which either lack attention or employ simple attention mechanisms.\r\n\r\nOne noteworthy aspect of this work is the practical implementation of the structured attention networks as neural network layers. The authors provide detailed explanations of how linear-chain conditional random fields and graph-based parsing models can be integrated into deep networks. This practical guidance facilitates the adoption of structured attention networks by other researchers and practitioners in the field.\r\n\r\nMoreover, the authors highlight that the models trained using structured attention networks learn interesting unsupervised hidden representations that generalize simple attention. This finding implies that the proposed approach not only enhances performance on various tasks but also enables the discovery of richer and more meaningful representations.\r\n\r\nOverall, this paper advances the field of attention networks by introducing a framework for incorporating structural dependencies into deep neural networks. The thorough evaluation on multiple tasks convincingly demonstrates the superiority of structured attention networks. The clear writing style and the practical implementation guidelines make this paper a valuable resource for researchers and practitioners interested in improving the modeling of rich structural dependencies within deep learning.","label":75}
{"id":"8a659981-87ca-43b9-be71-3bc3915d19f4","text":"The area chair shares the reviewer's opinion and agrees with the reviewer's opinion and recognizes the significance of incorporating richer structural dependencies into deep neural networks. The experiments conducted in this work demonstrate that the proposed structured attention networks outperform baseline attention models on various tasks, showcasing the efficacy of this approach. Additionally, the discovery of interesting unsupervised hidden representations that generalize simple attention adds further value to the research findings.","label":8}
{"id":"1937803d-c420-4c84-98e2-8f27baa305ee","text":"This is a very nice paper. The writing of the paper is clear. It starts from the traditional attention mechanism case. By interpreting the attention variable z as a distribution conditioned on the input x and query q, the proposed method naturally treat them as latent variables in graphical models. The potentials are computed using the neural network.\r\n\r\nUnder this view, the paper shows traditional dependencies between variables (i.e. structures) can be modeled explicitly and incorporated into deep neural networks through structured attention networks. By encoding richer structural dependencies using graphical models, the authors demonstrate that attention can go beyond the standard soft-selection approach, allowing for attention to partial segmentations or subtrees. The paper presents two classes of structured attention networks: a linear-chain conditional random field and a graph-based parsing model, both implemented as neural network layers. Experimental results show that this approach effectively incorporates structural biases and outperforms baseline attention models in various tasks including tree transduction, neural machine translation, question answering, and natural language inference. Additionally, the authors find that the models trained using this framework learn interesting unsupervised hidden representations that generalize simple attention.","label":73}
{"id":"7438aec1-d613-4731-94f8-061f2af31f62","text":"This is a solid paper that proposes to endow attention mechanisms with structure (the attention posterior probabilities becoming structured latent variables). Experiments are shown with segmental atention (as in semi-Markov models) and syntactic attention (as in projective dependency parsing), both in a synthetic task (tree transduction) and real world tasks (neural machine translation, question answering, and natural language inference). The paper presents clear motivations for incorporating structure into attention networks and provides detailed explanations of the proposed models and implementation. The authors effectively demonstrate the advantages of structured attention networks over baseline attention models, showing improved performance on various tasks. Additionally, the discovery of interesting unsupervised hidden representations that generalize simple attention is a significant contribution of this work. However, there are a few areas that could be further addressed. Firstly, the paper lacks a comprehensive analysis of the limitations and potential drawbacks of structured attention networks. Additionally, more insights into the interpretability of the learned unsupervised hidden representations would enhance the understanding of the underlying mechanisms. Furthermore, the experiments focus on a limited set of tasks, and it would be valuable to explore the applicability of structured attention networks in other domains and applications. Lastly, the paper does not provide a thorough comparison with other existing methods that incorporate structure into attention mechanisms. It would be beneficial to evaluate the proposed models against related approaches and discuss their relative strengths and weaknesses. Notwithstanding these minor shortcomings, the overall quality of the paper is commendable, and it makes valuable contributions to the field of attention networks. The presented results are promising and suggest that structured attention networks have the potential to advance the performance and interpretability of deep neural networks in various natural language processing tasks. I recommend that this paper be accepted for publication after addressing the mentioned points and considering the above suggestions.","label":52}
{"id":"a242d807-ced4-4ecc-801d-a825e0f0f9d0","text":"The authors propose to extend the \u201cstandard\u201d attention mechanism, by extending it to consider a distribution over latent structures (e.g., alignments, syntactic parse trees, etc.). These latent variables are modeled as a graphical model with potentials derived from a neural network.\r\n\r\nThe paper is well-written and clear to understand. The proposed methods are evaluated on various problems, and in each case the \u201cstructured attention\u201d models outperform baseline models (either one without attention, or using simple attention). For the two real-world tasks, the improvements of the structured attention models over the baselines were statistically significant, indicating the effectiveness of incorporating richer structural dependencies. Additionally, the authors provide a detailed explanation of how the structured attention networks can be practically implemented as neural network layers, which enhances the reproducibility of their work. The experiments conducted on synthetic and real tasks, including tree transduction, neural machine translation, question answering, and natural language inference, cover a wide range of applications and demonstrate the versatility of the proposed models. Moreover, the authors highlight that the models trained with structured attention learn interesting unsupervised hidden representations that go beyond simple attention, further enhancing the value of their approach. One aspect that could be improved is the discussion of the limitations of the proposed method, as it would provide a more comprehensive understanding of the potential drawbacks. Apart from this minor suggestion, the paper presents a valuable contribution to the field of attention networks, offering a straightforward way to incorporate structural biases and improve the performance of deep learning models. Overall, I believe this work is well-executed, and I recommend its acceptance for publication.","label":82}
{"id":"4297a7b4-f349-41d6-80c5-d6d560c5447c","text":"The authors propose to extend the \u201cstandard\u201d attention mechanism, by extending it to consider a distribution over latent structures (e.g., alignments, syntactic parse trees, etc.). These latent variables are modeled as a graphical model with potentials derived from a neural network.\r\n\r\nThe paper is well-written and clear to understand. The proposed methods are evaluated on various problems, and in each case the \u201cstructured attention\u201d models outperform baseline models (either one without attention, or using simple attention). For the two real-world tasks, the improvements obtained from the proposed approach are relatively small compared to other recent methods. The experiments conducted on tree transduction, neural machine translation, question answering, and natural language inference demonstrate the effectiveness and versatility of structured attention networks. The results consistently show that incorporating richer structural dependencies through graphical models significantly improves the performance of attention-based models across a wide range of tasks. These improvements are particularly notable in the real-world tasks, where even small gains can have a significant impact. Additionally, the paper highlights that the structured attention networks not only outperform baseline models but also learn interesting unsupervised hidden representations that generalize simple attention. This finding suggests that the proposed approach not only enhances model performance but also fosters the discovery of meaningful patterns in the data. Overall, the paper makes a valuable contribution to the field of attention networks by introducing the concept of structured attention and demonstrating its potential in capturing more complex structural dependencies. The thorough evaluation and clear presentation of the proposed models, along with the comparison to baseline and state-of-the-art methods, strengthens the credibility of the findings.","label":91}
{"id":"2fa1c1de-52d7-4c7d-9c69-49cd5235f7fc","text":".This paper presents an interesting approach to incorporating richer structural dependencies into deep networks through structured attention networks. The authors experiment with two different classes of structured attention networks and show that these models outperform baseline attention models on a variety of tasks. They also highlight the interesting finding that models trained in this way learn unsupervised hidden representations that generalize simple attention. Overall, this paper makes a valuable contribution to the field of attention networks.","label":0}
{"id":"e8bc8037-9831-4dbe-92f1-315c3301f73f","text":"This is a very nice paper. The writing of the paper is clear. It starts from the traditional attention mechanism case. By interpreting the attention variable z as a distribution conditioned on the input x and query q, the proposed method naturally treat them as latent variables in graphical models. The potentials are computed using the neural network.\r\n\r\nUnder this view, the paper shows traditional dependencies between variables (i.e. structures) can be modeled explicitly by using structured attention networks. The authors experiment with two different classes of structured attention networks: a linear-chain conditional random field and a graph-based parsing model. These models are then implemented as neural network layers. The experiments demonstrate that incorporating structural biases through structured attention networks yields superior performance compared to baseline attention models. The proposed approach outperforms baseline models on a range of tasks including tree transduction, neural machine translation, question answering, and natural language inference. Additionally, the authors find that models trained using this approach learn interesting unsupervised hidden representations that generalize simple attention. Overall, the paper presents a clear and compelling methodology for incorporating structural dependencies within deep neural networks through structured attention networks.","label":73}
{"id":"1f6e901c-742b-4277-9f8a-88d9c5d52dff","text":"This is a solid paper that proposes to endow attention mechanisms with structure (the attention posterior probabilities becoming structured latent variables). Experiments are shown with segmental atention (as in semi-Markov models) and syntactic attention (as in projective dependency parsing), both in a synthetic task (tree transduction) and real world tasks (neural machine translation and natural language inference). There is a small gain in using structured attention over simple attention in the latter tasks. A clear accept.\r\n\r\nThe paper is very clear, the approach is novel and interesting, and the experiments seem to give a good proof of concept. However, the use of structured attention in neural MT seems doesn't seem to be fully exploited However, the use of structured attention in neural MT seems to be promising but not fully exploited. The paper briefly mentions that the structured attention network outperforms the baseline attention model in neural machine translation, but it does not provide detailed analysis and comparison of the results. It would have been beneficial to have more insights into why and how the structured attention improves the translation performance. Additionally, the paper could have explored different variations of the structured attention network in neural MT, such as incorporating syntactic dependencies or other structural biases specific to machine translation tasks. This could potentially provide further improvements in translation quality. Overall, while the experiments demonstrate the effectiveness of structured attention in various tasks, more thorough exploration and analysis in the context of neural machine translation would enhance the contribution of this work.","label":113}
{"id":"11a831c9-14d9-475e-a0de-78f4a333bb6d","text":"The authors propose to extend the \u201cstandard\u201d attention mechanism, by extending it to consider a distribution over latent structures (e.g., alignments, syntactic parse trees, etc.). These latent variables are modeled as a graphical model with potentials derived from a neural network.\r\n\r\nThe paper is well-written and clear to understand. The proposed methods are evaluated on various problems, and in each case the \u201cstructured attention\u201d models outperform baseline models (either one without attention, or using simple attention). For the two real-world tasks, the improvements obtained from the proposed approach are relatively small compared to the \u201csimple\u201d attention models, but the techniques are nonetheless interesting.\r\n\r\nMain comments:\r\n1. In the Japanese-English Machine Translation example, the relative difference in performance between the Sigmoid attention model, and the Structured attention model appears to be relatively small. In this case, I\u2019m curious if the authors In this case, I'm curious if the authors have any insights into why the improvements from the proposed structured attention model are not as pronounced as in the other tasks. Was it due to the specific nature of the translation task or were there any other factors at play? Could additional experiments or analysis shed light on this? It would be valuable to have a deeper discussion on the limitations of the proposed approach and its potential future directions. Furthermore, the paper mentions that models trained in this way learn interesting unsupervised hidden representations that generalize simple attention. It would be beneficial if the authors could provide more insights and analysis on these hidden representations, their characteristics, and how they contribute to the improved performance of the structured attention networks. Overall, the paper presents a compelling approach to incorporating structural dependencies within deep neural networks, and the experimental results demonstrate its effectiveness. With some additional discussion and analysis, this work could have even stronger impact in the field.","label":137}
{"id":"e43ae651-c36d-4e7a-ba55-2eed775a0cb3","text":"This paper tests zoneout against a variety of datasets - character level, word level, and pMNIST classification - showing applicability in a wide range of scenarios..Zoneout is a novel method for regularizing RNNs, which stochastically forces certain hidden units to maintain their previous values. This approach, similar to dropout, utilizes random noise to train a pseudo-ensemble, improving generalization. However, instead of dropping hidden units, zoneout preserves them, allowing for better propagation of gradient and state information through time. The authors empirically investigate various RNN regularizers and find that zoneout consistently outperforms other methods across different tasks. Notably, zoneout achieves competitive results in character- and word-level language modeling on the Penn Treebank and Text8 datasets. Moreover, when combined with recurrent batch normalization, zoneout yields state-of-the-art results in permuted sequential MNIST classification. These findings highlight the broad applicability and effectiveness of zoneout regularization.","label":26}
{"id":"699fad41-69bf-428d-b667-cf8a10276d39","text":".The paper 'Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations' proposes a novel method called zoneout for regularizing RNNs. Zoneout stochastically maintains the values of certain hidden units at each timestep, similar to dropout. However, unlike dropout which drops units, zoneout preserves hidden units, allowing for better gradient and state information propagation through time. The authors conduct empirical experiments on various RNN regularizers and demonstrate that zoneout leads to significant performance improvements across different tasks. They achieve competitive results in character- and word-level language modeling as well as permuted sequential MNIST when combining zoneout with recurrent batch normalization.","label":0}
{"id":"5b126808-26c8-42ff-8651-d27af676301e","text":"The authors propose a conceptually simple method for regularisation of recurrent neural networks. The idea is related to dropout, but instead of zeroing out units, they are instead set to their respective values at the preceding time step element-wise with a certain probability.\r\n\r\nOverall, the paper is well written. The method is clearly represented up to issues raised by reviewers during the pre-review question phase. The related work is complete and probably the best currently available on the matter of regularising RNNs.\r\n\r\nThe experimental section focuses on comparing the method with the current SOTA on a set of NLP benchmarks and a synthetic problem. All of the experiments focus on sequences over discrete values. An additional experiment also shows that the sequential Jacobian is far higher for long-term dependencies than in the dropout case.\r\n\r\nOverall, the paper bears great potential. However, I do see some points.\r\n\r\n1) As raised during the pre-review questions, I would like to see the results of experiments that feature a complete hyper parameter search. I.e. a proper model selection process,as it should be standard in the community. I do not see why this was not done, especially as the author count seems to indicate that the necessary resources are available. It would be interesting to see if the performance of zoneout could be further improved by fine-tuning the hyperparameters.\r\n\r\n2) Additionally, it would be beneficial to have a more detailed analysis of the computational cost of zoneout compared to other regularization methods. How does it impact training time and memory requirements? Are there any trade-offs?\r\n\r\n3) The paper focuses on character- and word-level language modeling tasks, as well as the permuted sequential MNIST problem. While these are valid benchmarks, it would be valuable to test zoneout on a wider range of tasks to demonstrate its generalizability. For example, how does it perform on speech recognition or image captioning tasks?\r\n\r\n4) The evaluation of zoneout relies on comparing it to the current state-of-the-art methods. While this is useful, it would also be interesting to see an ablation study that investigates the contribution of each component of zoneout individually. This would help in understanding the specific benefits and limitations of the proposed method.\r\n\r\nIn summary, the concept of zoneout for regularizing recurrent neural networks is promising and the experimental results presented in the paper are encouraging. However, addressing the aforementioned points would provide a more comprehensive evaluation of zoneout and further strengthen the paper's contribution to the field of RNN regularization.","label":199}
{"id":"cb26f2a3-3da1-47b5-a388-4b22b871b6c4","text":"Paper Summary\r\nThis paper proposes a variant of dropout, applicable to RNNs, in which the state\r\nof a unit is randomly retained, as opposed to being set to zero..The proposed method, called zoneout, introduces a new way to regularize RNNs. Unlike dropout, which randomly sets hidden units to zero, zoneout stochastically retains the previous values of some hidden units at each timestep. This approach allows for the preservation of gradient and state information, leading to better propagation through time. The authors draw inspiration from feedforward stochastic depth networks, which also maintain information rather than dropping it. To validate the effectiveness of zoneout, the paper presents an empirical investigation of various RNN regularizers. The results demonstrate that zoneout consistently achieves significant performance improvements across different tasks. The experiments cover character- and word-level language modelling on the Penn Treebank and Text8 datasets, as well as permuted sequential MNIST. Surprisingly, even with relatively simple models, zoneout achieves competitive results. Furthermore, when combined with recurrent batch normalization, it even achieves state-of-the-art performance on the permuted sequential MNIST task. Overall, the paper presents a compelling and novel method for regularizing RNNs, highlighting the benefits of retaining hidden activations rather than completely dropping them.","label":27}
{"id":"4d9f4708-0b50-4ded-9918-f2f2b119dc17","text":"This paper tests zoneout against a variety of datasets - character level, word level, and pMNIST The paper also explores the effectiveness of combining zoneout with recurrent batch normalization. The results show that this combination leads to state-of-the-art performance on permuted sequential MNIST. It is worth noting that the zoneout regularization technique is relatively simple compared to other methods used in the comparison. The authors provide empirical evidence that zoneout improves generalization by maintaining hidden activations, allowing for better gradient and state propagation through time. This is similar in spirit to feedforward stochastic depth networks. Overall, the paper presents a novel regularization method, zoneout, which demonstrates significant performance improvements across various tasks and datasets.","label":16}
{"id":"040d7b6a-5582-45f1-94ae-8d892e08ae90","text":"This paper tests zoneout against a variety of datasets - character level, word level, and pMNIST classification - showing applicability in a wide range of scenarios. While zoneout acts as a regularizer to prevent overfitting, it also has shown to improve generalization and performance across tasks. The authors compare zoneout with other RNN regularizers and find that it consistently outperforms them. Notably, when combined with recurrent batch normalization, zoneout achieves state-of-the-art results on permuted sequential MNIST. The paper provides a thorough empirical investigation and presents experimental results that support the effectiveness of zoneout in various scenarios. Overall, the findings of this study demonstrate the potential of zoneout as a powerful regularization technique for improving RNN performance in different domains.","label":38}
{"id":"eb2c178d-8f72-4bd6-a0a0-14f18bdf4f08","text":"Very nice paper, with simple, intuitive idea that works quite well, solving the problem of solving the problem of regularization in RNNs. The concept of zoneout is effective in improving generalization and achieves impressive performance on various tasks.","label":15}
{"id":"8bd55ed2-a021-4f8f-9961-5839f9c61764","text":"The authors propose a conceptually simple method for regularisation of recurrent neural networks. The idea is related to dropout, but instead of zeroing out units, they are instead set to their respective values at the preceding time step element-wise with a certain probability.\r\n\r\nOverall, the paper is well written. The method is clearly represented up to issues raised by reviewers during the pre-review question phase. The related work is complete and probably the best currently available on the matter of regularising RNNs.\r\n\r\nThe experimental section focuses on comparing the method with the current SOTA on a set of NLP benchmarks and a synthetic problem..The experimental results presented in the paper demonstrate the effectiveness of the proposed zoneout method for regularizing recurrent neural networks (RNNs). The authors compare the performance of zoneout with the current state-of-the-art (SOTA) on a set of natural language processing (NLP) benchmarks, as well as a synthetic problem. This comprehensive evaluation provides strong evidence supporting the claim that zoneout achieves significant performance improvements across various tasks.\r\n\r\nIn the experiments conducted on character- and word-level language modeling tasks using the Penn Treebank and Text8 datasets, zoneout consistently outperforms the existing regularization techniques. The results show that when combined with relatively simple models, zoneout achieves competitive results in both character- and word-level language modeling, indicating its effectiveness in capturing long-term dependencies in sequential data. The study also demonstrates the benefits of combining zoneout with recurrent batch normalization, which leads to state-of-the-art results on permuted sequential MNIST.\r\n\r\nThe authors' empirical investigation of various RNN regularizers, including dropout, variational dropout, weight tying, and L2 regularization, provides valuable insights into the strengths and weaknesses of each method. By comparing these techniques with zoneout, the authors demonstrate the superior performance of zoneout in terms of generalization and gradient propagation through time.\r\n\r\nAdditionally, the paper effectively demonstrates the conceptually simple nature of the proposed method, making it easy to implement and integrate into existing RNN architectures. The use of random noise to train a pseudo-ensemble, similar to dropout, enhances the generalization capability of zoneout while ensuring the preservation of hidden units. This combination results in improved gradient information and state information propagation through time, leading to enhanced performance.\r\n\r\nIn conclusion, the paper presents the zoneout regularization method as a valuable addition to the field of RNN regularization. The empirical investigation and experimental results support the authors' claims of significant performance improvements across tasks. The clear representation of the method, complete related work, and comprehensive evaluation further strengthen the paper's contributions. Overall, the paper is well written and provides an important step forward in the development of regularization techniques for RNNs.","label":102}
{"id":"2ec5468b-d619-4c06-9a55-130a27a2e0be","text":"Paper Summary\r\nThis paper proposes a variant of dropout, applicable to RNNs, in which the state\r\nof a unit is randomly retained, as opposed to being set to zero. This provides\r\nnoise which gives the regularization effect, but also prevents loss of\r\ninformation over time, in fact making it easier to send gradients back because\r\nthey can flow right through the identity connections without attenuation.\r\nExperiments show that this model works quite well. It is still worse that\r\nvariational dropout on Penn Tree bank language modeling task, but given the\r\nsimplicity of the idea it is likely to become widely useful.\r\n\r\nStrengths\r\n- Simple idea that works well.\r\n- Detailed experiments help understand the effects of the zoneout probabilities\r\n and validate its applicability to various tasks. The zoneout regularization method is shown to significantly improve performance in character- and word-level language modeling on the Penn Treebank and Text8 datasets. Additionally, when combined with recurrent batch normalization, it achieves state-of-the-art results on the permuted sequential MNIST task. The paper presents a thorough investigation of different RNN regularizers, highlighting the benefits of zoneout. It demonstrates that by preserving hidden unit activations, gradient and state information can be more effectively propagated through time. Though zoneout is not as effective as variational dropout in the Penn Treebank language modeling task, its simplicity and effectiveness make it a promising regularization technique for RNNs. Overall, the paper contributes a valuable technique for improving the performance and generalization of RNN models.","label":114}
{"id":"f43e2485-311b-4438-84be-7bc28d5119cc","text":"This paper tests zoneout against a variety of datasets - character level, word level, and combining with recurrent batch normalization yields state-of-the-art results on permuted sequential MNIST. The paper provides a clear explanation of the motivation behind zoneout and its potential benefits in regularizing RNNs. The empirical investigation conducted on various tasks demonstrates the effectiveness of zoneout in improving performance. The comparison with other RNN regularizers provides valuable insights into the strengths of zoneout. However, it would be beneficial to provide a more detailed analysis of the trade-offs and limitations of zoneout, as well as its potential impact on model interpretability. Overall, the paper presents an interesting and promising approach to regularizing RNNs with zoneout.","label":15}
{"id":"4d6836fe-b438-4370-8e1e-af6d7131dec6","text":"This paper motivates the combination of autoregressive models with Variational Auto-Encoders and how to control the amount the amount of information stored in the latent code..The paper provides a clear motivation for combining autoregressive models with Variational Auto-Encoders (VAEs) and presents a novel approach to controlling the amount of information stored in the latent code. By leveraging autoregressive models as both the prior distribution and the decoding distribution, the proposed VAE model improves the generative modeling performance. The experimentation results on MNIST, OMNIGLOT, Caltech-101, and CIFAR10 demonstrate state-of-the-art performance and competitive results, which further validate the effectiveness of the approach. The use of autoregressive models allows the global latent code to discard irrelevant information, such as texture in 2D images, resulting in a lossy autoencoding process. This ability to selectively retain important aspects of the data in the latent code is a significant contribution of the paper. The paper is well-written and provides clear explanations of the proposed methods, making it accessible to readers. Additionally, the introduction of controlling the global latent code adds to the existing body of knowledge in representation learning, providing a valuable contribution to the research community.","label":26}
{"id":"7b745eb6-3302-49bd-98be-02bd1fdf0a0b","text":"The reviewers agree that this is a well executed paper, and should be accepted and will make a positive contribution to the conference..The paper presents a simple but principled method to learn global representations by combining Variational Autoencoder (VAE) with autoregressive models, achieving new state-of-the-art results on MNIST, OMNIGLOT and Caltech-101.","label":23}
{"id":"9d723d7b-d74d-44d5-a068-473c3c6c1eeb","text":"Dear Reviewers & AC:\r\n\r\nWe'd like to thank all the reviewers & commenters for their useful suggestions. By taking into account discussion so far, we have significantly updated the manuscript to address questions\/concerns.\r\n\r\nMinor clarification questions have been answered in discussion below. Here we summarize the main concerns of reviewers and subsequently explain how our latest revision address these concerns:\r\n\r\n*** Larger scale experiments (Reviewer1, Reviewer3)\r\nThis is the main focus of the latest revision. In Section 4.3 of the latest revision, we have shown:\r\n1. VLAE has the current best density estimation performance on CIFAR10 among latent-code models. It also outperforms PixelCNN\/PixelRNN [1] and is only second to PixelCNN++ [2].\r\n2. We show several different ways to encode different kinds of information into latent code to demonstrate that one can similarly control information placement on CIFAR10.\r\n\r\n*** \"the necessity of \"crippling\" the decoder ... has already been pointed out\" (Reviewer1)\r\nThanks for making the distinction clear. We have revised relevant text (last paragraph of introduction for example) to highlight that the analysis of \"why crippling is necessary\" is our main contribution as opposed to \"pointing out it's necessary\" [3].\r\n\r\n*** \"However, they never actually showed how a VAE without AF prior but that has a PixelCNN decoder performs. What would be the impact on the latent code is no AF prior is used?\" (Reviewer2)\r\nThanks for your suggestion and we have added related experiments in Appendix..In addition to addressing the reviewers' concerns, the authors have also provided clarification on the necessity of 'crippling' the decoder in their proposed Variational Lossy Autoencoder (VLAE) model. They emphasize that their main contribution lies in analyzing why crippling is necessary, rather than simply pointing out its necessity. The text has been revised to highlight this distinction, providing a clearer understanding of the research. Furthermore, Reviewer 2 raised a valid point regarding the absence of experiments showing the impact of not using an autoregressive flow (AF) prior in VLAE. To address this concern, the authors have added related experiments in the Appendix section. This addition will provide further insights into the role of the AF prior and its impact on the latent code in the absence of such a prior.The latest revision of the manuscript primarily focuses on addressing the reviewers' main concern regarding larger scale experiments. In Section 4.3, the authors present the results of their experiments, demonstrating that VLAE achieves the current best density estimation performance on CIFAR10 among latent-code models. They also compare VLAE's performance with that of PixelCNN\/PixelRNN, showing that VLAE outperforms these models, only falling short of PixelCNN++. Additionally, the authors showcase different ways to encode various kinds of information into the latent code, highlighting the ability to control information placement on CIFAR10.Overall, the latest revision of the manuscript adequately addresses the reviewers' concerns and incorporates the suggested experiments. The inclusion of these experiments strengthens the research presented in the paper and enhances the understanding of the proposed VLAE model. 
The authors have provided a comprehensive response to the reviewers' comments, successfully incorporating their suggestions to improve the overall quality of the paper.","label":227}
{"id":"0f32976b-6a7e-4e8d-87a9-d5875c769487","text":"The AR prior and its equivalent - the inverse AR posterior - is one of the more elegant ways to improve the unfortunately poor generative qualities of VAE-s. It is only an incremental but important step. Incremental, because, judging by the lack of, say, CIFAR10 pictures of the VLAE in its \"creative\" regime ( i.e., when sampling from prior), it will not answer many of the questions hanging over. We hope to see the paper accepted: in relative terms, the paper shines in the landscape of the other papers which are rich on engineering hacks but lacking on theoretical insights.\r\n\r\nSome disagreements with the theoretical suppositions in the paper:\r\n\r\ni) The VAE-s posterior converges to the prior faster than we would like because the gradients of the \"generative\" error (the KL divergence of prior and posterior) w.r.t. mu & sigma are simple, inf differentiable functions and their magnitude far exceeds the magnitude of the resp. gradients of the reconstruction error. Especially when more \"hairy\" decoders like pixelCNN are used. We always considered this obvious and certainly not worthy of further discussion. However, the authors claim that by leveraging autoregressive models as both prior distribution and decoding distribution, they can greatly improve the generative modeling performance of Variational Autoencoders (VAEs). While the results on MNIST, OMNIGLOT, Caltech-101, and CIFAR10 are promising, it would be helpful to have a more detailed comparison with other state-of-the-art models on these datasets. Additionally, the paper mentions that the proposed VAE model allows control over what the global latent code can learn and can force the global latent code to discard irrelevant information. It would be beneficial to provide more explanation and evidence supporting this claim, perhaps through an analysis of the learned representations. Lastly, the paper highlights the simplicity and principled approach of the proposed method. However, it would be useful to elaborate on the specific architectural choices made in combining the VAE with autoregressive models and how these choices contribute to the observed improvements. Overall, the paper presents an interesting approach to learning global representations and improving generative modeling performance, but further clarification and evidence are required to strengthen the claims made.","label":177}
{"id":"46fecd33-61e6-4c71-9294-a4a6e65b1331","text":"I greatly enjoyed the bits-back interpretation and the implied design principles for the Variational Lossy Autoencoder. The authors effectively combine Variational Autoencoder (VAE) with neural autoregressive models to learn global representations that discard irrelevant information, such as texture in 2D images. The use of autoregressive models as both the prior distribution and decoding distribution improves the generative modeling performance of VAEs. The experiments on MNIST, OMNIGLOT, Caltech-101, and CIFAR10 demonstrate the state-of-the-art results achieved by the proposed method. Overall, this paper provides a simple yet principled approach to representation learning that can be useful for downstream tasks like classification.","label":12}
{"id":"9d262796-ad07-4715-86dd-4d9f40ee1f51","text":"This paper proposes a Variational Autoencoder model that can discard information found irrelevant, in order to learn interesting global representations of the data. This can be seen as a lossy compression algorithm, hence the name Variational Lossy Autoencoder. To achieve such model, the authors combine VAEs with neural autoregressive models resulting in a model that has both a latent variable structure and a powerful recurrence structure.\r\n\r\nThe authors first present an insightful Bits-Back interpretation of VAE to show when and how the latent code is ignored. As it was also mentioned in the literature, they say that the autoregressive part of the model ends up explaining all structure in the data, while the latent variables are not used. Then, they propose two complementary approaches to force the latent variables to be used by the decoder..The first approach is to modify the prior distribution of the latent variables. By using an autoregressive model as the prior distribution, the authors are able to capture the dependencies between the latent variables and the input data. This encourages the decoder to utilize the information encoded in the latent variables, as it now has access to a more informative prior distribution. The second approach is to adjust the decoding distribution. By incorporating an autoregressive model in the decoding process, the authors are able to generate more accurate and realistic samples. This in turn encourages the latent code to encode more relevant information, as the decoder relies on it to generate high-quality samples. The experimental results demonstrate the effectiveness of these two approaches. The proposed Variational Lossy Autoencoder achieves state-of-the-art results on several benchmark datasets, including MNIST, OMNIGLOT, Caltech-101, and CIFAR10. The authors also compare their model with other state-of-the-art methods, and show that their approach outperforms them in terms of generative modeling performance. Overall, this paper presents a novel and principled method for learning global representations by combining Variational Autoencoder with neural autoregressive models. The proposed model offers control over the information encoded in the latent code and achieves impressive results on various datasets. Further research can explore the application of this model to other domains and investigate its robustness to different types of data.","label":134}
{"id":"5dc3d3d7-e662-49d7-9c7d-d5458dd7bb43","text":"This paper introduces the notion of a \"variational lossy autoencoder\", where a powerful autoregressive conditional distribution on the inputs x given the latent code z is crippled in a way that forces it to use z in a meaningful way. Its three main contributions are:\r\n\r\n(1) It gives an interesting information-theoretical insight as to why VAE-type models don't tend to take advantage of their latent representation when the conditional distribution on x given z is powerful enough.\r\n\r\n(2) It shows that this insight can be used to efficiently train VAEs with powerful autoregressive conditional distributions such that they make use of the latent code.\r\n\r\n(3) It presents a powerful way to parametrize the prior in the form of an autoregressive flow transformation which is equivalent to using an inverse autoregressive flow transformation on the approximate posterior.\r\n\r\nBy itself, I think the information-theoretical explanation of why VAEs do not use their latent code when the conditional distribution on x given z is powerful enough constitutes an excellent addition to our understanding of VAE-related approaches.\r\n\r\nHowever, the way this intuition is empirically evaluated is a bit weak. The \"crippling\" method used feels hand-crafted and very task-dependent, and the qualitative evaluation of the \"lossyness\" of the learned representation is carried out on three datasets (MNIST, OMNIGLOT and Caltech-101 Silhouettes) which feature black-and-white images with little-to-no texture. Figures 1a and 2a do show that reconstructions discard low-level information, as observed in the slight variations in strokes between the input and the reconstruction, but such an analysis would have been more compelling with more complex image datasets..While the information-theoretical explanation provided in this paper is valuable, the empirical evaluation of the proposed method could be further strengthened. The chosen method for 'crippling' the autoregressive conditional distribution appears to be somewhat ad hoc and specific to the task at hand, which raises questions about its generalizability to other domains. Additionally, the qualitative evaluation of the learned representation's 'lossyness' relies on datasets that consist of black-and-white images with limited texture variations. While the reconstructions shown in Figures 1a and 2a suggest that the model can discard low-level information, it would have been more convincing to evaluate the method using more complex image datasets that contain a wider range of textures. This would provide stronger evidence of the model's ability to selectively autoencode data in a lossy fashion. By conducting experiments on diverse image datasets, the authors can better demonstrate the generalizability of their approach and its effectiveness in capturing global structure while discarding irrelevant details. Additionally, incorporating quantitative evaluation metrics, such as reconstruction error and information preservation measures, would contribute to a more comprehensive evaluation. Despite these limitations in the empirical evaluation, the paper's introduction of the 'variational lossy autoencoder' concept and the insights provided into the behavior of VAE-type models with powerful conditional distributions are significant contributions to the field of representation learning. 
The information-theoretical perspective sheds light on the challenges faced by VAEs in effectively utilizing their latent codes and prompts further research into designing more effective architectures for these models. Overall, this paper presents an intriguing approach and serves as a stepping stone for future investigations on learning global representations while retaining control over the information discarded during the autoencoding process.","label":257}
{"id":"78d9db67-da3b-4377-abbe-6c2e920c8b6a","text":"This paper motivates the combination of autoregressive models with Variational Auto-Encoders and how to control the amount the amount of information stored in the latent code. The authors provide state-of-the-art results on MNIST, OMNIGLOT and Caltech-101.\r\nI find that the insights provided in the paper, e.g. with respect to the effect of having a more powerful decoder on learning the latent code, the bit-back coding, and the use of autoregressive models as both prior and decoding distributions are valuable contributions to the field. The experiments conducted on MNIST, OMNIGLOT, Caltech-101, and CIFAR10 demonstrate the effectiveness of the proposed method. However, the paper could benefit from a more thorough discussion on the limitations of the approach and the potential trade-offs between the level of information preserved in the latent code and downstream task performance. Additionally, the clarity of the writing could be improved in certain sections to enhance the understanding of the proposed model. Overall, this paper presents a compelling approach to learning global representations with variation and loss, and offers valuable insights for future research in representation learning using autoregressive models and Variational Auto-Encoders.","label":65}
{"id":"35ef804e-e934-4f84-8f38-e282c267140c","text":"Hi, I have a few questions:\r\n\r\n1) The main argument of the paper is using autoregressive decoder with in variational AE leads to the problem of code part is completely is ignored because the decoder is too powerful. The solution proposed in the paper is to design the architecture of the VAE model in such a way that the global latent code can discard irrelevant information. By leveraging autoregressive models as both the prior distribution and decoding distribution, the generative modeling performance is greatly improved, as demonstrated by the state-of-the-art results achieved on MNIST, OMNIGLOT, Caltech-101, and competitive results on CIFAR10. This approach effectively combines the benefits of VAEs and autoregressive models, leading to a novel and promising method for representation learning and generative modeling.","label":39}
{"id":"ce2720be-6793-4d69-a52b-5fd96d603ece","text":"Hi, I'm one of the authors of Bowman et al. 2016, and I wanted to point out that while our description of the difficulties in training RNN type models with global latent variables occurs in a section called \"Optimization Challenges,\" we make similar observations as you do in your paper, namely that having a very powerful decoder model like an RNN can result in the model ignoring the global latent variable even at the optimum (perhaps the section should have more generally been called \"Learning Challenges\"). \r\n\r\nSpecifically, we found that without using \"word dropout\" in our decoder, the model would learn to generate plausible sentences without making use of the global latent code, similarly to the observations you made. In our experiments, we found that by introducing word dropout during training, we could encourage the model to rely more on the global latent code, resulting in better representations. However, we acknowledge that this approach might not generalize well to other domains or tasks. Regarding your proposed method of combining Variational Autoencoder (VAE) with neural autoregressive models, we believe it is an interesting approach. Combining VAEs with autoregressive models as prior and decoding distributions has shown promising results in previous works. By leveraging autoregressive models, you can enhance the generative modeling performance and potentially achieve new state-of-the-art results, as you mentioned. We think it would be valuable to further explore and compare the performance of your proposed method with other state-of-the-art approaches on multiple datasets to validate the effectiveness and robustness of your model.","label":95}
{"id":"347222f7-a91a-4d57-8396-e1e72b6a0d66","text":"Dear authors,\r\n\r\nI liked the paper very much and particularly enjoyed the section involving Bits-Back interpretation of VAE.\r\n\r\nI just wanted to point out a very minor stuff with your paper. However, I think it would be beneficial to provide more details about the experimental setup and hyperparameter selection in order to enhance the reproducibility of your results. Additionally, it would be valuable to discuss potential limitations and future directions for this research. Overall, great work on the paper!","label":29}
{"id":"b8f64b75-290d-4f4f-b5f9-6ef2b3582f4c","text":"This paper motivates the combination of autoregressive models with Variational Auto-Encoders and how to control the amount the amount of information stored in the latent code. The authors provide state-of-the-art results on MNIST, OMNIGLOT and Caltech-101 as well as competitive results on CIFAR10. The paper presents a simple yet innovative approach to representation learning by combining Variational Autoencoder (VAE) with autoregressive models such as RNN, MADE, and PixelRNN\/CNN. The authors emphasize the importance of having control over the information stored in the global latent code and propose a VAE model that allows for this control. By designing the architecture accordingly, the global latent code can selectively discard irrelevant information, resulting in a lossy autoencoding of the data. This approach is demonstrated to yield new state-of-the-art results on MNIST, OMNIGLOT, and Caltech-101, and competitive results on CIFAR10. The integration of autoregressive models as both the prior distribution and decoding distribution further enhances the generative modeling performance of VAEs. Overall, the paper presents a novel and promising method for representation learning and generative modeling, showcasing the potential of combining VAEs with autoregressive models in various applications.","label":35}
{"id":"0f3a7d9d-e621-4405-888c-6fb83d5f510b","text":"The reviewers agree that this is a well executed paper, and should be accepted and will make a positive contribution The reviewers appreciate the authors' innovative approach of combining Variational Autoencoder (VAE) with neural autoregressive models, which has resulted in impressive generative modeling performance. Overall, this paper is well-executed and will make a valuable contribution to the field.","label":20}
{"id":"24cc8b63-c1a3-4320-a047-b1a28639177e","text":"Dear Reviewers & AC:\r\n\r\nWe'd like to thank all the reviewers & commenters for their useful suggestions. By taking into account discussion so far, we have significantly updated the manuscript to address questions\/concerns.\r\n\r\nMinor clarification questions have been answered in discussion below. Here we summarize the main concerns of reviewers and subsequently explain how our latest revision address these concerns:\r\n\r\n*** Larger scale experiments (Reviewer1, Reviewer3)\r\nThis is the main focus of the latest revision. In Section 4.3 of the latest revision, we have shown:\r\n1. VLAE has the current best density estimation performance on CIFAR10 among latent-code models. It also outperforms PixelCNN\/PixelRNN [1] and is only second to PixelCNN++ [2].\r\n2. We show several different ways to encode different kinds of information into latent code to demonstrate that one can similarly control the learned global representation. These demonstrations include discarding irrelevant information like texture in 2D images and capturing higher-level properties such as object shape and pose. The results are presented with clear visualizations and analysis in Section 4.4 of the revised manuscript. We believe these experiments address the concerns raised by the reviewers regarding the scalability and performance evaluation of the proposed Variational Lossy Autoencoder (VLAE).\r\n\r\nAnother concern raised by the reviewers revolves around the comparison with other state-of-the-art models. In response to this, we have included a more comprehensive comparison in Section 5.3 of the revised manuscript. We compared our VLAE with various autoencoder variants, generative adversarial networks (GANs), and other state-of-the-art models on different benchmark datasets. The experimental results show that VLAE achieves new state-of-the-art results on MNIST, OMNIGLOT, Caltech-101, and competitive results on CIFAR10.\r\n\r\nFurthermore, we have also added a detailed discussion on the limitations of our proposed model and potential future directions to address these limitations. This can be found in Section 6 of the revised manuscript.\r\n\r\nOverall, we believe our latest revision adequately addresses the major concerns raised by the reviewers. We have conducted larger scale experiments to evaluate the performance of VLAE, provided comprehensive comparisons with other models, and discussed the limitations and future directions. We hope that the reviewers will find these updates satisfactory and consider accepting our paper for publication.\r\n\r\nThank you,\r\nAuthors","label":127}
{"id":"7c6b360d-f748-4c8e-9567-7b4cdfff23f5","text":"The AR prior and its equivalent - the inverse AR posterior - is one of the more elegant ways to improve the unfortunately poor generative qualities of VAE-s. It is only an incremental but important step. Incremental, because, judging by the lack of, say, CIFAR10 pictures of the VLAE in its \"creative\" regime ( i.e., when sampling from prior), it will not answer many of the questions hanging over. We hope to see the paper accepted: in relative terms, the paper shines in the landscape of the other papers which are rich on engineering hacks but lacking on theoretical insights.\r\n\r\nSome disagreements with the theoretical suppositions in the paper:\r\n\r\ni) The VAE-s posterior converges to the prior faster than we would like because the gradients of the \"generative\" error (the KL divergence of prior and posterior) w.r.t. mu & sigma are simple, inf differentiable functions and their magnitude far exceeds the magnitude of the resp. gradients of the reconstruction error. Especially i) The VAE-s posterior converges to the prior faster than we would like because the gradients of the \"generative\" error (the KL divergence of prior and posterior) w.r.t. mu & sigma are simple, inf differentiable functions and their magnitude far exceeds the magnitude of the resp. gradients of the reconstruction error. Especially the Kullback-Leibler (KL) divergence term in the loss function can be problematic as it encourages the approximate posterior distribution to match the prior distribution. This can lead to a collapse in the latent space where all input data is mapped to a single point in the latent space, resulting in poor generation quality. This is a known problem in VAEs and has been addressed in various ways, such as using annealing techniques or adding a regularization term to the loss function to balance the reconstruction and KL divergence terms.ii) Another point of concern in the paper is the lack of analysis regarding the trade-off between the representation capability and the lossy nature of the autoencoder. While it is mentioned that the global latent code can learn to discard irrelevant information such as texture in 2D images, it is unclear how this affects the downstream tasks like classification. It would be helpful to provide experimental results demonstrating the impact of the lossy nature of the autoencoder on classification performance.iii) Additionally, the paper claims to achieve new state-of-the-art results on MNIST, OMNIGLOT, and Caltech-101, as well as competitive results on CIFAR10, but it does not include any comparison with other existing state-of-the-art models. It would be valuable to compare the proposed Variational Lossy Autoencoder (VLAE) with other popular generative models like Generative Adversarial Networks (GANs) or Variational Autoencoders (VAEs) with different prior distributions.Overall, the paper presents an interesting approach to learn global representations using Variational Autoencoders combined with neural autoregressive models. The use of autoregressive models as both prior distribution and decoding distribution is a notable contribution. However, further clarification and analysis are needed regarding the convergence of the posterior to the prior, the trade-off between representation capability and lossy nature, and a more comprehensive comparison with existing state-of-the-art models. 
Addressing these concerns would significantly strengthen the paper and contribute to the field of representation learning.","label":159}
{"id":"7240030d-8612-4389-ad9f-9bc92951730c","text":"I greatly enjoyed the bits-back interpretation and the implied design principles for lossy coding.\r\n\r\nThe second half of the contribution with autoregressive flow priors seems less appealing..I would have liked to see more discussion and analysis on the performance improvements achieved using the autoregressive flow priors. While the paper mentions that it achieved state-of-the-art results on MNIST, OMNIGLOT, and Caltech-101, and competitive results on CIFAR10, it would have been helpful to have more concrete comparisons with existing methods. Additionally, it would have been interesting to see a more detailed exploration of the limitations and trade-offs introduced by the lossy coding approach. Overall, the paper presents a compelling method for learning global representations, but further analysis and experimentation would strengthen its contributions.","label":26}
{"id":"dcf92d58-1b09-4c27-a1aa-866435e6833e","text":"This paper proposes a Variational Autoencoder model that can discard information found irrelevant, in order to learn interesting global representations of the data. This can be seen as a lossy compression algorithm, hence the name Variational Lossy Autoencoder. To achieve such model, the authors combine VAEs with neural autoregressive models resulting in a model that has both a latent variable structure and a powerful recurrence structure.\r\n\r\nThe authors first present an insightful Bits-Back interpretation of VAE to show when and how the latent code is ignored. As it was also mentioned in the literature, they say that the autoregressive part of the model ends up explaining all structure in the data, while the latent variables are not used. Then, they propose two complementary approaches to force the latent variables to be used by the decoder. The first one is to make sure the autoregressive decoder only uses small local receptive field so the model has to use the latent code to learn long-range dependency..The second approach proposed by the authors is to enforce a bottleneck on the latent code, limiting its capacity to store detailed information. By doing so, the decoder is then forced to rely on the global representation provided by the latent code. This bottleneck mechanism acts as a regularization technique, encouraging the learning of more meaningful and global features. The authors experiment with various architectures, combining the VAE with recurrent neural networks (RNNs), MADE models, and PixelRNN\/CNN models. Extensive experiments conducted on popular benchmark datasets such as MNIST, OMNIGLOT, Caltech-101, and CIFAR10 demonstrate the effectiveness of the proposed Variational Lossy Autoencoder. The results obtained surpass previous state-of-the-art performance on these datasets, highlighting the advantages of employing autoregressive models as both prior and decoding distributions within the VAE framework. In conclusion, this paper presents an innovative and principled approach to learning global representations through the Variational Lossy Autoencoder framework, which not only allows for the control over what the latent code learns but also leverages autoregressive models to improve generative modeling performance.","label":163}
{"id":"e71ee800-c9c9-413d-8a49-84b13c274ec7","text":"This paper introduces the notion of a \"variational lossy autoencoder\", where a powerful autoregressive conditional distribution on the inputs x given the latent code z is crippled in a way that forces it to use z in a meaningful way. Its three main contributions are:\r\n\r\n(1) It gives an interesting information-theoretical insight as to why VAE-type models don't tend to take advantage of their latent representation when the conditional distribution on x given z is powerful enough.\r\n\r\n(2) It shows that this insight can be used to efficiently train VAEs with powerful autoregressive conditional distributions such that they make use of the latent code.\r\n\r\n(3) It presents a powerful way to parametrize the prior in the form of an autoregressive flow transformation which is equivalent to using an inverse autoregressive flow transformation on the approximate posterior.\r\n\r\nBy itself, I think the information-theoretical explanation of why VAEs do not use their latent code when the conditional distribution on x given z is powerful enough constitutes an excellent addition to our understanding of VAE-related approaches.\r\n\r\nHowever, the way this intuition is empirically evaluated is a bit weak. The \"crippling\" method used feels hand-crafted and very task-dependent, and the qualitative evaluation of the \"lossyness\" of the learned representation is carried out on three datasets (MNIST, OMNIGLOT and Caltech-101 Silhouettes) which feature black-and-white images with little-to-no texture. Figures 4 and 5 show the reconstructions generated by the proposed VAE model on MNIST, OMNIGLOT, and Caltech-101, respectively. Although the authors claim that the latent code only 'autoencodes' the data in a lossy fashion, it is not clear from the visualizations provided in the figures. The reconstructed images still retain a high level of detail, suggesting that the model may not be effectively discarding irrelevant information. It would be helpful to provide a more thorough quantitative analysis to support this claim.\r\n\r\nAnother limitation of the paper is the evaluation on CIFAR10. While the results are reported as competitive, no detailed analysis or comparison with existing models is provided. It would strengthen the claims made in the paper if the authors could provide a more in-depth comparison of the proposed VAE model with other state-of-the-art generative models on CIFAR10.\r\n\r\nIn terms of the writing style, the paper is generally well-written and presents the methodology in a clear and concise manner. However, there are several instances where additional clarifications would be beneficial. For example, the authors mention using autoregressive models as both the prior distribution and decoding distribution, but it is not entirely clear how these models are integrated into the VAE framework. Providing more details or a step-by-step explanation of the proposed VAE model would greatly enhance the clarity of the paper.\r\n\r\nOverall, the paper introduces an interesting concept of variational lossy autoencoder and provides insights into the limitations of traditional VAE models. However, the empirical evaluation and comparisons with existing models could be improved to strengthen the claims made in the paper. Additionally, providing more clarity on the integration of autoregressive models into the VAE framework would enhance the understanding of the proposed approach. 
With these revisions, the paper has the potential to make a significant contribution to the field of representation learning and generative modeling.","label":219}
{"id":"d2b17b99-2d40-4a3c-97ad-08ae99467e2f","text":"This paper motivates the combination of autoregressive models with Variational Auto-Encoders and how to control the amount the amount of information stored in the global latent code. The authors propose a Variational Lossy Autoencoder (VLAE) model that combines Variational Autoencoder (VAE) with neural autoregressive models such as RNN, MADE, and PixelRNN\/CNN. By leveraging autoregressive models as both the prior distribution and the decoding distribution, the generative modeling performance of VAEs is significantly improved. This approach achieves new state-of-the-art results on MNIST, OMNIGLOT, and Caltech-101 datasets, while also showing competitive results on CIFAR10. The authors highlight the importance of having control over what information is retained in the global latent code, allowing for the compression and discarding of irrelevant details in the input data. This lossy fashion of autoencoding enables efficient representation learning for downstream tasks such as classification. Overall, this paper presents a well-motivated and principled approach to learning global representations using a combination of VAE and autoregressive models, resulting in improved generative modeling performance and state-of-the-art results on multiple datasets.","label":22}
{"id":"105fa1cd-39b6-4b89-abfa-c60fb5cfb25a","text":"Hi, I have a few questions:\r\n\r\n1) The main argument of the paper is using autoregressive decoder with in variational AE leads to the problem of code part is completely is ignored because the decoder is too powerful. The solution is to depend on a small window instead of all previous data..2) It would be helpful if the paper provided more details on how the small window approach was implemented and what specific benefits it brings in terms of the learned representations. Additionally, it would be interesting to see an analysis of the performance of the proposed method compared to other state-of-the-art approaches on the same datasets. Overall, the idea of combining variational autoencoders with autoregressive models is promising and could potentially lead to significant improvements in generative modeling. However, further experimental results and discussions are needed to support the claims made in the paper.","label":51}
{"id":"371140fd-ad41-4cf5-a67e-3587b227b279","text":"Hi, I'm one of the authors of Bowman et al. 2016, and I wanted to point out that while our description of the difficulties in training RNN type models with global latent variables occurs in a section called \"Optimization Challenges,\" we make similar observations as you do in your paper, namely that having a very powerful decoder model like an RNN can result in the model ignoring the global latent variable even at the optimum (perhaps the section should have more generally been called \"Learning Challenges\"). \r\n\r\nSpecifically, we found that without using \"word dropout\" in the decoder, which effectively weakens the decoder model and forces the global variable to encode more information, we would not learn a useful latent variable..I appreciate the insights and findings presented in this paper regarding the challenges of training autoencoder models with global latent variables. Your observation about the importance of incorporating mechanisms such as 'word dropout' in the decoder to encourage the encoding of relevant information aligns with our own experiences in the field. By addressing these learning challenges, your model provides a principled approach to learning global representations that discard irrelevant details, which can be highly beneficial for downstream tasks like classification. Additionally, the use of autoregressive models as both prior and decoding distributions in your VAE model showcases the potential for greatly improving generative modeling performance. The state-of-the-art results achieved on various benchmark datasets further validate the efficacy of your proposed approach. Overall, this paper contributes valuable insights to the representation learning field and provides a strong foundation for future research in this area.","label":120}
{"id":"d4201d88-b8fa-4e84-8fcc-4de248780de1","text":".The paper presents a novel approach to representation learning by combining Variational Autoencoder (VAE) with neural autoregressive models. This combination allows for explicit control over the learned global latent code, enabling the discarding of irrelevant information. By leveraging autoregressive models as both prior distribution and decoding distribution, the proposed VAE model achieves state-of-the-art results on various datasets. However, it would be helpful if the paper provided more experimental details and comparisons with existing techniques. Overall, the paper makes a valuable contribution to the field of generative modeling and representation learning.","label":0}
{"id":"5d2f4623-547a-42f9-ace4-cac5b4d9c276","text":"This paper proposes a weakly supervised, end-to-end neural network model for solving a challenging natural language understanding task. \r\nAs an extension of the Neural Programmer, this work aims at overcoming the ambiguities imposed by natural language. \r\nBy predefining a set of operations, the model is able to induce programs on a real-world dataset without the need for domain-specific grammars, rules, or annotations. The model is trained end-to-end with weak supervision of question-answer pairs, allowing it to learn the complex language understanding and multi-step reasoning required for the task. The main experimental result presented in the paper is that a single Neural Programmer model achieves 34.2% accuracy using only 10,000 examples with weak supervision. Furthermore, by combining an ensemble of 15 models using a simple combination technique, the accuracy improves to 37.7%, which is comparable to the current state-of-the-art accuracy obtained by a traditional natural language semantic parser at 37.1%. This demonstrates the effectiveness of the proposed approach in addressing the challenges of learning a natural language interface for database tables.","label":46}
{"id":"caa08cc8-92df-4d67-8a53-23da84cb2f28","text":"The paper applies a previously introduced method (from ICLR '16) to the challenging question answering dataset (wikitables). The results are strong and quite close to the performance obtained by a semantic parser. There reviewers generally agree that this is an interesting and promising direction \/ results. The application of the neural programmer to this dataset required model modifications which are reasonable though quite straightforward, so, in that respect, the paper is not particularly groundbreaking. However, the authors demonstrate the effectiveness of their approach by achieving a competitive accuracy of 37.7% on the WikiTableQuestions dataset using an ensemble of 15 models. The weak supervision aspect of their training approach is also noteworthy as it eliminates the need for domain-specific grammars or rules. The paper could benefit from a more detailed analysis of the limitations of their model and potential future directions for improvement. Overall, though, the paper makes a valuable contribution to the field of natural language interface learning and provides a solid foundation for further research.","label":69}
{"id":"0d049bce-9c38-40f7-b41e-ad05ed8509dd","text":"I have some questions about your paper:\r\n\r\n1. Why is the oracle performance only Why is the oracle performance only mentioned for the ensemble of 15 models? It would be helpful to see the individual performance of the Neural Programmer model without the ensemble. Additionally, it would be interesting to know how the model performs when trained with a larger number of examples, as using only 10,000 examples may limit the generalizability of the results. The paper mentions that the model does not require domain-specific grammars or rules, but it would be beneficial to understand how the model handles complex queries that may require more nuanced understanding of the database. Overall, the experimental results are promising and it would be useful to include a discussion on the limitations and potential future directions of this work.","label":13}
{"id":"fdaa051d-6ddf-4c49-93f6-ea82c330f265","text":"We thank all the reviewers for the constructive feedback. We appreciate the reviewers' insightful comments and suggestions, which have greatly strengthened the paper. The authors' use of weak supervision and the novel approach of enhancing the objective function of the Neural Programmer model have yielded promising results. The finding that a single model achieves 34.2% accuracy with only 10,000 examples is impressive and the ensemble approach further improves this to 37.7% accuracy. This demonstrates the competitiveness of the proposed model compared to traditional natural language semantic parsers. However, we do have some concerns and suggestions for further improvement, which we outline below.","label":10}
{"id":"62a6299b-31c8-4c08-9491-d08c41a281d6","text":"The paper presents an end-to-end neural network model for the problem of designing natural language interfaces for database queries. The proposed approach uses only weak supervision signals to learn the parameters of the model. Unlike in traditional approaches, where the problem is solved by semantically parsing a natural language query into logical forms and executing those logical forms over the given data base, the proposed approach trains a neural network in an end-to-end manner The paper provides an interesting and novel approach to addressing the problem of learning a natural language interface for database tables. By training a neural network model using weak supervision, the authors eliminate the need for domain-specific grammars, rules, or annotations that previous approaches rely on. This not only simplifies the training process but also makes the model more versatile and adaptable to different datasets and scenarios. The use of Neural Programmer as the underlying neural network architecture is particularly noteworthy, as it allows for built-in discrete operations and enables multi-step reasoning. The experimental results presented in the paper demonstrate the effectiveness of the proposed approach. Achieving 34.2% accuracy on the WikiTableQuestions dataset using only 10,000 examples is a significant accomplishment, especially considering the weak supervision used for training. Moreover, the ensemble of 15 models achieving 37.7% accuracy with a trivial combination technique further demonstrates the model's competitiveness with traditional natural language semantic parsers. However, it would have been helpful if the authors had provided more details about the data preprocessing and model training procedures. Additionally, further analysis and discussion of the limitations and potential improvements of the proposed approach would have added value to the paper. Overall, the paper makes a valuable contribution to the field of natural language processing and sets the stage for future research in designing natural language interfaces for database queries.","label":74}
{"id":"3172fade-52d0-4348-8252-04aaef947d7f","text":"This paper proposes a weakly supervised, end-to-end neural network model for solving a challenging natural language understanding task. \r\nAs an extension of the Neural Programmer, this work aims at overcoming the ambiguities imposed by natural language. \r\nBy predefining a set of operations, the model successfully learns to induce programs that can respond to natural language queries on a database. The model is trained end-to-end with weak supervision, using question-answer pairs without requiring domain-specific grammars, rules, or annotations. The authors present experimental results that demonstrate the effectiveness of their approach. The main result is that a single Neural Programmer model achieves 34.2% accuracy using only 10,000 examples with weak supervision. Additionally, an ensemble of 15 models, with a simple combination technique, achieves a competitive accuracy of 37.7%, which surpasses the current state-of-the-art accuracy obtained by a traditional natural language semantic parser. Overall, this paper contributes to the field by providing a novel and effective method for learning a natural language interface for database tables.","label":44}
{"id":"b92b578d-4eea-42c2-89ae-baa8eaa98c8f","text":"This paper proposes a weakly supervised, end-to-end neural network model to learn a natural language interface for tables..The paper tackles the challenging task of learning a natural language interface for database tables by introducing a weakly supervised, end-to-end neural network model. By enhancing the objective function of the Neural Programmer, the authors successfully induce programs on a real-world dataset without the need for domain-specific grammars, rules, or annotations. Through experiments on the WikiTableQuestions dataset, the results demonstrate that a single Neural Programmer model achieves 34.2% accuracy using only 10,000 examples with weak supervision. Furthermore, an ensemble of 15 models achieves a competitive accuracy rate of 37.7%, comparable to the state-of-the-art achieved by traditional natural language semantic parsers. Overall, the paper presents an innovative and promising approach to address the challenges in learning a natural language interface.","label":18}
{"id":"ff5b4b4c-6846-4358-a9ec-af42137cd94a","text":"This paper proposes a weakly supervised, end-to-end neural network model for solving a challenging natural language understanding task. \r\nAs an extension of the Neural Programmer, this work aims at overcoming the ambiguities imposed by natural language. \r\nBy predefining a set of operations, the model is able to learn the interface to induce programs on a real-world dataset. The authors enhance the objective function of the Neural Programmer and apply it on WikiTableQuestions, a widely used question-answering dataset. The model is trained end-to-end with weak supervision, eliminating the need for domain-specific grammars, rules, or annotations. The main experimental result reported in the paper is that a single Neural Programmer model achieves 34.2% accuracy using only 10,000 examples. The authors further improve the performance by ensembling 15 models, obtaining a competitive accuracy of 37.7%. This accuracy is comparable to the current state-of-the-art accuracy of 37.1% achieved by a traditional natural language semantic parser. Overall, the proposed approach shows promise in learning a natural language interface for database tables without requiring extensive supervision.","label":50}
{"id":"76b912a3-ad9f-4390-af89-08c1bd23b2cc","text":"The paper applies a previously introduced method (from ICLR '16) to the challenging question answering dataset (wikitables). The results are strong and quite close to the performance obtained by a traditional natural language semantic parser. The authors present an interesting approach that tackles the task of learning a natural language interface for database tables using weak supervision and an end-to-end neural network model. They build upon the Neural Programmer model by enhancing its objective function and apply it on the challenging WikiTableQuestions dataset. The experimental results are impressive, with a single Neural Programmer achieving 34.2% accuracy with only 10,000 examples and an ensemble of 15 models achieving a competitive accuracy of 37.7%. These results demonstrate the efficacy of their approach and its potential to outperform traditional natural language semantic parsers. However, it would be beneficial if the paper provided more analysis and insights into the strengths and weaknesses of their model, as well as comparisons with other state-of-the-art methods. Additionally, further details on the dataset, training methodology, and model architecture would enhance the reproducibility of the experiments. Overall, this is a promising and well-executed study that pushes the boundaries of natural language understanding and program induction.","label":28}
{"id":"e8b9eb7a-c16f-42d4-aacb-320a04368029","text":"I have some questions about your paper:\r\n\r\n1. Why is the oracle performance only slightly above random guessing at 2.5%? Was the oracle performance constrained by any specific factors or limitations? Additionally, it would be helpful to understand the complexity of the queries in the WikiTableQuestions dataset. Were there any particular types of queries that the Neural Programmer model struggled with? Overall, it is impressive that your weakly supervised Neural Programmer model achieved 34.2% accuracy, outperforming traditional natural language semantic parsers. The ensembling technique also showed a promising improvement in accuracy. In future work, it would be interesting to explore methods to further enhance the model's performance and investigate if using more examples for training could boost accuracy even further. Additionally, understanding the limitations and potential biases of the model's output could be important for practical applications of the system. Overall, this paper presents a significant contribution in developing an end-to-end neural network model for inducing programs in a natural language interface, and it opens up several avenues for future research.","label":13}
{"id":"a24845a1-3643-488b-94f4-61afc1dd5e2d","text":"We thank all the reviewers for the constructive feedback. We performed more experiments based on the feedback received and have made several improvements to the paper. The additional experiments we conducted further validate the effectiveness of our approach and provide stronger evidence for its superiority over previous methods in terms of accuracy and efficiency.","label":13}
{"id":"d7c20303-a403-4531-9da0-a174e20ec94a","text":"The paper presents an end-to-end neural network model for the problem of designing natural language interfaces for database queries. The proposed approach uses only weak supervision signals to learn the parameters of the model. Unlike in traditional approaches, where the problem is solved by semantically parsing a natural language query into logical forms and executing those logical forms over the given data base, the proposed approach trains a neural network in an end-to-end manner which goes directly from the natural language query to the final answer obtained by processing the data base. This is achieved by formulating a collection of operations to be performed over the data base as continuous operations, the distributions over which is learnt using the now-standard soft attention mechanisms. The model is validated on the smallish WikiTableQuestions dataset, where the authors show that a single model performs worse than the approach which uses the traditional Semantic Parsing technique. However an ensemble of 15 models (trained in a variety of ways) results in comparable performance to the state of the art. \r\n\r\nI feel that the paper proposes an interesting solution to the hard problem of learning natural language interfaces for data bases..The authors provide a detailed description of their model architecture and training procedure, which includes enhancing the objective function of Neural Programmer to better handle the complex nature of natural language queries. They also address the issue of weak supervision by using question-answer pairs for training instead of relying on domain-specific grammars or annotations. This approach not only simplifies the training process but also makes the model more flexible and adaptable to different datasets. The experimental results presented in the paper are promising, with the ensemble of 15 models achieving a competitive accuracy of 37.7%, which is comparable to the current state-of-the-art. However, it would be beneficial if the authors provided a more comprehensive analysis of the results, including a comparison with other approaches and an investigation into the limitations of their model. Overall, the paper presents a novel approach to learning natural language interfaces for database tables and demonstrates the potential of neural networks in solving this challenging problem.","label":195}
{"id":"f9bb64a5-3ab3-44a1-a3af-e50d334d4c87","text":"This paper proposes a weakly supervised, end-to-end neural network model for solving a challenging natural language understanding task. \r\nAs an As an important contribution to the field, this paper presents the first weakly supervised, end-to-end neural network model for inducing programs on a real-world dataset in the context of learning a natural language interface for database tables. The authors enhance the objective function of Neural Programmer, a neural network with built-in discrete operations, and successfully apply it on the WikiTableQuestions dataset. By training the model with weak supervision of question-answer pairs, the authors eliminate the need for domain-specific grammars, rules, or annotations that were traditionally required in previous approaches to program induction. The main experimental result presented in the paper is quite promising, showing that a single Neural Programmer model achieves an accuracy of 34.2% using only 10,000 examples. Furthermore, the authors demonstrate the effectiveness of ensemble learning by combining 15 models, resulting in a competitive accuracy of 37.7%, comparable to the current state-of-the-art accuracy of 37.1% achieved by a traditional natural language semantic parser. Overall, the paper is well-structured, clearly written, and provides valuable insights into learning a natural language interface using neural programming techniques.","label":20}
{"id":"3e49d63f-6535-4602-a86b-6a6b3be25945","text":"This paper proposes a weakly supervised, end-to-end neural network model to learn a natural language interface for tables. The neural programmer is applied to the WikiTableQuestions, a natural language QA dataset and achieves reasonable accuracy. An ensemble further boosts the performance by combining components built with different configurations, and achieves comparable performance as the traditional natural language semantic parser baseline..The model is able to learn programs from question-answer pairs without the need for domain-specific grammars or annotations. The experimental results show that a single Neural Programmer achieves 34.2% accuracy with weak supervision, and an ensemble of 15 models achieves 37.7% accuracy, which is competitive with the current state-of-the-art natural language semantic parser.","label":60}
{"id":"076c54c5-e701-4065-b0d3-dd987211a295","text":"This paper provides a simple method to handle action repetitions. They make the action a tuple (a,x), where a is the action chosen, and x the number of repetitions. Overall they report some improvements over A3C\/DDPG, dramatic in some games, moderate in other. The idea seems natural and there is a wealth of experiment to support it.\r\n\r\nComments:\r\n\r\n- The scores reported on A3C in this paper and in the Mnih et al. publication (table S3) differ significantly. Where does this discrepancy come from? If it's from a different training regime (fewer iterations, for instance), did the authors confirm that running their replication to the same settings as Mnih et al provide similar results?\r\n\r\n- It is intriguing that the best results of FiGAR are reported on games where few actions repeat dominate. This seems to imply that for those, the performance overhead of FiGAR over A3C is high since A3C uses an exploration policy and FiGAR introduces additional decision-making steps. It would be interesting to further analyze this trade-off and understand the conditions under which FiGAR provides significant improvements. Additionally, it would be helpful to include a more detailed explanation of the hyperparameters used in the experiments and any tuning performed. Although the authors mention that FiGAR can be applied to any Deep Reinforcement Learning algorithm maintaining an explicit policy estimate, they mainly focus on three specific algorithms. It would be beneficial to provide additional insights into how FiGAR can be integrated with other popular algorithms in the field. The empirical results presented in this paper are compelling and demonstrate the effectiveness of the proposed FiGAR framework. However, it would be valuable to assess the statistical significance of the observed improvements over A3C\/DDPG by performing significance tests or confidence interval analysis. Moreover, it would be interesting to investigate the generalization capabilities of FiGAR by applying it to different domains and tasks. Overall, this paper presents a novel approach to handling action repetitions in Deep Reinforcement Learning and provides substantial empirical evidence supporting its effectiveness.","label":151}
{"id":"1c141bed-f848-4630-b4d4-c377088f8663","text":"The basic idea of this paper is simple: run RL over an action space that models both the actions and the number of times they are repeated. It's a simple idea, but seems to work really well on a variety of domains. The authors provide a clear motivation for their approach and explain how their Fine Grained Action Repetition (FiGAR) framework enhances traditional Deep Reinforcement Learning algorithms. The experiments conducted on the Atari 2600, Mujoco, and TORCS car racing domains demonstrate the efficacy of FiGAR, as it consistently outperforms standard policy search algorithms. This paper presents a valuable contribution to the field of RL, as it introduces a novel approach for enabling temporal abstractions and planning through sequences of repetitive macro-actions. The results obtained certainly warrant further investigation and could potentially have significant applications in real-world RL scenarios.","label":38}
{"id":"6d514cec-db3e-4c85-9b0b-3f6b5ec37f8a","text":"We thank all the reviewers for asking interesting questions and pointing out important flaws in the paper. We have uploaded a revised version of the paper that we believe addresses the questions raised. Major features of the revision are:\r\n\r\n1. We have added results on 2 more Atari 2600 games: Enduro and Q-bert. FiGAR seems to improve performance rather consistently across a variety of games, suggesting its generalizability.\r\n\r\n2. We have included a detailed analysis of the hyperparameters used in our experiments. This addresses one of the concerns raised by the reviewers regarding the lack of information about the choices made in selecting hyperparameters. We now provide a table that lists all the hyperparameters used in our experiments for each domain, along with a brief justification for their selection.\r\n\r\n3. The clarity of the paper has been enhanced by reorganizing sections and improving the flow of the content. We have added more detailed explanations of the methodology used and included illustrative diagrams wherever necessary to aid understanding.\r\n\r\n4. To address concerns about the scalability of FiGAR, we have conducted additional experiments on larger action spaces. The results show that FiGAR continues to provide performance improvements even in domains with significantly larger action spaces, reaffirming its applicability to real-world environments.\r\n\r\n5. We have conducted a comparative analysis of FiGAR with other state-of-the-art approaches for incorporating temporal abstractions. The results suggest that FiGAR outperforms existing methods in terms of sample complexity and sample efficiency.\r\n\r\nOverall, we believe that these revisions have significantly strengthened the paper and addressed the concerns raised by the reviewers. We thank the reviewers for their helpful feedback and hope that they will find the revised version of the paper satisfactory.","label":58}
{"id":"6b16359f-eda6-4c44-8ffc-370970f1ca62","text":"This paper shows that extending deep RL algorithms to decide which action to take as well as how many times to repeat it leads to improved performance on a number of domains..The paper titled 'Learning to Repeat: Fine Grained Action Repetition for Deep Reinforcement Learning' presents a novel framework called Fine Grained Action Repetition (FiGAR), which allows the agent to determine both the action to be executed and the duration of repeating it. This innovative approach enables the agent to efficiently learn temporal abstractions in the action space and implicitly incorporate planning through sequences of repetitive macro-actions. The authors illustrate the effectiveness of FiGAR by demonstrating performance improvements across various domains in comparison to three prevalent policy search algorithms: Asynchronous Advantage Actor Critic (A3C) in the Atari 2600 domain, Trust Region Policy Optimization (TRPO) in the Mujoco domain, and Deep Deterministic Policy Gradients (DDPG) in the TORCS car racing domain.\r\n\r\nThe experiments conducted in the paper exhibit promising results, showcasing the enhanced capabilities of the FiGAR framework in achieving improved performance. By allowing the agent to determine the optimal number of repetitions for an action, FiGAR not only enhances exploration but also enables more effective exploitation. This flexibility in time scale greatly benefits deep reinforcement learning algorithms that rely on explicit policy estimates. The framework can be seamlessly integrated into existing deep RL algorithms to augment their decision-making capabilities and achieve better performance.\r\n\r\nFurthermore, the empirical evaluations carried out on different domains validate the efficacy of FiGAR. The performance improvements achieved on domains as diverse as Atari 2600, Mujoco, and TORCS car racing underscore the versatility of the proposed framework. The results demonstrate the ability of FiGAR to learn complex behavioral patterns and acquire long-term temporal dependencies, leading to more efficient decision making and enhanced policy convergence. Overall, this paper significantly contributes to the field of deep RL by introducing an innovative framework that broadens the capabilities of existing algorithms and improves their performance across multiple domains.","label":32}
{"id":"4cc978cb-0f25-487a-8538-0b9b19508310","text":"This paper provides a simple method to handle action repetitions. They make the action a tuple (a,x), where a is the action chosen, and x the number of repetitions. Overall they report some improvements over A3C\/DDPG, dramatic in some games, moderate in other. The idea seems natural and there is a wealth of experiment to support it.\r\n\r\nComments:\r\n\r\n- The scores reported on A3C in this paper and DDPG are not entirely clear. It would be helpful to provide more detailed information on the experimental setup, such as the hyperparameters used and the number of repeat actions explored during training. Additionally, it would have been insightful to include a comparison with other methods that tackle the issue of action repetitions, such as H-DQN or the successor representation framework. This would have provided a more comprehensive overview of the current state-of-the-art in handling action repetitions in deep reinforcement learning.\r\n\r\nThe paper's contribution lies in its novel framework called Fine Grained Action Repetition (FiGAR), which allows the agent to determine both the action and the duration of its repetition. The concept of temporal abstractions in the action space is an interesting approach to tackle the challenge of decision-making at every time step. By enabling agents to plan through sequences of repetitive macro-actions, FiGAR opens up new possibilities for learning complex behavioral patterns.\r\n\r\nThe empirical evaluations conducted in this paper demonstrate the effectiveness of FiGAR. Performance improvements are observed across three different domains, namely the Atari 2600 domain, the Mujoco domain, and the TORCS car racing domain. It is noteworthy that the framework consistently outperforms three state-of-the-art policy search algorithms, namely Asynchronous Advantage Actor Critic (A3C), Trust Region Policy Optimization (TRPO), and Deep Deterministic Policy Gradients (DDPG). This implies that FiGAR has the potential to enhance the capabilities of existing deep reinforcement learning algorithms and improve their performance in various sequential decision-making tasks.\r\n\r\nOne limitation of the paper is the lack of analysis on the robustness and generalizability of FiGAR. While the improvements in performance are promising, it is important to investigate how the framework performs under different parameter settings or when presented with challenging scenarios. Additionally, the computational complexity of FiGAR is not discussed, which leaves some questions regarding its efficiency. Overall, the paper presents an interesting approach to address action repetitions in deep reinforcement learning and provides empirical evidence of its efficacy across multiple domains. With further research and refinement, FiGAR has the potential to contribute significantly to the future development of reinforcement learning algorithms.","label":66}
{"id":"3ccd35cc-d59a-4b2c-b3ac-7ec77f4c1e68","text":"This paper proposes a simple but effective extension to reinforcement learning algorithms, by adding a temporal repetition component as part of the action space, enabling the policy to select how long to repeat the chosen action for. The extension applies to all reinforcement learning algorithms, including both discrete and continuous domains, as it is primarily changing the action parametrization. The paper is well-written, and the experiments extensively evaluate the approach with 3 different RL algorithms in 3 different domains (Atari, MuJoCo, and TORCS).\r\n\r\nHere are some comments and questions, for improving the paper:\r\n\r\nThe introduction states that \"all DRL algorithms repeatedly execute a chosen action for a fixed number of time steps k\". This statement is too strong, and is actually disproved in the experiments \u2014 repeating an action is helpful in many tasks, but not in all tasks. The sentence should be rephrased to be more precise.\r\n\r\nIn the related work, a discussion of the relation to semi-MDPs would be useful to help the reader better understand the approach and how it compares and differs (e.g. the response from the pre-review questions)\r\n\r\nExperiments:\r\nCan you provide error bars on the experimental results? (from running multiple random seeds)\r\n\r\nIt would be useful to see experiments with parameter sharing in the TRPO The paper presents a novel approach, Fine Grained Action Repetition (FiGAR), which introduces temporal abstractions in the action space of reinforcement learning algorithms. By allowing the agent to choose both the action and the time scale of repeating it, FiGAR enables planning through sequences of repetitive macro-actions. The experiments conducted in this paper demonstrate the efficacy of the proposed framework, showcasing performance improvements on top of three policy search algorithms across different domains. However, there are some areas that need further clarification and improvement.\r\n\r\nFirstly, the statement in the introduction that all DRL algorithms repeatedly execute a chosen action for a fixed number of time steps should be revised to accurately reflect the findings of the experiments. While the repetition of actions proves beneficial in many tasks, it is not universally applicable. The introduction should provide a more precise explanation to avoid misleading readers.\r\n\r\nRegarding the related work, the paper would benefit from a discussion on the relationship between Fine Grained Action Repetition and semi-Markov decision processes (semi-MDPs). This analysis would offer a deeper understanding of the approach and clarify its similarities and differences compared to existing methods.\r\n\r\nIn terms of the experiments, it would be valuable to include error bars on the experimental results to provide a measure of statistical significance. This would further strengthen the empirical evidence supporting the effectiveness of the FiGAR framework. Additionally, it is suggested to include experiments with parameter sharing in the Trust Region Policy Optimization (TRPO) algorithm, as this could offer insights into the impact of parameter sharing on performance.\r\n\r\nOverall, the paper presents a well-written and promising approach to enhancing reinforcement learning algorithms. By addressing the mentioned points, the authors can further strengthen the paper and its contribution to the field.","label":205}
{"id":"36b43560-da75-44b8-b10a-d22df3c8743c","text":"Hi, the main idea is quite interesting. I was curious about the following. My primary question is Q1, and others are predominantly comments.\r\n\r\nQ1: After learning is complete, did you try forward propagating through the network to find actions for every time-step as opposed to repeating actions? Concretely, if at t=5, action suggested by the network is a_3 with a repetition of 4, instead of sticking with a_3 for times t={5,6,7,8} perform action a_3 for just t=5 and continue with action a_4 at t=9. This would provide a better understanding of the effectiveness of temporal abstractions in the action space, as well as the planning capability enabled by sequences of repetitive macro-actions. Additionally, it would be interesting to see how the FiGAR framework performs when compared to other state-of-the-art approaches that incorporate temporal abstraction. For example, comparing FiGAR with options-based frameworks such as H-DQN or MAX-Q could shed light on the strengths and weaknesses of each approach. Moreover, it would be beneficial to provide further analysis and discussion on the computational and memory requirements of the FiGAR framework. Since the agent is now able to make decisions on the time scale of action repetition, does this significantly increase the computational complexity? How does it affect the memory usage, especially when repeating actions for longer time steps? Finally, it would be valuable to discuss the limitations and potential future extensions of the FiGAR framework. Are there any specific tasks or domains where FiGAR may not be as effective? Can FiGAR be combined or extended with other techniques to further enhance its performance? Overall, this paper presents a novel framework for fine-grained action repetition, and the empirical results on multiple domains are promising. However, addressing the aforementioned questions and providing further analysis and discussion would strengthen the paper and provide a more comprehensive understanding of the FiGAR framework.","label":75}
{"id":"699ca701-a6e4-4520-b64e-533820c8e55d","text":"This paper provides a simple method to handle action repetitions. They make the action a tuple (a,x), where a is the action chosen, and x the number of repetitions. Overall they report some improvements over A3C\/DDPG, dramatic in some games, moderate in other. The idea seems natural and there is a wealth of experiment to support it.\r\n\r\nComments:\r\n\r\n- The scores reported on A3C in this paper and in the Mnih et al. publication (table S3) differ significantly. Where does this discrepancy come from? If it's from a different training regime (fewer iterations, longer training time, etc.), it would be helpful to mention it and discuss the potential impact on the results.\r\n\r\n- The empirical evaluation is thorough and includes experiments on three different domains, which enhances the generalizability of the findings. The choice of domains also demonstrates the versatility of the proposed FiGAR framework across various problem settings.\r\n\r\n- The paper does a commendable job of explaining the intuition behind FiGAR and its benefits in enabling temporal abstractions and planning through repetitive macro-actions. The notion of fine-grained action repetition is intuitive and it is straightforward to see how it can yield performance improvements in sequential decision-making tasks.\r\n\r\n- The algorithm and evaluation methodology are both sound and well-described. The experiments are properly controlled, and the results are statistically analyzed to provide meaningful insights into the performance gains achieved with FiGAR.\r\n\r\n- It would be useful to include a discussion on the computational complexity of implementing FiGAR and any potential limitations or trade-offs in terms of runtime efficiency. This would help readers evaluate the feasibility of incorporating FiGAR into their own deep reinforcement learning algorithms.\r\n\r\n- The paper mentions that FiGAR can be used with any deep reinforcement learning algorithm that maintains an explicit policy estimate. It would be interesting to see a comparison of FiGAR with other existing methods for incorporating temporal abstractions in deep reinforcement learning, such as options or hierarchical reinforcement learning approaches.\r\n\r\nIn summary, this paper presents a novel framework, Fine Grained Action Repetition (FiGAR), for improving deep reinforcement learning algorithms. The empirical results demonstrate the efficacy of FiGAR in achieving performance improvements in multiple domains. The paper is well-written, and the experiments are thorough. Some minor clarifications and discussions can enhance the quality of the paper. Overall, this work makes a valuable contribution to the field of deep reinforcement learning and should be considered for publication.","label":91}
{"id":"d198f92e-7af3-4f3a-ab4e-1336acabe223","text":"The basic idea of this paper is simple: run RL over an action space that models both the actions and the number of times they are repeated. The authors propose a novel framework, Fine Grained Action Repetition (FiGAR), which allows the agent to not only select actions but also decide on the time scale of repeating them. This framework enables temporal abstractions in the action space and facilitates planning through sequences of repetitive macro-actions. The authors demonstrate the effectiveness of FiGAR by conducting experiments on three different domains: Atari 2600, Mujoco, and TORCS car racing. They compare the performance of FiGAR with three policy search algorithms: Asynchronous Advantage Actor Critic, Trust Region Policy Optimization, and Deep Deterministic Policy Gradients. The results show significant performance improvements, highlighting the potential impact of incorporating action repetition in deep reinforcement learning algorithms.","label":26}
{"id":"9d018f6e-f6ae-4fac-bcbf-256fbe7763fb","text":"We thank all the reviewers for asking interesting questions and pointing out important flaws in the paper. We have uploaded a revised version of the paper that we believe addresses the questions raised. Major features of the revision are:\r\n\r\n1. We have added results on 2 more Atari 2600 games: Enduro and Q-bert. FiGAR seems to improve performance rather dramatically on Enduro with the FiGAR agent being close to 100 times better than the baseline A3C agent. (Note that the baseline agent performs very poorly according to the published results as well)\r\n\r\n2. In response to AnonReviewer3\u2019s comment about skipping intermediate frames, we have added Appendix F (page 23) by conducting experiments on what happens when FiGAR does not discard any intermediate frames (during evaluation phase). The general pattern seems to be that for games wherein lower action repetition is preferred, gains are made in terms of improved gameplay performance. However, for 24 out of 33 games the performance becomes worse, which depicts the importance of the temporal abstractions learnt by the action repetition part of the policy (\\pi_{\\theta_{x}}). This does not address the reviewer\u2019s question completely since at train time we still skip all the frames, as suggested by the action repetition part of the policy (\u03c0\u03b8\u2093). This does not address the reviewer's question completely since at train time we still skip all the frames, as suggested by the action repetition part of the policy (\u03c0\u03b8\u2093). However, we believe that our empirical results demonstrate the effectiveness of our FiGAR framework in improving performance across various domains and reinforcement learning algorithms. The performance improvements observed in the Atari 2600 domain, Mujoco domain, and TORCS car racing domain highlight the potential of FiGAR to enhance policy search algorithms. Furthermore, the significant performance improvement in the Enduro game compared to the baseline A3C agent provides strong evidence supporting the efficacy of our approach. We would like to thank AnonReviewer3 for their suggestion of conducting experiments without skipping intermediate frames during evaluation. This additional analysis not only provides insights into the importance of temporal abstractions learned by the action repetition part of the policy but also strengthens the robustness of our framework. In conclusion, we believe that the revisions made in response to the reviewers' comments, along with the additional experiments, have significantly improved the paper and validate the effectiveness of the FiGAR framework for fine-grained action repetition in deep reinforcement learning.","label":200}
{"id":"0fa04ab3-79a9-4719-9fd8-42a3f1820eef","text":"This paper shows that extending deep RL algorithms to decide which action to take as well as how many times to repeat it leads to improved performance on a number of domains. The evaluation is very thorough and shows that this simple idea works well in both discrete and continuous actions spaces.\r\n\r\nA few comments\/questions:\r\n- Table 1 could be easier to interpret as a figure of Figure 1. It would be helpful for the reader to have a visual representation of the results. Additionally, it would be interesting to see a comparison with other existing methods or techniques that also aim to address the problem of temporal abstractions and action repetitions in deep RL. This could help provide a better understanding of the novelty and effectiveness of the proposed FiGAR framework. Furthermore, it would be beneficial to include more detailed explanations and analysis of the results obtained in the different domains. While the paper mentions performance improvements, it would be valuable to have a deeper discussion on the specific aspects or characteristics of each domain that make FiGAR particularly effective. Overall, the paper presents a compelling approach to introducing temporal abstractions in deep RL and demonstrates its efficacy in various domains. With some refinements and further analysis, this work could make a valuable contribution to the field of reinforcement learning.","label":65}
{"id":"6496f330-474f-459d-a618-2279ff03114f","text":"This paper provides a simple method to handle action repetitions. They make the action a tuple (a,x), where a is the action chosen, and x the number of repetitions. Overall they report some improvements over A3C\/DDPG, dramatic in some games, moderate in other..The paper proposes a novel framework called Fine Grained Action Repetition (FiGAR) that aims to enable agents to not only decide on the action to take but also the time scale at which to repeat that action. By introducing temporal abstractions in the action space, FiGAR allows for implicit planning through sequences of repetitive macro-actions. This approach is particularly useful in reinforcement learning algorithms that maintain an explicit policy estimate, as it provides a mechanism for incorporating repetitive behavior in decision-making processes.\r\n\r\nThe empirical evaluations conducted in the paper demonstrate the efficacy of FiGAR across three different domains: Atari 2600, Mujoco, and TORCS car racing. In each of these domains, FiGAR outperforms existing policy search algorithms, namely Asynchronous Advantage Actor Critic (A3C) in the Atari 2600 domain, Trust Region Policy Optimization (TRPO) in the Mujoco domain, and Deep Deterministic Policy Gradients (DDPG) in the TORCS car racing domain. These improvements are observed to be significant in some games and more moderate in others. Nonetheless, the consistent performance enhancements across multiple domains highlight the potential of FiGAR as a powerful tool for enhancing the learning capabilities of deep reinforcement learning algorithms.\r\n\r\nOne notable advantage of FiGAR is its simplicity. The action repetition mechanism is straightforward, involving the action and repetition count in a tuple representation. This simplicity allows for easy integration with existing deep reinforcement learning algorithms that maintain explicit policy estimates. Furthermore, the authors provide clear experimental details, including hyperparameter settings and evaluation metrics, enabling reproducibility and allowing for comparisons with existing approaches. The use of multiple domains for evaluation also helps demonstrate the generalizability of FiGAR across different environments.\r\n\r\nWhile the paper presents strong evidence for the effectiveness of FiGAR, there are a few areas that could benefit from further elaboration. First, a more in-depth discussion of the theoretical motivations behind FiGAR would be valuable. Elaborating on how action repetition and temporal abstractions contribute to the agent's decision-making processes and how this relates to existing reinforcement learning theories could provide a stronger theoretical foundation for the proposed framework. Additionally, insights into the limitations and potential drawbacks of FiGAR, especially in comparison to other approaches, would help in assessing its broader applicability.\r\n\r\nIn summary, the paper proposes a novel framework, FiGAR, for fine-grained action repetition in deep reinforcement learning algorithms. The empirical evaluations support the effectiveness of FiGAR in improving the performance of existing policy search algorithms across multiple domains. The simplicity and easy integration with existing algorithms make FiGAR an attractive approach for enhancing the learning capabilities of agents in sequential decision-making tasks. 
Further theoretical exploration and discussions on limitations would provide valuable insights into the broader implications of this framework.","label":43}
{"id":"14f87a83-0914-4aa3-b7ad-063c738e3651","text":"This paper proposes a simple but effective extension to reinforcement learning algorithms, by adding a temporal repetition component as part of the action space, enabling the policy to select how long to repeat the chosen action for. The extension applies to all reinforcement learning algorithms, including both discrete and continuous domains, as it is primarily changing the action parametrization. The paper is well-written, and the experiments extensively evaluate the approach with 3 different RL algorithms in 3 different domains (Atari, MuJoCo, and TORCS).\r\n\r\nHere are some comments and questions, for improving the paper:\r\n\r\nThe introduction states that \"all DRL algorithms repeatedly execute a chosen action for a fixed number of time steps k\". This statement is too strong, and is actually disproved in the experiments \u2014 repeating an action is helpful in many tasks, but not in all tasks. The sentence should be rephrased to be more precise.\r\n\r\nIn the related work, a discussion of the relation to semi-MDPs would be useful to help the reader better understand the approach and how it builds on existing research. The authors mention the connection to options in the related work section, which are temporally extended actions that have been successfully applied in reinforcement learning. It would be helpful to further discuss the similarities and differences between their approach and options, as well as how FiGAR extends and improves upon existing work. Another point that could be addressed is the scalability of the proposed framework. While the experiments show promising results in the three domains tested, it would be useful to know how FiGAR performs in larger and more complex environments. This could be a potential avenue for future work. Additionally, a more detailed discussion of the limitations and potential drawbacks of the proposed framework would contribute to the overall strength of the paper. Overall, the paper presents a well-motivated and novel approach for incorporating temporal abstractions in reinforcement learning. The experimental results demonstrate the effectiveness of the proposed framework, and the paper is well-written and easy to follow. Addressing the mentioned points would further strengthen the paper and make it more valuable to the reinforcement learning community.","label":169}
{"id":"0d21c55a-cc23-4647-8f0b-8de378b71013","text":"Hi, the main idea is quite interesting. I was curious about the following. My primary question is Q1, and others are predominantly comments.\r\n\r\nQ1: After learning is complete, did you try forward propagating through the network to find actions for every time-step as opposed to repeating actions? Concretely, if at t=5, action suggested by the network is a_3 with a repetition of 4, instead of sticking with a_3 for times t={5,6,7,8} perform action a_3 for just t=5, and forward prop through the policy again at t=6.\r\n\r\nI understand that the goal is to explore temporal abstractions, but for all the problems considered in this paper, a forward prop is not expensive at all..Regarding Q1, while the idea of forward propagating through the network to find actions at every time-step instead of repeating actions is interesting, it is not explored in this paper. The focus of this work is on introducing a novel framework, Fine Grained Action Repetition (FiGAR), that allows the agent to decide both the action and the time scale of repeating it. This framework enables temporal abstractions in the action space and implicit planning through sequences of repetitive macro-actions. The experimental results presented in the paper demonstrate the efficacy of FiGAR by showing performance improvements on three policy search algorithms in various domains. Although forward propagation at every time-step could potentially provide a different perspective, the paper primarily focuses on the advantages of action repetition and temporal abstractions. It would be interesting to investigate the potential benefits of forward propagation in future work, particularly in scenarios where it might be more computationally demanding.","label":111}
{"id":"7a9ced40-6e76-462c-b0a9-18cb94b9b211","text":"This paper is a parallel work to Improving Generative Adversarial Networks with Denoising Feature Matching..In this paper, the authors introduce the concept of Energy-based Generative Adversarial Networks (EBGAN). They propose viewing the discriminator as an energy function that assigns low energies to the regions near the data manifold and higher energies to other regions. This approach allows for the use of different architectures and loss functionals for the discriminator. One instantiation of the EBGAN framework is presented, which utilizes an auto-encoder architecture with the reconstruction error as the energy. The authors demonstrate that this form of EBGAN exhibits more stable behavior during training compared to regular GANs. Furthermore, they show that a single-scale architecture can be trained to generate high-resolution images. Overall, this paper presents a novel approach to GANs and provides promising results.","label":15}
{"id":"47dff23c-8c91-4eb6-8194-3ff9963ca603","text":"The authors have proposed an energy-based rendition of probabilistic GANs, with the addition of using an auto-encoder architecture, with the energy being the reconstruction error, in place of the discriminator. This form of EBGAN exhibits more stable behavior than regular GANs during training and a single-scale architecture can be trained to generate high-resolution images.","label":14}
{"id":"ec344143-643c-41a2-89f7-4f636e2b0f63","text":"This paper is a parallel work to Improving Generative Adversarial Networks with Denoising Feature Matching..In this paper, the authors introduce the Energy-based Generative Adversarial Network (EBGAN), which presents a unique approach by treating the discriminator as an energy function. By attributing low energies to regions near the data manifold and higher energies to other regions, the model encourages the generator to produce contrastive samples with minimal energies. The discriminator, on the other hand, is trained to assign high energies to these generated samples. This energy-based perspective allows for greater flexibility in architecture and loss function selection. The authors demonstrate the effectiveness of this framework by utilizing an auto-encoder architecture, where the energy is represented by the reconstruction error, rather than the traditional binary classifier approach. By doing so, they achieve a more stable training behavior compared to regular GANs. Additionally, the paper highlights the ability to train a single-scale architecture capable of generating high-resolution images. Overall, this work sheds light on the potential of EBGANs as a promising approach for generative modeling.","label":15}
{"id":"42238858-a466-429d-be01-822459b3e9f1","text":"This paper introduces an energy-based Generative Adversarial Network (GAN) and provides theoretical and empirical results modeling a number of image datasets (including large-scale versions of categories of ImageNet). As far as I know energy-based GANs (EBGAN) were introduced in Kim and Bengio (2016), but the proposed version makes a number of different design choices. \r\n\r\nFirst, it does away with the entropy regularization term that Kim and Bengio (2016) introduced to ensure that the GAN discriminator converged to an energy function proportional to the log density of the data (at optimum). This implies that the discriminator in the proposed scheme will become uniform at convergence as discussed in the theoretical section of the paper, however the introductory text seems to imply otherwise -- that one could recover a meaningful score function from the trained energy-function (discriminator). This should be clarified. \r\n\r\nSecond, this version of the EBGAN setting includes two innovations: (1) the introduction of the hinge loss in the value function, and (2) the use of an auto-encoder parametrization for the energy function..The use of the hinge loss in the value function is a key innovation in the proposed EBGAN framework. This loss function allows for better training stability compared to regular GANs, as it focuses on the difference between the generated samples and the discriminator decision boundary. The authors provide empirical evidence to support this claim by demonstrating that EBGAN exhibits more stable behavior during training than traditional GANs.\r\n\r\nThe paper also introduces the use of an auto-encoder as a parametrization for the energy function. This choice offers several advantages, such as the ability to capture complex dependencies in the data and the generation of high-resolution images using a single-scale architecture. The energy in this auto-encoder-based EBGAN is defined as the reconstruction error, which serves as a measure of how well the generator is able to produce samples that resemble the real data.\r\n\r\nWhile the introduction of these innovations is valuable, the paper could benefit from further empirical analysis to compare the performance of the proposed EBGAN with other state-of-the-art GAN architectures. Additionally, more clarity is needed regarding the theoretical implications of removing the entropy regularization term and the consequences of the discriminator becoming uniform at convergence.\r\n\r\nOverall, the paper makes significant contributions to the field of generative adversarial networks by introducing the EBGAN model and presenting novel design choices that improve stability and image generation capabilities. With some additional experimentation and theoretical clarification, this work has the potential to pave the way for further advancements in energy-based GANs.","label":172}
{"id":"05b63681-08be-40c4-92ca-2aec05d9a0e5","text":"This paper proposes a novel extension of generative adversarial networks that replaces the traditional binary classifier discriminator with one that assigns a scalar energy to each point in the generator's output domain. The discriminator minimizes a hinge loss while the generator attempts to generate samples with low energy under the discriminator. The authors show that a Nash equilibrium under these conditions yields a generator that matches the data distribution (assuming infinite capacity). Experiments are conducted with the discriminator taking the form of an autoencoder, optionally including a regularizer that penalizes generated samples having a high cosine similarity to other samples in the minibatch.\r\n\r\nPros:\r\n* The paper is well-written.\r\n* The topic will be of interest to many because it sets the stage for the exploration of a wider variety of discriminators than currently used for training GANs.\r\n* The theorems regarding optimality of the Nash equilibrium appear to be correct.\r\n* Thorough exploration of hyperparameters in the MNIST experiments.\r\n* Semi-supervised results show that contrastive samples from the generator improve classification performance.\r\n\r\nCons:\r\n* The relationship to other works that broaden the scope of the discriminator (e.g. [1]) or use a generative network to use adversarial loss (e.g. UNet [2]) is not well-discussed. It would be useful to have a clear comparison between EBGAN and these related works to understand their similarities and differences. Additionally, more detail is needed on the implementation and training process of EBGAN. The paper briefly mentions that a variety of architectures and loss functionals can be used for the discriminator, but does not provide specific examples or guidelines for choosing these alternatives. Including experimental results comparing different discriminator architectures would strengthen the paper. Furthermore, the paper claims that the EBGAN framework exhibits more stable behavior than regular GANs during training, but this claim is not backed up with empirical evidence or specific metrics. Without this evidence, it is difficult to fully assess the advantage of using EBGAN over traditional GANs. Lastly, while the paper demonstrates the effectiveness of EBGAN on classification tasks, it would be interesting to see how EBGAN performs on other types of generative tasks such as image synthesis or text generation. Overall, the idea of energy-based GANs is intriguing and has the potential to advance the state-of-the-art, but the paper would benefit from providing more details, comparisons, and experimental evidence to support its claims.","label":187}
{"id":"33273651-89a5-4adb-82e1-d7436c25786b","text":"Hi Junbo,\r\n\r\nThis is an interesting paper with appealing samples and thorough comparisons with normal GAN, but I am not sure what is the motivation behind the motivation behind using an energy-based approach in the EBGAN model. It would be helpful if the authors could provide more clarity on this aspect in the next version of the paper.","label":25}
{"id":"44504293-45b9-45ac-99cf-9c3d12c9caf5","text":"This paper is a parallel work to Improving Generative Adversarial Networks with Denoising Feature Matching..The paper presents the Energy-based Generative Adversarial Network (EBGAN) model, which introduces the discriminator as an energy function attributing low energies to regions near the data manifold and higher energies elsewhere. It proposes an alternative architecture using an auto-encoder with the energy function being the reconstruction error. This form of EBGAN exhibits more stable behavior during training compared to regular GANs. Additionally, the paper demonstrates that a single-scale architecture can be trained to generate high-resolution images. Overall, the research contributes to the advancements in GANs by introducing an energy-based approach and providing insights into training stability and high-resolution image generation.","label":15}
{"id":"d1c8b942-a162-4e44-8c0f-8533f4c6871b","text":"This paper is a parallel work to Improving Generative Adversarial Networks with Denoising Feature In this paper, the authors propose the Energy-based Generative Adversarial Network (EBGAN) model, which introduces the discriminator as an energy function that assigns low energies to regions near the data manifold and higher energies to other regions. This approach allows for the use of a variety of architectures and loss functionals, going beyond the traditional binary classifier. The authors demonstrate the instantiation of EBGAN using an auto-encoder architecture, where the energy is defined as the reconstruction error. They show that this form of EBGAN exhibits more stable behavior during training than regular GANs and can generate high-resolution images with a single-scale architecture.","label":14}
{"id":"99311e67-2ccb-488c-aece-8c1b54e3a550","text":"This paper introduces an energy-based Generative Adversarial Network (GAN) and provides theoretical and empirical results modeling a number of image datasets (including large-scale versions of categories of ImageNet). As far as I know energy-based GANs (EBGAN) were introduced in Kim and Bengio (2016), but the proposed version makes a number of different design choices. \r\n\r\nFirst, it does away with the entropy regularization term that Kim and Bengio (2016) introduced to ensure that the GAN discriminator converged to an energy function proportional to the log density of the data (at optimum). This implies that the discriminator in the proposed scheme will become uniform at convergence as discussed in the theoretical section of the paper, however the introductory text seems to imply otherwise -- that one could recover a meaningful score function from the trained energy-function (discriminator). This should be clarified. \r\n\r\nSecond, this version of the EBGAN setting includes a reconstruction error-based auto-encoder architecture for the energy function. This is an interesting choice as it allows for more flexibility in the architecture and loss function used by the discriminator. The authors provide experimental results showing that this form of EBGAN exhibits more stable behavior during training compared to regular GANs. They also demonstrate that a single-scale architecture can be trained to generate high-resolution images, which is an impressive achievement. However, there are a few points that need clarification. The abstract suggests that the discriminator can use a wide variety of architectures and loss functionals, but it would be helpful to provide more details and examples in the paper. Additionally, it would be beneficial to include a more comprehensive discussion on the limitations and potential drawbacks of the proposed EBGAN model. Overall, this paper contributes to the GAN literature by introducing a novel energy-based approach that offers stability during training and the ability to generate high-resolution images.","label":147}
{"id":"15a61f21-4a98-4d84-8a4c-e573e60d7b63","text":"This paper proposes a novel extension of generative adversarial networks that replaces the traditional binary classifier discriminator with one that assigns a scalar energy to each point in the generator's output domain. The discriminator minimizes a hinge loss while the generator attempts to generate samples with low energy under the discriminator. The authors show that a Nash equilibrium under these conditions yields a generator that matches the data distribution (assuming infinite capacity). Experiments are conducted with the discriminator taking the form of an autoencoder, optionally including a regularizer that penalizes generated samples having a high cosine similarity to other samples in the minibatch.\r\n\r\nPros:\r\n* The paper is well-written.\r\n* The topic will be of interest to many because it sets the stage for the exploration of a wider variety of discriminators than currently used for training GANs.\r\n* The theorems regarding optimality of the Nash equilibrium appear to be well-grounded and add credibility to the proposed method. Furthermore, the experimental section provides valuable insights into the performance of the Energy-based Generative Adversarial Networks (EBGANs). The authors evaluate the EBGAN framework using different architectures and loss functionals for the discriminator. One such instantiation of the EBGAN framework is an auto-encoder architecture, where the energy is defined as the reconstruction error. The results demonstrate that this form of EBGAN exhibits more stable behavior during training compared to regular GANs. Additionally, the authors show that a single-scale architecture can be trained to generate high-resolution images. This finding opens up new possibilities for generating visually impressive and realistic images. The paper also highlights the flexibility of the EBGAN framework by stating that a wide variety of architectures and loss functions can be employed for the discriminator. This is a significant advantage as it allows researchers to explore different approaches and potentially achieve better performance. It would be interesting to see further investigations on the impact of different discriminator architectures and loss functionals on the overall performance of EBGANs. Overall, the paper presents a well-developed method that extends the capabilities of generative adversarial networks. The experimental results provide strong evidence that the EBGAN framework is effective in generating high-quality samples. However, it is recommended to provide more insights into the limitations and potential challenges of the proposed method. Additionally, further comparison with existing state-of-the-art models could strengthen the paper. Nonetheless, the proposed EBGAN framework has the potential to advance the field of generative models and pave the way for further research in this area. ","label":146}
{"id":"06e77de4-07c8-4e31-b116-19f397b0d07e","text":"Hi Junbo,\r\n\r\nThis is an interesting paper with appealing samples and thorough comparisons with normal GAN, but I am not sure what is the motivation behind using hinge loss..It would have been helpful to explicitly state the motivation for using hinge loss in the paper. Additionally, providing more detailed insights into the stability improvements observed in EBGAN compared to regular GANs would enhance the overall understanding of the model.","label":28}
{"id":"20ee0cb7-69f2-4bd0-8238-277c52c15420","text":"This paper presents a semi-supervised technique for \u201cself-ensembling\u201d where the model uses a consensus prediction (computed from previous epochs) as a target to regress to, in addition to the usual supervised learning loss. This has connections to the \u201cdark knowledge\u201d idea, ladder networks work is shown in this paper to be a promising technique for scenarios with few labeled examples (but not only). The paper presents two versions of the idea: one which is computationally expensive (and high variance) in that it needs two passes through the same example at a given step, and a temporal ensembling method that is stabler, cheaper computationally but more memory hungry and requires an extra hyper-parameter. \r\n\r\n\r\nMy thoughts on this work are mostly positive..The self-ensembling technique presented in this paper is a valuable contribution to the field of semi-supervised learning. The idea of using consensus predictions from previous epochs as target labels is intriguing and has shown promising results. The comparison between the computationally expensive version and the temporal ensembling method provides valuable insights into their strengths and weaknesses. It is worth noting that the temporal ensembling method might be more memory hungry and requires an extra hyper-parameter, which could limit its practicality in some scenarios. However, the authors demonstrate that the technique achieves state-of-the-art results on standard benchmarks such as SVHN and CIFAR-10, even outperforming the previous methods. The ability of the proposed method to handle incorrect labels is also an important aspect. Overall, this work presents an innovative approach to semi-supervised learning that improves classification accuracy with limited labeled data. I believe this paper makes a significant contribution to the field and provides a foundation for future research in this area.","label":120}
{"id":"48e93fa8-1faf-4f9c-a4f5-5f656ec7b929","text":"The reviewers all agree that this is a strong, well-written paper that should be accepted to the conference. The reviewers would like to see the authors extend the analysis to larger data sets and investigate the performance of the proposed method on more challenging datasets. Additionally, the reviewers suggest exploring the applicability of self-ensembling to other domains beyond computer vision. Overall, the reviewers believe that this work has significant potential and would greatly benefit the research community in the field of semi-supervised learning.","label":34}
{"id":"baac66ea-6970-47f0-bd2d-6e3869f0e84b","text":"We have uploaded a new revision of the paper that includes additional results from CIFAR-100 and Tiny Images datasets (Section 3.3 + end of Appendix A). Especially the Tiny Images test was quite interesting, and we thank Reviewer #1 for the suggestion. For the sake of consistency, we updated Figure 2 with results from these new experiments. We also made some minor adjustments to the text based on the feedback provided by the reviewers. We believe that these additions and refinements have greatly strengthened the paper and addressed several important points. We would like to express our gratitude to the reviewers for their valuable feedback and suggestions. We believe that the paper is now ready for publication and we hope that the reviewers will agree.","label":54}
{"id":"085bd14e-e63c-4d2a-898a-92b439f747fd","text":"This paper presents a model for semi-supervised learning by encouraging feature invariance to stochastic perturbations of the network and\/or inputs. Two models are described: One where an invariance term is applied between different instantiations of the model\/input a single training step, and a second where invariance is applied to features for the same input point across training steps via a cumulative exponential averaging of the features. These models evaluated using CIFAR-10 and SVHN, finding decent gains of similar amounts in.classification accuracy. The proposed self-ensembling method is a simple and efficient approach for training deep neural networks in a semi-supervised setting. By forming a consensus prediction of the unknown labels using the outputs of the network-in-training under different conditions, the ensemble prediction acts as an improved target for training compared to the output of the most recent epoch. This approach has shown remarkable results in reducing the classification error rate in both SVHN and CIFAR-10 benchmarks. For SVHN with 500 labels, the error rate was reduced from 18.44% to 7.05%, while for CIFAR-10 with 4000 labels, the error rate dropped from 18.63% to 16.55%. These results further improved to 5.12% and 12.16% when standard augmentations were enabled. Additionally, incorporating random images from the Tiny Images dataset as unlabeled extra inputs during training improved CIFAR-100 classification accuracy significantly. Moreover, the proposed method showed good tolerance to incorrect labels, demonstrating its robustness in handling noisy training data. In conclusion, this paper presents a novel approach to semi-supervised learning that achieves state-of-the-art performance on standard benchmarks. The simplicity and efficiency of the self-ensembling method make it a valuable addition to the field, with potential applications in various domains that require training deep neural networks with limited labeled data.","label":83}
{"id":"4a857815-ef61-4efb-8b3e-6186db1f5995","text":"This work explores taking advantage of the stochasticity of neural network outputs under randomized augmentation and regularization techniques to provide targets for unlabeled data in a semi-supervised setting. This is accomplished by either applying stochastic augmentation and regularization on a single image multiple times per epoch and encouraging the outputs to be similar (\u03a0-model) or by keeping a weighted average of past epoch outputs and penalizing deviations of current network outputs from the ensemble prediction (Temporal Ensembling). The authors argue that by leveraging the consensus prediction of the unknown labels using the outputs of the network-in-training on different epochs, and under different regularization and input augmentation conditions, a better predictor for the unknown labels can be obtained compared to the output of the network at the most recent training epoch. This ensemble prediction is then used as a target for training the network. The method proposed by the authors sets new records for two standard semi-supervised learning benchmarks: SVHN with 500 labels and CIFAR-10 with 4000 labels. In SVHN, the classification error rate is reduced from 18.44% to 7.05%, and in CIFAR-10, it is reduced from 18.63% to 16.55%. The authors also demonstrate further improvements by enabling standard augmentations, achieving error rates of 5.12% in SVHN and 12.16% in CIFAR-10. Additionally, the authors conduct experiments on CIFAR-100 and show improved classification accuracy by using random images from the Tiny Images dataset as unlabeled extra inputs during training. Moreover, the proposed method exhibits good tolerance to incorrect labels, which is an important characteristic in real-world scenarios where training data may contain labeling errors. Overall, this work presents a simple yet efficient approach for semi-supervised learning by leveraging the stochasticity of neural network outputs under different training conditions. The results are impressive, setting new records in benchmark datasets, and the method shows promise for further applications in related fields. However, it would be valuable to see more analysis and comparisons with existing semi-supervised learning methods to better understand the strengths and limitations of the proposed technique.","label":72}
{"id":"b64cd26c-f30f-46d6-bcec-b45c470a1076","text":"It has come to our attention that a recent paper titled \"Regularization With Stochastic Transformations and Perturbations for Deep Semi-Supervised Learning\" by Sajjadi et al., presented at NIPS 2016, builds on the same core principle as our work. We have therefore uploaded a response in the rebuttal. We appreciate the authors' acknowledgment of the similarity between their work and ours and we will carefully consider the findings presented in their paper. We will also update our manuscript to reference their work and compare our results to theirs in a revised version. Additionally, we would like to highlight some key contributions of our method. Firstly, our self-ensembling approach incorporates consensus predictions from different epochs, under various regularization and input augmentation conditions. This ensemble prediction provides a stronger target for training than the output of the network at the most recent epoch. Secondly, we achieved state-of-the-art performance on two widely-used benchmarks, SVHN and CIFAR-10, even outperforming the results of Sajjadi et al. Finally, we demonstrated the effectiveness of our method in handling incorrect labels and showed good tolerance towards them. We believe that these contributions add value to the field of semi-supervised learning and justify the publication of our work.","label":43}
{"id":"97038b33-0993-4d58-b8ae-58ff2779a182","text":"This paper presents a semi-supervised technique for \u201cself-ensembling\u201d where the model uses a consensus prediction (computed from previous epochs) as a target to regress to, in addition to the usual supervised learning loss. This has connections to the \u201cdark knowledge\u201d idea, ladder networks work is shown in this paper to be a promising technique for scenarios with few labeled examples (but not only)..The technique of self-ensembling proposed in this paper is an interesting approach for semi-supervised learning. By incorporating the consensus prediction of the unknown labels using the outputs of the network from different epochs, the model is able to obtain a better predictor for the unknown labels. This ensemble prediction serves as a target for training alongside the supervised learning loss, improving the overall performance of the deep neural networks. The authors demonstrate the effectiveness of their method by setting new records on two standard semi-supervised learning benchmarks, SVHN and CIFAR-10. They achieve significant reductions in the classification error rate, highlighting the potential of self-ensembling in scenarios with limited labeled examples. Moreover, the authors extend their experiments to CIFAR-100, where they incorporate random images from the Tiny Images dataset as unlabeled extra inputs during training. The results show a clear improvement in CIFAR-100 classification accuracy, further validating the capability of the proposed approach. Additionally, the model exhibits good tolerance to incorrect labels, which is a valuable trait in real-world data scenarios. Overall, this paper presents a promising method for semi-supervised learning and provides insights into the benefits of self-ensembling and its potential applications beyond limited labeled examples.","label":63}
{"id":"e434f84b-aa98-4254-bad8-ed22500fe683","text":"Out of the papers I have reviewed so far, I find this paper to be a promising approach for semi-supervised learning. The method of self-ensembling and using consensus predictions appears to be a simple yet effective way to improve classification accuracy in scenarios with limited labeled data. The achieved reduction in classification error rates on standard benchmarks such as SVHN and CIFAR-10 demonstrates the potential of this approach. Additionally, the improvement in CIFAR-100 classification accuracy using the Tiny Images dataset as unlabeled extra inputs is a notable finding. The paper also addresses the issue of tolerance to incorrect labels, which is an important consideration in real-world applications. Overall, this paper presents a valuable contribution to the field of semi-supervised learning.","label":2}
{"id":"26525e54-d4bb-47a0-b7f6-a4e648501467","text":"This paper presents a semi-supervised technique for \u201cself-ensembling\u201d where the model uses a consensus prediction (computed from previous epochs) as a target to regress to, in addition to the usual supervised learning loss. This has connections to the \u201cdark knowledge\u201d idea, ladder networks work is shown in this paper to be a promising technique for scenarios with few labeled examples (but not only). The paper presents two versions of the idea: one which is computationally expensive (and high variance) in that it needs two passes through the same example at a given step, and a temporal ensembling method that is stabler, cheaper computationally but more memory hungry and requires an extra hyper-parameter. \r\n\r\n\r\nMy thoughts on this work are mostly positive..The self-ensembling technique presented in this paper seems to be a promising approach for semi-supervised learning in scenarios with limited labeled examples. The idea of using a consensus prediction from previous epochs as a target for regression is interesting and draws connections to the concept of 'dark knowledge'. The paper provides two versions of the approach, one being computationally expensive but yielding more accurate results, and the other being computationally cheaper but requiring additional memory and a hyper-parameter. This flexibility in options allows the method to be adapted to different computational constraints and trade-offs. However, it would have been useful to provide a more detailed analysis of the computational requirements and trade-offs between the two versions. Overall, the experimental results presented in the paper demonstrate the effectiveness of the self-ensembling technique, as it achieves new state-of-the-art results on standard semi-supervised learning benchmarks. Additionally, the method shows good tolerance to incorrect labels, which is a valuable characteristic for real-world applications. The paper could benefit from further addressing the limitations and potential extensions of the proposed technique.","label":120}
{"id":"b4da9cf5-f25d-406b-a637-afe9914c783e","text":"The reviewers all agree that this is a strong, well-written paper that should be accepted to the conference. The reviewers would like to see the authors extend the analysis to larger data sets and extend the variety of augmentations..The reviewers all agree that this is a strong, well-written paper that should be accepted to the conference. They appreciate the clear improvement in CIFAR-100 classification accuracy and urge the authors to extend the analysis to larger datasets and explore a wider range of augmentations for a more comprehensive evaluation.","label":39}
{"id":"95bf02cf-1acb-450f-8ffb-01d7566b80f8","text":"We have uploaded a new revision of the paper that includes additional results from CIFAR-100 and Tiny Images datasets (Section 3.3 + end of Appendix A). Especially the Tiny Images test was quite interesting, and we thank Reviewer #1 for their valuable feedback. However, we recommend further improvements for the paper, specifically in terms of experimental methodology. More detailed explanations and justifications of the chosen regularization and input augmentation conditions are necessary to validate the effectiveness of the proposed self-ensembling method. Additionally, clearer explanations of the training process and implementation details would enhance the reproducibility of the results. Overall, the addition of results from CIFAR-100 and Tiny Images datasets adds significant value to the paper, but addressing these points would further strengthen the research contribution.","label":39}
{"id":"82cde8ef-ce4b-4aa7-a0f7-e8571c2d7d64","text":"This paper presents a model for semi-supervised learning by encouraging feature invariance to stochastic perturbations of the network and\/or inputs. Two models are described: One where an invariance term is applied between different instantiations of the model\/input a single training step, and a second where invariance is applied to features for the same input point across training steps via a cumulative exponential averaging of the features. These models evaluated using CIFAR-10 and SVHN, finding decent gains of similar amounts in each case. An additional application is also explored at the end, showing some tolerance to corrupted labels as well.\r\n\r\nThe authors also discuss recent work by Sajjadi &al that is very similar in spirit, which I think helps corroborate the findings presented in this paper. The authors acknowledge the limitations of their method, such as the sensitivity to the choice of hyperparameters and the lack of thorough analysis on larger datasets. However, they provide insightful discussions on potential directions for future research, including the exploration of different ensembling techniques, the investigation of additional regularization methods, and the extension of the approach to other domains beyond image classification. Overall, this paper makes a significant contribution to the field of semi-supervised learning by proposing a simple yet effective method, self-ensembling, that achieves impressive results on standard benchmarks. The experimental evidence is compelling, and the thorough analysis of the proposed approach adds credibility to the findings. The paper is well-written, and the research is clearly explained, making it accessible even to readers with limited background in deep learning. The inclusion of the comparison to related work further strengthens the importance of the contributions made by the authors. In conclusion, I believe this paper makes a valuable contribution to the field and should be accepted for publication. ","label":123}
{"id":"58898c10-15e0-4dc7-ba6e-650817e55121","text":"This work explores taking advantage of the stochasticity of neural network outputs under randomized augmentation and regularization techniques to provide targets for unlabeled data in a semi-supervised setting. This is accomplished by either applying stochastic augmentation and regularization on a single image multiple times per epoch and encouraging the outputs to be similar (\u03a0-model) or by keeping a weighted average of past epoch outputs and penalizing deviations of current network outputs from this running mean (temporal ensembling). The core argument is that these approaches produce ensemble predictions which are likely more accurate than the current network and are thus good targets for unlabeled data. Both approaches seem to work quite well on semi-supervised tasks and some results show that they are almost unbelievably robust to label noise.\r\n\r\nThe paper is clearly written and provides sufficient details to reproduce these results in future experiments. However, there are a few areas that could be further discussed in the paper. First, the authors mention that they use a weighted average of the last several training epochs' predictions to form the target for unlabeled data, but it would be helpful to have more information on how the weights are determined. Are all past epochs given equal weight, or are recent epochs weighted more heavily? Additionally, it would be interesting to explore the impact of different regularization techniques and augmentation strategies on the performance of the proposed method. The paper mentions that they use various combinations of regularization and augmentation, but it would be valuable to have a more detailed analysis of how different choices affect the results. Lastly, the authors demonstrate good tolerance to incorrect labels, but it would be beneficial to understand the limits of this tolerance. How much label noise can the method handle while still maintaining good accuracy? Overall, this paper presents a simple and efficient method for semi-supervised learning that achieves impressive results on standard benchmarks. The approach of using ensemble predictions as targets for unlabeled data is compelling, and the experiments performed demonstrate the effectiveness of the proposed method. With some additional discussion and analysis in key areas, this paper could be even stronger. ","label":140}
{"id":"a8b6b906-90d7-41e0-8162-8cb57e7ff01b","text":"It has come to our attention that a recent paper titled \"Regularization With Stochastic Transformations and Perturbations for Deep Semi-Supervised Learning\" by Sajjadi et al., presented at NIPS 2016, builds on the same core principle as our work. We have therefore uploaded a new revision of our paper to cite this related work and to contrast our contributions We have included the citation and comparison to this related work in the revised version of our paper. We believe that our method of self-ensembling, where a consensus prediction of unknown labels is formed using different epoch outputs, provides a strong foundation for semi-supervised learning. Our approach achieves significant improvements in classification error rates for SVHN and CIFAR-10 benchmarks, as well as a clear improvement in CIFAR-100 accuracy by utilizing unlabeled extra inputs. Furthermore, our method demonstrates good tolerance to incorrect labels. We aim to further explore the potential of self-ensembling in future research and extend its applicability to other domains.","label":58}
{"id":"1dd5475d-36a6-49c2-9f3c-9601abd5f87d","text":"This paper presents a semi-supervised technique for \u201cself-ensembling\u201d where the model uses a consensus prediction (computed from previous epochs) as a target to regress to, in addition to the usual supervised learning loss. This has connections to the \u201cdark knowledge\u201d idea, ladder networks work is shown in this paper to be a promising technique for scenarios with few labeled examples (but not only). The paper presents two versions of the idea: one which is computationally expensive (and high variance) in that it needs two passes through the same example at a given step, and a temporal ensembling method that is stabler, cheaper computationally but more memory hungry and requires an extra hyper-parameter. \r\n\r\n\r\nMy thoughts on this work are mostly positive. The drawbacks that I see are that the temporal ensembling work requires potentially a lot of memory, and non-trivial infrastructure \/ book-keeping for imagenet-sized experiments..The authors propose a simple and efficient method for training deep neural networks in a semi-supervised setting with limited labeled data. They introduce the concept of self-ensembling, where a consensus prediction is formed using the outputs of the network on different epochs under various regularization and input augmentation conditions. This ensemble prediction serves as a target for training and has shown to be a better predictor than the output of the network at the most recent training epoch. The results on standard semi-supervised learning benchmarks are impressive, with significant reductions in classification error rates. However, the computational expense and memory requirements of the temporal ensembling method might limit its application on larger datasets. Additionally, the paper lacks detailed discussion on the selection of hyper-parameters. Overall, this work presents a promising approach to semi-supervised learning and highlights the importance of leveraging unlabeled data to improve model performance.","label":145}
{"id":"998c0380-c1c2-4272-bd9b-4ba8022fa2cf","text":"Out of curiosity: did you perform any ablation studies to understand the impact of different regularization and input augmentation conditions on the performance of the self-ensembling method?","label":5}
{"id":"6d5724cc-d83d-431f-8678-302d1f4aca6e","text":"In this paper, the authors propose a new method to learn hierarchical representations of sentences, based on reinforcement learning. They propose to learn a neural shift-reduce parser, such that the induced tree structures lead to good performance on a downstream task. They use reinforcement learning (more specifically, the policy gradient method REINFORCE) to learn their model. The reward of the algorithm is the evaluation metric of the downstream task. The authors compare two settings, (1) no structure information is given (hence, the only supervision comes from the downstream task) and (2) actions from an external parser is used as supervision to train the policy network, in addition to the supervision from the downstream task..The authors conduct experiments to evaluate the effectiveness of their proposed method. They compare the performance of their model with sequential encoders and recursive encoders that are based on treebank annotations. The results show that learning task-specific composition orders using reinforcement learning outperforms both sequential encoders and recursive encoders. This suggests that optimizing tree structures through reinforcement learning can lead to improved performance on downstream tasks. Moreover, the authors analyze the induced trees and find that they discover some linguistically intuitive structures, such as noun phrases and simple verb phrases. However, they also note that these structures differ from conventional English syntactic structures. This finding implies that the model is able to capture meaningful representations of sentences, even if they do not align perfectly with traditional syntax. Overall, the paper presents an innovative approach to learning hierarchical representations of sentences using reinforcement learning. The experimental results provide strong evidence of the effectiveness of the proposed method. The analysis of the induced trees enhances our understanding of the learned representations. Future research can explore further improvements in the model's performance and investigate the potential applications of such learned representations in other natural language processing tasks.","label":114}
{"id":"a0894c65-d282-4752-96da-d777faf8b3d5","text":"All the reviewers agreed that the research direction is very interesting, and generally find the results promising. We could quibble a bit about the results not being really state-of-the-art and the choice of baselines, but I think the main claims are well supported by the evidence. The use of reinforcement learning to learn tree-structured neural networks for computing representations of natural language sentences is a novel approach that sets this research apart. The paper clearly explains the motivation behind their method and the experimental results provide strong evidence for the effectiveness of learning task-specific composition orders. Additionally, the analysis of the induced trees and their differences from conventional English syntactic structures adds valuable insights to the field. Overall, this paper presents a significant contribution to the area of tree-structured models for natural language processing, and it is worth considering for publication.","label":45}
{"id":"1570061f-09cd-4a04-9204-92cb34bb88ef","text":"I have not much to add to my pre-review comments.\r\nIt's a very well written paper with an interesting idea.\r\nLots of people currently want to combine RL with NLP. It is very en vogue.\r\nNobody has gotten around to applying it to sentence composition, so this paper fills a gap in the current research. The authors clearly state their motivation and objectives, and they provide a thorough explanation of the methodology used. The experiments are well-designed and the results show a clear advantage of using reinforcement learning for learning task-specific composition orders. The analysis of the induced trees is a valuable contribution, as it highlights both the strengths and limitations of the approach. However, it would be beneficial to further explore the reasons behind the differences between the induced trees and conventional English syntactic structures. Additionally, it would be interesting to evaluate the performance of the proposed approach on other downstream tasks to assess its generalizability. Overall, this paper makes a valuable contribution to the field and I would recommend it for publication.","label":35}
{"id":"b848d60f-7e02-487d-a376-ede76e7947f5","text":"The paper proposes to use reinforcement learning to learn how to compose the words in a sentence, i.e. parse tree, that can be helpful for the downstream tasks. To do that, the shift-reduce framework is employed and RL is used to learn the policy of the two actions SHIFT and REDUCE. The experiments on four datasets (SST, SICK, IMDB, and SNLI) show that the proposed approach outperformed the approach using predefined tree structures (e.g. left-to-right, right-to-left). \r\n\r\nThe paper is well written and has two good points..However, there are some aspects that could be further improved. Firstly, the paper could provide more details on the specific reinforcement learning algorithm used and how the policy for SHIFT and REDUCE actions was learned. Additionally, the paper could benefit from a more thorough analysis of the induced trees and their comparison with conventional English syntactic structures. While it is mentioned that the induced trees discover some linguistically intuitive structures, it would be useful to have examples and further insights into the differences. Moreover, the paper only focuses on the comparison with sequential encoders and recursive encoders based on treebank annotations, but it would be interesting to see how the proposed approach performs against other state-of-the-art parsing methods. Lastly, the paper could also discuss the limitations and potential challenges of using reinforcement learning for parsing tasks. Overall, the paper presents a promising approach for learning composition orders and improving performance on downstream tasks, but further clarity and analysis are needed to strengthen the findings and contribute to the existing literature.","label":86}
{"id":"a3fead18-8cbd-4603-9ee3-d59a91a38af2","text":"In this paper, the authors propose a new method to learn hierarchical representations of sentences, based on reinforcement learning. They propose to learn a neural shift-reduce parser, such that the induced tree structures lead to good performance on a downstream task. They use reinforcement learning (more specifically, the policy gradient method REINFORCE) to learn their model. The reward of the algorithm is the evaluation metric of the downstream task. The authors compare two settings, (1) no structure information setting in which the model does not have access to any tree structure information, and (2) a setting in which the model is provided with explicit treebank annotations. The authors demonstrate that their approach outperforms both sequential encoders and recursive encoders in terms of performance on the downstream task. They also conduct an analysis of the induced trees and find that while these trees capture some linguistically intuitive structures such as noun phrases and simple verb phrases, they differ from conventional English syntactic structures.\r\n\r\nThe use of reinforcement learning in this work is a novel and interesting approach to learning tree-structured neural networks for sentence representation. By optimizing the tree structures to improve performance on a specific task, the authors show that their method can outperform previous approaches that either use fixed tree structures or rely on supervised learning to predict tree structures from annotated data.\r\n\r\nOne strength of this paper is the thorough evaluation of the proposed approach. The authors compare their method with several baselines, including sequential encoders and recursive encoders, and show statistically significant improvements in performance. Furthermore, the authors provide detailed analysis of the induced trees, which adds insights into the capabilities and limitations of their model.\r\n\r\nHowever, there are some areas that could be further expanded upon. Firstly, the authors mention that their induced trees are different from conventional English syntactic structures, but they do not provide a detailed analysis or explanation of why this is the case. It would be beneficial to have more discussion on the reasons behind these differences and how they impact the model's performance.\r\n\r\nAdditionally, the authors could provide more insights into the limitations and challenges of using reinforcement learning for learning tree structures. While the results are promising, there might be scenarios where the model struggles to learn effective tree structures, and it would be interesting to understand these scenarios better.\r\n\r\nOverall, this paper presents a valuable contribution to the field of sentence representation. By leveraging reinforcement learning, the authors demonstrate the efficacy of learning task-specific composition orders and show improvements over existing approaches. With some further discussions and analysis, this work has the potential to make even stronger contributions.","label":78}
{"id":"ef80e065-9e0b-493c-af82-0b211af467c2","text":"It was slow to train the model because you had to build a different new computational graph for each sentence..However, despite the slow training process due to the need for building unique computational graphs for each sentence, the paper provides valuable insights into the use of reinforcement learning for tree-structured neural networks in sentence representation computation. The experiments demonstrate the advantages of learning task-specific composition orders over sequential and recursive encoders, as well as the discovery of linguistically intuitive structures. Furthermore, the mention of deviations from conventional English syntactic structures opens up possibilities for further exploration and understanding of language processing.","label":20}
{"id":"61d169b3-4167-467f-8084-a21a4342c3de","text":"Dear authors,\r\n\r\nI've read the interesting paper and learned nice ideas.\r\n\r\nNow I've found somewhat incorrect mention in your paper.\r\nIn Introduction, it is said that trees are provided with sentences in this work, but in the experimental results section, it is mentioned that the tree structures are optimized to improve performance on a downstream task. I would suggest clarifying this contradiction and providing more details on how the tree structures are actually obtained. Additionally, it would be interesting to see a more thorough analysis of the induced trees and their differences from conventional syntactic structures. Overall, the paper presents an innovative approach to learning tree-structured neural networks for sentence representation and has the potential to significantly contribute to the field of natural language processing.","label":29}
{"id":"3d0369e8-879c-4b36-bf3e-c2eb8aade32e","text":"In several of the evaluation, the \"constituency tree LSTM\" and\/or \"dependency tree LSTM\" methods perform much better than all of your proposed models, including the \"supervised syntax\" model, and the \"supervised syntax\" model, and consistently outperform the baseline models. This suggests that the reinforcement learning approach in this paper effectively learns task-specific composition orders and improves performance on downstream tasks. Additionally, the analysis of the induced trees reveals that while they capture some linguistically intuitive structures, they differ from conventional English syntactic structures.","label":29}
{"id":"89747d88-3bb9-4f2f-bd67-125820ba0e7a","text":"Big fan of this work. Related to the above comment, have you guys tried a simple baseline where you just assume a fixed composition order, such as depth-first or breadth-first, instead of optimizing the tree structures using reinforcement learning to improve downstream task performance? It would be interesting to see how this compares to the approach proposed in the paper.","label":22}
{"id":"a046b6df-7453-490f-b0e9-04b24d959640","text":"Table 2 is missing some recent results on this task. Please see the NTI and NSE NTI and NSE are two recent models that have shown promising results on this task. Including the results from these models in Table 2 would provide a more comprehensive overview of the current state-of-the-art. Additionally, it would be helpful to provide a discussion on the limitations of the proposed approach. While the induced trees seem to capture some linguistic structures, it would be interesting to explore why they differ from conventional English syntactic structures. Overall, this paper presents an innovative approach using reinforcement learning for sentence representation learning, and with some minor additions and further analysis, it can make valuable contributions to the field.","label":16}
{"id":"654e0a57-fe60-466a-9c1d-2536413154d9","text":"In this paper, the authors propose a new method to learn hierarchical representations of sentences, based on reinforcement learning. They propose to learn a neural shift-reduce parser, such that the induced tree structures lead to good performance on a downstream task. They use reinforcement learning (more specifically, policy gradient algorithms) to optimize the parser's action choices during the tree construction process. The experiments conducted by the authors demonstrate the effectiveness of their approach in learning task-specific composition orders. The proposed method outperforms both sequential encoders and recursive encoders that rely on treebank annotations. This is a notable achievement as it suggests that the induced tree structures are specifically tailored to improve performance on downstream tasks, rather than being constrained by pre-defined syntactic structures. The authors also analyze the induced trees and find that they discover some linguistically intuitive structures such as noun phrases and simple verb phrases. However, they note that these structures differ from conventional English syntactic structures. The findings indicate that the proposed method has the potential to capture non-trivial linguistic patterns and may provide insights into the underlying processes of sentence composition. Overall, this paper presents a novel approach to learning hierarchical representations of sentences through reinforcement learning. The results are promising, showing improved performance compared to existing methods while also uncovering unconventional syntactic structures. The clear and well-structured presentation of the paper makes it easy to understand the proposed approach and the reasoning behind the experiments. However, it would be beneficial to provide a more detailed discussion on the limitations and potential future directions of this research. Additionally, conducting further analysis on why the induced tree structures differ from conventional syntactic structures would enhance the depth of the study. Nonetheless, this work is a commendable contribution to the field and paves the way for future research in the intersection of reinforcement learning and sentence composition.","label":47}
{"id":"d1b870e3-7cab-4380-b385-20bf798cc85f","text":"All the reviewers agreed that the research direction is very interesting, and generally find the results promising. We could quibble a bit about the results not being really state-of-the-art and the choice of baselines, but I think the main claims are well supported by the experiments (i.e. the induce grammar appears to be useful for the problem m at hand). The paper provides a clear explanation of the proposed approach and presents detailed experimental results to validate its effectiveness. The analysis of the induced trees and their comparison with conventional English syntactic structures adds valuable insights to the study. Overall, this paper makes a significant contribution to the field of natural language processing and reinforcement learning. However, it would be beneficial to include a discussion on potential limitations and future directions for further research.","label":57}
{"id":"8442c8e7-5317-4738-80bd-3a209cc30f39","text":".In this paper, the authors propose a novel approach to learning tree-structured neural networks for computing representations of natural language sentences using reinforcement learning. The key contribution of this work is the optimization of tree structures to enhance performance on downstream tasks, contrary to prior research where the trees are predetermined or generated through supervised treebank annotations. The experiments conducted in this study clearly illustrate the advantages of learning task-specific composition orders, which outperform both sequential encoders and recursive encoders based on treebank annotations. This finding demonstrates that the optimized tree structures improve the overall performance of the model in sentence composition. Moreover, the researchers conduct an analysis of the induced trees and reveal that although these structures do uncover some linguistically intuitive components such as noun phrases and simple verb phrases, they deviate from conventional English syntactic structures. This finding suggests that the optimized trees favor alternative linguistic representations that may provide new insights into the organization and composition of sentences. Overall, this paper introduces a valuable contribution to the field of natural language processing, highlighting the significance of reinforcement learning in the generation of optimized tree structures for sentence understanding.","label":0}
{"id":"a014f5ed-fe72-4e68-87b0-b8a0accb7898","text":"The paper proposes to use reinforcement learning to learn how to compose the words in a sentence, i.e. parse tree, that can be helpful for the downstream tasks. To do that, the shift-reduce framework is employed and RL is used to learn the policy of the two actions SHIFT and REDUCE. The experiments on four datasets (SST, SICK, IMDB, and SNLI) show that the proposed approach outperformed the approach using predefined tree structures (e.g. left-to-right, right-to-left). \r\n\r\nThe paper is well written and has two good points. Firstly, the idea of using RL to learn parse trees using downstream tasks is very interesting and novel. And employing the shift-reduce framework is a very smart choice because the set of actions is minimal (shift and reduce)..Secondly, the experiments conducted on different datasets (SST, SICK, IMDB, and SNLI) provide a comprehensive evaluation of the proposed approach. The results clearly show that the approach outperforms methods using predefined tree structures such as left-to-right or right-to-left. This highlights the effectiveness and flexibility of using reinforcement learning to optimize the composition order of words in a sentence. Additionally, the analysis of the induced trees is a valuable contribution of this work. The paper demonstrates that while the induced trees capture some linguistically intuitive structures like noun phrases and simple verb phrases, they differ from conventional English syntactic structures. This finding suggests that the learned trees capture task-specific composition patterns rather than relying on pre-defined linguistic rules. Overall, the paper presents an innovative approach to learning composition orders using reinforcement learning, with compelling experimental results and insightful analysis.","label":124}
{"id":"77dc9b82-076f-4320-bfd2-3996bde716ac","text":"In this paper, the authors propose a new method to learn hierarchical representations of sentences, based on reinforcement learning. They propose to learn a neural shift-reduce parser, such that the induced tree structures lead to good performance on a downstream task. They use reinforcement learning (more specifically, the policy gradient method REINFORCE) to learn their model. The reward of the algorithm is the evaluation metric of the downstream task. The authors compare two settings, (1) no structure information is given (hence, the only supervision comes from the downstream task) and (2) actions from an external parser is used as supervision to train the policy network, in addition to the supervision from the downstream task. The proposed approach is evaluated on four tasks: sentiment analysis, semantic relatedness, textual entailment and sentence generation.\r\n\r\nI like the idea of learning tree representations of text which are useful for a downstream task. The paper is clear and well written. However, I am not convinced by the experimental results presented in the paper..While the idea of learning tree structures for text representation is intriguing, the experimental results presented in the paper fail to adequately support the claims made by the authors. The comparison between the two settings, where no structure information is given and where actions from an external parser are used as supervision, is interesting. However, the authors do not provide enough evidence to demonstrate that the induced trees effectively improve performance on the downstream tasks. It is unclear whether the improvement can be attributed solely to the learned composition orders or if other factors contribute to the results. Additionally, the analysis of the induced trees falls short in providing a comprehensive understanding of their relationship to conventional English syntactic structures. The discovery of some linguistically intuitive structures, such as noun phrases and simple verb phrases, is promising. Nevertheless, a more in-depth analysis and comparison with established syntactic structures would strengthen the paper's argument. Overall, the paper is well-written and presents a novel approach, but it lacks sufficient empirical evidence and analysis to fully support its claims.","label":167}
{"id":"d6fa46ab-00a0-4eb4-8805-aadb823ee776","text":"It was slow to train the model because you had to build a different new computational graph for each sentence..However, despite the slow training process due to the need for building a new computational graph for each sentence, the paper demonstrates the advantage of using reinforcement learning to optimize tree structures for improving performance on downstream tasks. It goes beyond previous work on tree-structured models that rely on input or supervision from treebank annotations, and instead proposes learning task-specific composition orders. Moreover, the analysis of the induced trees reveals interesting findings, showing that they capture some linguistically intuitive structures, albeit different from conventional English syntactic structures.","label":20}
{"id":"77ce03ca-348e-4f65-a656-927f4858ca1c","text":"Dear authors,\r\n\r\nI've read the interesting paper and learned nice ideas.\r\n\r\nNow I've found somewhat incorrect mention in your paper.\r\nIn Introduction, it is Now I've found somewhat incorrect mention in your paper. In the Introduction, it is important to provide a clear overview of the problem being addressed and the motivation behind your approach. However, the mention seems to be misleading in the context of the rest of the paper. I suggest revisiting this section to ensure clarity and coherence with the subsequent sections.","label":21}
{"id":"4bcef431-7a52-4fb6-a371-a094acdb18a8","text":"In several of the evaluation, the \"constituency tree LSTM\" and\/or \"dependency tree LSTM\" methods perform much better than all of your proposed models, including the baseline models such as sequential encoders and recursive encoders based on treebank annotations. This suggests that employing reinforcement learning to optimize the tree structures leads to improved performance on downstream tasks, highlighting the potential of this approach in the field of natural language processing.","label":25}
{"id":"c57067b8-73cd-4d5e-984a-8748a77147bf","text":"Big fan of this work. Related to the above comment, have you guys tried a simple baseline where you just use supervised learning to predict the tree structures, rather than optimizing them through reinforcement learning? It would be interesting to see how the performance compares between the two approaches.","label":20}
{"id":"90238f11-a1d7-4ef5-9ac4-b4c2ec551b87","text":"Table 2 is missing some recent results on this task. Please see the NTI and NSE results on the same task [1,2]. NTI is particularly relevant to this work because it evaluates the performance of tree-structured models on the same task using the treebank annotations. Including these results would provide a comprehensive comparison and strengthen the analysis of the induced trees in this paper.","label":30}
{"id":"4203bb44-c96d-490a-9edb-8b5248b19239","text":"The paper attempts to combine Variational Auto-Encoders with the Stick-Breaking process. The motivation is to tackle the component collapsing and have a representation with stochastic dimensionality. To demonstrate the merit of their To demonstrate the merit of their approach, the authors perform several experiments comparing the SB-VAE with a Gaussian VAE. They evaluate the models on various datasets and tasks, including unsupervised and semi-supervised learning. The results show that the SB-VAE consistently outperforms the Gaussian VAE by achieving higher accuracy and better discriminative latent representations. These findings suggest that incorporating the Stick-Breaking process into the VAE framework enhances its ability to capture complex data distributions and generate informative latent representations.\r\n\r\nOne notable advantage of the SB-VAE is its ability to automatically determine the number of latent dimensions. The Stick-Breaking process allows for infinite possibilities, and the model can learn to allocate more or fewer dimensions to different latent factors based on the data. This flexibility is particularly useful in scenarios where the true dimensionality of the data is uncertain or may vary across instances. By dynamically adjusting the dimensionality of the latent space, the SB-VAE can adapt to different complexities without requiring manual tuning.\r\n\r\nThe experimental evaluation also includes a semi-supervised variant of the SB-VAE, where labeled data is leveraged to guide the learning process. The authors demonstrate that the semi-supervised SB-VAE achieves superior performance in tasks that involve both labeled and unlabeled data compared to its Gaussian VAE counterpart. This result suggests that the incorporation of Bayesian nonparametric techniques, such as the Stick-Breaking process, can effectively leverage labeled information to improve the quality of the learned representations and enhance the discriminative power of the model.\r\n\r\nOverall, the paper makes a solid contribution to the field of Bayesian nonparametric modeling in the context of variational autoencoders. The proposed SB-VAE successfully addresses the limitations of traditional VAEs, such as component collapsing and fixed dimensionality, by leveraging the Stick-Breaking process. The experimental results validate the effectiveness of the SB-VAE in learning highly discriminative latent representations and demonstrate its superiority over the Gaussian VAE. This work opens up exciting opportunities for further exploration and application of Bayesian nonparametric techniques in deep generative models.","label":32}
{"id":"f4e2a29f-6a05-46e9-9dc8-679a2365b560","text":"This paper will present an extension of Stochastic Gradient Variational Bayes that performs posterior inference for the weights of Stick-Breaking processes, enabling the definition of a Stick-Breaking Variational Autoencoder (SB-VAE). The SB-VAE is a Bayesian nonparametric version of the variational autoencoder with a latent representation having stochastic dimensionality. Experimental results demonstrate that the SB-VAE, along with its semi-supervised variant, learn highly discriminative latent representations often superior to those of Gaussian VAEs.","label":3}
{"id":"c7f6fa0b-956d-42fa-a583-03f1607f499f","text":"This paper presents an approach which modifies the variational auto-encoder (VAE) framework so as to use stochastic latent dimensionality. This is achieved by using an inherently infinite prior, the stick-breaking process. This is coupled with inference tailored to this model, specifically the Kumaraswamy distribution as an approximate variational posterior. The resulting model is named the SB-VAE which also has a semi-supervised extension, in similar vein to the original VAE paper.\r\n\r\nThere's a lot of interest in VAEs these days; many lines of work seek to achieve automatic \"black-box\" inference in these models. For example, the authors themselves mention parallel work by Blei's lab (also others) towards this direction. However, there's a lot of merit in investigating more bespoke solutions to new models, which is what the authors are doing in this paper. Indeed, a (useful) side-effect of providing efficient inference for the SB-VAE is drawing attention to the use of the Kumaraswamy distribution which hasn't been popular in ML.\r\n\r\nAlthough the paper is in general well-written and provides a clear motivation for the proposed approach, I believe there are several areas that could be improved to make the paper more impactful. Firstly, while the authors briefly mention the use of the Kumaraswamy distribution as an approximate variational posterior, more details about its advantages and limitations could be provided. This would help readers understand the choice of this particular distribution and evaluate its suitability for other applications. Additionally, the paper could benefit from a more comprehensive evaluation of the proposed SB-VAE model. While the authors mention that it often outperforms Gaussian VAEs in terms of discriminative latent representations, they should provide more detailed comparisons and analysis, such as on different datasets or in comparison to other state-of-the-art methods. This would further validate the effectiveness of the SB-VAE approach and provide readers with insights into its strengths and limitations. Furthermore, the authors could discuss potential applications and future directions for the SB-VAE model. Given the unique properties of the Stick-Breaking processes and the stochastic dimensionality of the latent representation, it would be interesting to explore how this model can be applied in various domains like anomaly detection or generative modeling. Overall, the paper presents a novel approach with promising results, but further work is needed to address the above-mentioned suggestions and enhance its impact.","label":163}
{"id":"d44593a2-04f5-497b-9088-366adbd3fa1e","text":"The paper attempts to combine Variational Auto-Encoders with the Stick-Breaking process. The motivation is to tackle the component collapsing and have a representation with stochastic dimensionality. To demonstrate the merit of their approach, the authors test this model on MNIST and SVHN in an unsupervised and semi-supervised fashion.\r\nAfter reading the paper in more detail, I find that the claim that the dimensionality of the latent variable is stochastic does not seem quite correct: all latent variables are \"used\" (which actually enable backpropagation) but the latent variables are parametrized differently (into $\\pi$) and the decoding process is altered as to give the impression of sparsity. The way all these latent variables are parametrized differently (into \u03c0) and the decoding process is altered as to give the impression of sparsity. The way all these latent variables are handled makes the model more flexible and able to capture complex data distributions. The authors provide a clear explanation of the model architecture and the variational inference method used to train it. They also provide experiments on two popular datasets, MNIST and SVHN, in both unsupervised and semi-supervised settings to evaluate the performance of the proposed SB-VAE. The experimental results demonstrate that the SB-VAE can learn highly discriminative latent representations that often outperform the Gaussian VAE. The evaluation metrics used to compare the models are appropriate and the results are presented in a clear and organized manner. However, there are a few points that could be improved. Firstly, the paper could benefit from a more detailed discussion on the theoretical motivation behind using the Stick-Breaking process in the VAE framework. Additionally, more details on the choice of hyperparameters and their impact on the model performance would be useful. Overall, the paper presents an interesting extension to VAEs and provides strong empirical evidence for the effectiveness of the proposed SB-VAE.","label":110}
{"id":"811c7600-23d0-42b0-9746-51aa6c12d50f","text":"Very Interesting work for extending VAE towards Bayesian non-parametric! \r\n\r\nI wonder what is the motivation of constraining \\pi The motivation behind constraining \u03c0 could be to control the complexity of the model and prevent arbitrary growth of the latent space. By using a stick-breaking process, the SB-VAE can have a flexible but controlled number of latent dimensions, allowing for efficient learning of highly discriminative latent representations.","label":18}
{"id":"2c419961-b334-472f-9fbd-f52799a34b3f","text":"Summary: This is the first work to investigate stick-breaking priors, and corresponding inference methods, for use in VAEs. The background material is explained clearly, as well as the explanation of the priors and posteriors and their DNCP forms. The paper is really well written.\r\n\r\nIn experiments, they find that stick-breaking priors does not generally improve upon spherically Gaussian priors in the completely unsupervised setting, when measured against two different metrics. However, in the semi-supervised setting, where a portion of the data is labeled, the stick-breaking priors show significant improvements in terms of discriminative power and generalization. The authors conduct thorough experiments on several benchmark datasets, comparing the performance of the SB-VAE with the Gaussian VAE and other state-of-the-art methods. The results clearly demonstrate the superiority of the SB-VAE, both in terms of achieving higher accuracy and producing more meaningful latent representations. The experiments are well-designed and the results are presented in a clear and concise manner. Additionally, the authors provide insightful discussions on the experimental results, highlighting the advantages and limitations of the proposed method. In conclusion, this paper makes a valuable contribution to the field of variational autoencoders by introducing the stick-breaking priors and demonstrating their effectiveness in learning discriminative latent representations.","label":65}
{"id":"9a17731c-082d-4847-b27f-78784cf828e7","text":"The paper attempts to combine Variational Auto-Encoders with the Stick-Breaking process. The motivation is to tackle the component collapsing and have a representation with stochastic dimensionality..The paper presents an interesting extension to Stochastic Gradient Variational Bayes by incorporating Stick-Breaking processes to perform posterior inference for the weights. The primary objective is to address the issue of component collapsing and introduce a latent representation with stochastic dimensionality. By combining the Stick-Breaking process with Variational Auto-Encoders (VAEs), the authors propose a novel framework called Stick-Breaking Variational Autoencoder (SB-VAE).The motivation behind this work is to overcome the limitations of Gaussian VAEs, specifically their inability to capture complex data distributions and generate diverse samples. The authors argue that by leveraging the Bayesian nonparametric properties of Stick-Breaking processes, the SB-VAE can learn highly discriminative latent representations, which in turn lead to improved performance compared to traditional Gaussian VAEs.To evaluate the effectiveness of the SB-VAE, the authors conduct extensive experiments and compare it to Gaussian VAEs on various benchmarks. The results clearly demonstrate that the SB-VAE, and its semi-supervised variant, consistently achieve superior performance in terms of both reconstruction quality and downstream tasks. The latent representations learned by the SB-VAE exhibit better separation between classes, indicating improved discriminative power.Furthermore, the authors provide insightful analysis and interpretations of the learned latent representations. They show that the SB-VAE is able to capture the underlying structure of the data by assigning different dimensions of the latent space to distinct sources of variation. This ability to disentangle factors of variation is a crucial advantage for tasks such as unsupervised clustering and semi-supervised learning.Overall, the paper successfully introduces the concept of Stick-Breaking processes in the context of VAEs, offering a promising solution to the problem of component collapsing and enabling the learning of stochastic latent representations. The experimental results clearly illustrate the advantages of SB-VAE over Gaussian VAEs, making a significant contribution to the field of generative models and Bayesian nonparametrics.","label":26}
{"id":"f7bdc174-2def-413e-96c8-af1f814ff010","text":"This paper will make a significant contribution to the field of variational autoencoders by extending Stochastic Gradient Variational Bayes to handle weights of Stick-Breaking processes. The proposed Stick-Breaking Variational Autoencoder (SB-VAE) showcases a Bayesian nonparametric approach with a latent representation of stochastic dimensionality. The experimental results indicate that SB-VAE, including its semi-supervised variant, exhibit highly discriminative latent representations that often outperform traditional Gaussian VAEs.","label":4}
{"id":"b9d06000-ea8e-46dd-b7d8-32f323cc23f5","text":"This paper presents an approach which modifies the variational auto-encoder (VAE) framework so as to use stochastic latent dimensionality. This is achieved by using an inherently infinite prior, the stick-breaking process. This is coupled with inference tailored to this model, specifically the Kumaraswamy distribution as an approximate variational posterior. The resulting model is named the SB-VAE which also has a semi-supervised extension, in similar vein to the original VAE paper.\r\n\r\nThere's a lot of interest in VAEs these days; many lines of work seek to achieve automatic \"black-box\" inference in these models. For example, the authors themselves mention parallel work by Blei's lab (also others) towards this direction. However, there's a lot of merit in investigating more bespoke solutions to new models, which is what the authors are doing in this paper. Indeed, a (useful) side-effect of providing efficient inference for the SB-VAE is drawing attention to the use of the Kumaraswamy distribution which hasn't been popular in ML.\r\n\r\nAlthough the paper is in general well structured, I found it confusing at parts. I think the major source of confusion comes from the fact that the model specification and model inference are discussed in a somehow mixed way. The authors could have clearly separated these two aspects, providing a section specifically dedicated to the model specification and another section solely focused on the inference procedure. Additionally, the notation used in the paper could be improved, as it is not always consistent and can be confusing. The results presented in the paper, however, are quite compelling. The experimental demonstration of the SB-VAE and its semi-supervised variant shows that they are able to learn highly discriminative latent representations that often outperform the Gaussian VAE. This is a significant contribution, as it highlights the potential of Bayesian nonparametric approaches in the VAE framework. The comparison with the Gaussian VAE serves as a strong baseline and clearly demonstrates the advantages of incorporating the stick-breaking process. Overall, the paper presents an interesting and novel approach in the field of VAEs. The extension of the VAE framework to include stochastic dimensionality through the use of the stick-breaking process is a valuable contribution. With some improvements in clarity and notation, this paper has the potential to make a significant impact in the community. I would recommend the authors to carefully address the confusion in model specification and inference and revise the paper accordingly.","label":195}
{"id":"ef1e43cf-be3a-48dc-a65c-52ea9f8510d9","text":"The paper attempts to combine Variational Auto-Encoders with the Stick-Breaking process. The motivation is to tackle the component collapsing and have a representation with stochastic dimensionality. To demonstrate the merit of their approach, the authors test this model on MNIST and SVHN in an unsupervised and semi-supervised fashion.\r\nAfter reading the paper in more detail, I find that the claim that the dimensionality of the latent variable is stochastic does not seem quite correct: all latent variables are \"used\" (which actually enable backpropagation) but the latent variables are parametrized differently (into $\\pi$) and the decoding process is altered as to give the impression of sparsity..However, it is important to note that the authors have convincingly shown that this altered parametrization and decoding process indeed leads to highly discriminative latent representations, as demonstrated in their experiments on MNIST and SVHN datasets. The authors compare the performance of the Stick-Breaking Variational Autoencoder (SB-VAE) with the Gaussian VAE and show that the SB-VAE often outperforms the Gaussian VAE in terms of the quality of the learned latent representations. The experimental results showcase the effectiveness of the proposed SB-VAE in addressing the component collapsing issue and obtaining a latent representation with better discriminative power. The use of the Stick-Breaking process in the VAE framework allows for a more flexible and expressive modeling of the latent space. This is further supported by the improved performance of the semi-supervised variant of the SB-VAE, which leverages both labeled and unlabeled data. The paper is well-written and presents a clear description of the proposed SB-VAE model and its motivation. The experimental setup is robust, and the obtained results are statistically sound. Overall, this paper contributes to the field of generative models by combining the strengths of Variational Autoencoders with the Stick-Breaking process to obtain more powerful and discriminative latent representations.","label":104}
{"id":"10189567-44ca-4d8c-a7eb-da1a9606e303","text":"Very Interesting work for extending VAE towards performing posterior inference for the weights of Stick-Breaking processes. The Stick-Breaking Variational Autoencoder (SB-VAE) introduced in this work extends the capabilities of the Gaussian VAE by allowing a latent representation with stochastic dimensionality. The experimental results demonstrate that the SB-VAE, along with its semi-supervised variant, can learn discriminative latent representations that often surpass the performance of Gaussian VAE's. Overall, this extension of VAE towards Stick-Breaking processes is highly interesting and promising in the field of Bayesian nonparametrics.","label":7}
{"id":"19ffaad9-cb45-44df-b0e7-70708ee8f75a","text":"Summary: This is the first work to investigate stick-breaking priors, and corresponding inference methods, for use in VAEs. The background material is explained clearly, as well as the explanation of the priors and posteriors and their DNCP forms. The paper is really well written.\r\n\r\nIn experiments, they find that stick-breaking priors does not generally improve the VAE performance compared to traditional Gaussian priors. However, they do find that the SB-VAE and its semi-supervised variant consistently learn highly discriminative latent representations that often outperform the Gaussian VAE's. This demonstrates the usefulness and potential of stick-breaking priors in the context of variational autoencoders. The experimental results are supported by a thorough evaluation, including both quantitative metrics and qualitative analysis of the latent spaces. The authors also provide insightful discussions on the implications of their findings, highlighting the advantages and limitations of the SB-VAE approach. In addition, the paper is well-organized, with clear explanations of the proposed methods, derivations, and implementation details. The authors also compare their approach with related works, providing a comprehensive review of the literature. Overall, this paper makes a significant contribution to the field of VAEs by introducing stick-breaking priors and demonstrating their effectiveness in learning discriminative latent representations.","label":53}
{"id":"7d68d265-0ed4-404d-8731-b0de51252709","text":"The authors propose a Gated Muiltimodal Unit to combine multi-modal information (visual and textual). They also collect a large dataset of movie summers and posters..The authors' proposal of a Gated Multimodal Unit for combining visual and textual information is innovative and important for multimodal learning. Additionally, their effort to collect a large dataset of movie summaries and posters enhances the significance of their work. By addressing the limitations of single-modality approaches and outperforming other fusion strategies, including mixture of experts models, the Gated Multimodal Unit (GMU) proves to be effective in improving the macro f-score performance for genre classification. Overall, this paper presents a valuable contribution to the field of multimodal learning and provides a new benchmark dataset, MM-IMDb, for genre prediction on movies.","label":25}
{"id":"9f7de717-be97-4f78-96b8-0e2056b38977","text":"We have added a new version which includes the mixture of experts evaluation. Since this is a multilabel scenario, we implement tied and untied gates for the outputs.. The results of this evaluation will further enhance the understanding of how the GMU model compares to other fusion strategies, providing valuable insights into its effectiveness and potential applications in multimodal learning tasks.","label":28}
{"id":"7c7a0573-a18b-4d0a-a0d2-75da6725e9d8","text":"Paper proposes Gated Muiltimodal Unit, a building block for connectionist models capable of handling multiple modalities.\r\n\r\n(Figure 2) The bimodal case returns weighted activation by gains of gating units, do you do anything special to keep multi-modal case weighted as well?.In the paper, the authors proposed a novel model called Gated Multimodal Unit (GMU) for multimodal learning. The GMU is designed to be used as an internal unit in a neural network architecture, with the aim of finding an intermediate representation by combining data from different modalities. The authors introduced multiplicative gates within the GMU to determine the influence of each modality on the activation of the unit. This approach allows the model to effectively learn how modalities interact and influence the output.One interesting aspect of the paper is the evaluation of the GMU on a multilabel scenario for genre classification of movies, using both plot and poster data. The results showed that the GMU was able to improve the macro F-score performance compared to single-modality approaches. Furthermore, the GMU outperformed other fusion strategies, including mixture of experts models, indicating its effectiveness in leveraging multimodal information for genre prediction.In addition to the proposed model, the authors released the MM-IMDb dataset, which is currently the largest publicly available multimodal dataset for genre prediction on movies. This dataset will undoubtedly serve as a valuable resource for researchers working on multimodal learning.However, in Figure 2 of the paper, the authors only show the weighted activation for the bimodal case. It would be beneficial if they could provide more details or clarification on how the weighting is handled in the multi-modal case as well. Understanding the weighting mechanism for multiple modalities would enhance the interpretation of the results and provide a clearer understanding of the GMU model.Overall, the paper presents a promising model for multimodal learning and provides valuable insights into the fusion of information from different modalities. The experimental results and the release of the MM-IMDb dataset contribute significantly to the research community. With some additional clarification on the weighting mechanism for the multi-modal case, the paper would be even more comprehensive and impactful.","label":40}
{"id":"30e01bdf-2864-4033-bae4-85caf1b26846","text":"The paper introduces Gated Multimodal Units GMUs, which use multiplicative weights to select the degree to which a hidden unit will consider different modalities in determining its activation. The paper also introduces a new dataset, \"Multimodal IMDb,\" consisting of over 25k movie summaries, with their posters, and labeled genres.\r\n\r\nGMUs are related to \"mixture of experts\" in that different examples will be classified by different parts of the model, (but rather than routing\/gating entire examples, individual hidden units are gated.GMUs are related to \"mixture of experts\" in that different examples will be classified by different parts of the model, (but rather than routing\/gating entire examples, individual hidden units are gated based on the modalities). The paper proposes a new model, the Gated Multimodal Unit (GMU), which leverages the power of gated neural networks to learn how different modalities influence the activation of the unit. By using multiplicative gates, the model is able to determine the degree to which each modality contributes to the overall representation. The experimental evaluation conducted on the MM-IMDb dataset, the largest publicly available multimodal dataset for genre prediction on movies, demonstrates the effectiveness of the GMU. It achieves improved macro f-score performance compared to single-modality approaches and outperforms other fusion strategies, including mixture of experts models. The introduction of the MM-IMDb dataset is a valuable contribution to the research community as it provides a rich resource for further exploration of multimodal learning in the context of genre classification for movies.","label":80}
{"id":"9487ab59-158c-4813-8ac0-be94d6de8727","text":"This paper proposed The Gated Multimodal Unit (GMU) model for information fusion. The GMU learns to decide how modalities influence the activation of the unit using multiplicative gates. The paper collected a large genre dataset from IMDB and showed that GMU gets good performance.\r\n\r\nThe proposed approach seems quite interesting, and the audience may expect it can be used in general scenarios beyond movie genre prediction. So it is quite straightforward that the paper should test the algorithm on more diverse datasets to validate its generalizability. Additionally, it would also be interesting to compare the performance of the GMU model with other state-of-the-art multimodal fusion techniques on different tasks such as sentiment analysis, object recognition, or speech recognition. This would provide further evidence of the effectiveness of the GMU model and its potential for application in various domains.\r\n\r\nAnother aspect that could be explored in future work is the interpretability of the GMU model. While the paper mentions that the GMU learns to decide how modalities influence the activation of the unit using multiplicative gates, it would be beneficial to gain more insights into what the model is actually learning and how it is making decisions. This could be achieved by conducting experiments to visualize the learned representations and analyzing the impact of individual modalities on the final predictions.\r\n\r\nOverall, this paper presents a promising model for multimodal learning with the Gated Multimodal Unit (GMU). The experimental results on the MM-IMDb dataset demonstrate the superiority of the GMU over single-modality approaches and other fusion strategies. The release of the dataset is also a valuable contribution to the research community. With some further investigation and improvements, the GMU model has the potential to be a strong tool for information fusion in a wide range of domains.","label":77}
{"id":"8c8b4252-9cae-4d64-9d44-175d18889325","text":"Following your comments, we've added a new revision which includes:\r\n - More details on parameter exploration and training procedure.\r\n - More details on the experimental setup, including parameter exploration and training procedures, have been added in the revised version of the paper. ","label":22}
{"id":"95380d9d-f833-4a0b-8d1c-facc70abdc73","text":"The authors propose a Gated Muiltimodal Unit to combine multi-modal information (visual and textual). They also collect a large dataset of movie summers and posters. Overall, the reviewers were impressed with the proposed model and its ability to improve the macro f-score performance of single-modality approaches. The use of multiplicative gates in the Gated Multimodal Unit (GMU) to determine how modalities influence unit activation is a novel and effective approach. Additionally, the release of the MM-IMDb dataset is a valuable contribution to the field, as it provides a large and publicly available resource for genre prediction on movies. Overall, this paper presents a significant advancement in multimodal learning and offers practical applications for genre classification of movies.","label":29}
{"id":"4c8c9e46-bf27-48c5-85eb-0f9c52feab2c","text":"We have added a new version which includes the mixture of experts evaluation. Since this is a multilabel scenario, we implement tied and untied gates for the outputs.. The inclusion of mixture of experts models allows for a more comprehensive evaluation of the GMU model's performance. This enhancement further showcases the versatility and effectiveness of the GMU in multimodal learning tasks.","label":28}
{"id":"df4d61ff-c63a-44f9-b5c5-1290aabf52e2","text":"Paper proposes Gated Muiltimodal Unit, a building block for connectionist models capable of handling multiple modalities.\r\n\r\n(Figure 2) The bimodal case returns weighted activation by gains of gating units, do you do anything special to keep multi-modal case weighted as well? I.e. how the equation for h in Regarding Figure 2, the paper discusses the bimodal case where the weighted activation is obtained by the gains of gating units. However, it does not explicitly mention how the multi-modal case is weighted. It would be beneficial for the authors to provide more details on this matter in order to clarify any potential concerns. It would also be interesting to know if the weighting scheme differs between the bimodal and multi-modal cases, and if so, what factors are taken into consideration to determine the appropriate weights for each modality. Additionally, the authors may consider discussing any potential limitations or challenges associated with this weighting process, as it could impact the overall performance of the GMU model. Providing further insights and explanations in this regard would enhance the clarity and comprehensiveness of the paper.\r\n\r\nOverall, the presented work is highly valuable in the field of multimodal learning. The GMU model offers a novel approach for information fusion from different modalities, and the evaluation results on the MM-IMDb dataset demonstrate its effectiveness, outperforming other fusion strategies. The release of the dataset itself is a significant contribution to the research community, as it enables future studies and benchmarking on genre prediction in movies. To further strengthen the paper, I would suggest addressing the mentioned concern regarding the weighting scheme for multi-modal cases, as well as providing additional experimental results or analysis to support the claimed improvements in performance. With these revisions, the paper would be even more impactful and useful for researchers in the field of multimodal learning and information fusion.","label":47}
{"id":"a84d56e5-e5fd-4313-813a-dd8155183fc8","text":"The paper introduces Gated Multimodal Units GMUs, which use multiplicative weights to select the degree to which a hidden unit will consider different modalities in determining its activation. The paper also introduces a new dataset, \"Multimodal IMDb,\" consisting of over 25k movie summaries, with their posters, and labeled genres.\r\n\r\nGMUs are related to \"mixture of experts\" in that they both involve multiple sub-models or modalities that contribute to the final prediction. However, unlike mixture of experts models, GMUs use multiplicative gates to determine the influence of each modality on the unit's activation. This novel approach allows GMUs to dynamically adapt the contribution of each modality based on the input data, leading to improved performance in multimodal learning tasks. The experimental evaluation presented in the paper focuses on genre classification of movies using both textual plot summaries and posters as modalities. The results demonstrate that the GMU model outperforms single-modality approaches in terms of macro f-score performance. Furthermore, it also outperforms other fusion strategies, including mixture of experts models. This indicates the effectiveness of the proposed gated neural network architecture in capturing the complementary information present in different modalities. Additionally, the paper introduces the MM-IMDb dataset, which is the largest publicly available multimodal dataset for genre prediction on movies. This dataset provides a valuable resource for future research in multimodal learning and enables further investigation of the proposed GMU model. Overall, the paper contributes a novel approach, promising experimental results, and a valuable dataset, making it a significant contribution to the field of multimodal learning.","label":57}
{"id":"d3a5ab09-1fc7-41a6-b6e3-cacf7802c483","text":"This paper proposed The Gated Multimodal Unit (GMU) model for information fusion. The GMU learns to decide how modalities influence the activation of the unit using multiplicative gates. The paper collected a large genre dataset from IMDB and showed that GMU gets good performance in terms of macro f-score compared to single-modality approaches and other fusion strategies like mixture of experts models. The proposed model is based on gated neural networks and serves as an internal unit in a larger neural network architecture. By combining data from different modalities, the GMU is able to find an intermediate representation that enhances genre classification of movies. The evaluation was conducted on the MM-IMDb dataset, which is the largest publicly available multimodal dataset for genre prediction on movies. This dataset, along with the paper, contributes to the field of multimodal learning and provides valuable resources for further research. One strength of the GMU model is its ability to learn the influence of each modality on the activation of the unit through the use of multiplicative gates. This adaptive gating mechanism allows the GMU to effectively capture the relationships and interactions between different modalities, leading to improved performance. Overall, the paper presents a well-designed and promising approach to multimodal learning. The experimental results demonstrate the effectiveness of the GMU model and its superiority over other fusion strategies. Additionally, the release of the MM-IMDb dataset is a significant contribution to the research community, as it enables further exploration and comparison of different multimodal learning techniques for genre prediction on movies. However, there are a few areas that could be further clarified or expanded upon in the paper. For example, more details about the neural network architecture in which the GMU is implemented would provide a better understanding of how the model fits into the overall framework. Additionally, a more thorough discussion of the limitations and potential future directions of the proposed approach would be beneficial. Nevertheless, the overall quality of the paper is commendable, and the GMU model holds great potential for advancing multimodal learning in various domains.","label":43}
{"id":"de1fb381-9979-49e4-b1c3-85c0e31db2f3","text":"Following your comments, we've added a new revision which includes:\r\n - More recent related work on multimodal learning approaches that have achieved state-of-the-art results in various tasks. We believe that the addition of this information will strengthen the paper and provide a more comprehensive analysis of the proposed model's performance. Additionally, we have included additional details on the evaluation metrics used and the significance of the improved macro f-score performance. Overall, these revisions should address the concerns raised by the reviewers and enhance the clarity and impact of the paper.","label":13}
{"id":"cf21a140-07f5-4750-9503-2bf75648a491","text":"Some of the key details in this paper are very poorly explained or not even explained at all. The model sounds interesting and there may be something good here, but it should not be published in it's current form. \r\n\r\nSpecific comments:\r\n\r\nThe description of the R_l,pi convolutions in Section 2.1 was unclear. Specifically, I wasn't confident that I understood what the labels pi represented.\r\n\r\nThe description of the SAEN structure in section 2.2 was worded poorly. My understanding, based on Equation 1, is that the 'shift' operation is simply a summation of the representations of the member objects, and that the 'aggregate' operation simply concatenates the representations from multiple relations..However, the paper needs to provide more detailed explanations and examples to clarify these operations. The lack of clarity makes it difficult to fully understand the SAEN architecture and its components. Additionally, the algorithm for domain compression mentioned in the abstract could benefit from a more thorough explanation. It is unclear how the symmetries in hierarchical decompositions are leveraged to reduce memory usage and achieve significant speedups. Without a clear understanding of this algorithm, it is challenging to assess its effectiveness. Therefore, I recommend that the authors revise the paper to provide clearer explanations, examples, and additional details about the SAEN structure and the domain compression algorithm. This will greatly enhance the readability and comprehensibility of the paper, allowing readers to better evaluate the proposed methods and their empirical evaluation on real-world social network datasets.","label":108}
{"id":"5d1de7ea-17e1-4bfe-be65-c4a59d72a704","text":"The authors present a novel architecture, called Shift Aggregate Extract Network (SAEN), for learning representations on social network data. SAEN decomposes input graphs into hierarchies made of multiple strata of objects. Vector representations of each object are learned by applying 'shift', 'aggregate' and 'extract' operations on the vector representations of its parts. This approach allows for the efficient learning of complex relationships and dependencies within social network data. One particularly innovative aspect of the proposed SAEN architecture is the algorithm for domain compression, which utilizes the symmetries in hierarchical decompositions to reduce memory usage and achieve significant speedups. The authors provide empirical evaluations on real-world social network datasets, indicating that SAEN outperforms the current state of the art in terms of performance and accuracy. The results suggest that SAEN could be a valuable tool for various applications in social network analysis. However, it would be beneficial for the authors to provide more details on the specific datasets used and the metrics used for evaluation. Furthermore, it would be interesting to see a more comprehensive comparison with existing methods to better understand the advantages and limitations of SAEN.","label":22}
{"id":"1e2fc8ad-f923-4ac6-9500-027e7bb393c3","text":"the paper proposed a method mainly for graph classification.. known as the Shift Aggregate Extract Network (SAEN). SAEN introduces an innovative architecture for learning representations on social network data through the decomposition of input graphs into hierarchies. This decomposition involves multiple strata of objects, wherein vector representations of each object are learned using the 'shift', 'aggregate', and 'extract' operations on the vector representations of its parts. Additionally, the authors propose an algorithm for domain compression that capitalizes on symmetries in hierarchical decompositions to minimize memory usage and achieve significant speed improvements. To demonstrate the effectiveness of their approach, the researchers conduct an empirical evaluation on real-world social network datasets. The results indicate that SAEN outperforms the current state of the art methods, validating the efficacy of the proposed architecture for graph classification tasks.","label":9}
{"id":"9fad2171-7ab1-4218-9601-6df47acbfa41","text":"A) we rewrote section 2:\r\n A1) we generally improved the wording,\r\n A2) we added a figure in order to exemplify H-hierarchical decompositions,\r\n A3) we improved the explanation of the \\pi labels (now called \"membership types\"),\r\n A4) we rewrote and A5) we expanded the explanation of the shift operation to make it clearer how it updates vector representations based on the relationship between objects. B) In section 4, we added a detailed description of our domain compression algorithm, highlighting its advantages in reducing memory usage and improving speed. C) We conducted extensive experiments on various real-world social network datasets to evaluate the performance of our method. Our results demonstrated significant improvements over the current state of the art. D) Overall, these revisions address the previous concerns and enhance the clarity and effectiveness of the paper.","label":51}
{"id":"3986be70-da65-440a-8f39-4f6347a3776d","text":"The paper contributes to recent work investigating how neural networks can be used on graph-structured data. As far as I can tell, the proposed approach is the following:\r\n\r\n 1. Construct a hierarchical set of \"objects\" within the graph. Each object consists of multiple \"parts\" from the set of objects in the level below. There are potentially different ways a part can be part of an object (the different \\pi labels), which I would maybe call \"membership types\". In the experiments, the objects at the hierarchical level are formed by randomly selecting parts from the objects in the previous level. This method of constructing objects allows for flexibility in representing different structures and patterns within the graph. \r\n\r\nThe authors introduce three operations, 'shift', 'aggregate', and 'extract', to learn vector representations for each object. The 'shift' operation measures the influence of each part on the others within the object, providing a way to capture relationships and dependencies. The 'aggregate' operation combines the shifted representations of the parts to create a holistic representation for the object, enabling higher-level reasoning. Finally, the 'extract' operation extracts relevant features from the aggregated representation, enhancing the discriminative power of the model. The combination of these operations forms the SAEN architecture, which is tailored specifically for learning on social network data.\r\n\r\nOne interesting aspect of the proposed approach is the algorithm for domain compression, which exploits symmetries in hierarchical decompositions. This compression technique reduces memory usage and improves computational efficiency, making the SAEN model more scalable and practical for large-scale social network datasets. The authors provide empirical evaluations on real-world social network datasets, comparing their approach to the current state-of-the-art methods. The results show that the SAEN model outperforms existing approaches in terms of both accuracy and efficiency.\r\n\r\nOverall, the paper presents a novel architecture, SAEN, for learning representations on social network data. The hierarchical decomposition of the input graphs combined with the shift, aggregate, and extract operations allows for capturing complex relationships and patterns within the network. The algorithm for domain compression further enhances the scalability of the model. The empirical evaluations demonstrate the superiority of the proposed approach over existing methods. However, there are a few aspects that could be further clarified in the paper. For example, more details on the selection criteria and trade-offs for the different types of shifts and aggregates would enhance the understanding of the model. Additionally, a more comprehensive discussion on the interpretability of the learned representations and the meaning of the extracted features would be beneficial.\r\n\r\nIn conclusion, the Shift Aggregate Extract Network (SAEN) presents a promising approach to representation learning on social network data. The combination of hierarchical decomposition, shift, aggregate, and extract operations, along with the domain compression algorithm, offers a powerful framework for analyzing and extracting meaningful information from complex network structures. The experimental results provide evidence of the effectiveness and efficiency of the SAEN model in capturing graph patterns and outperforming existing methods. 
With some additional clarifications and discussions, this work has the potential to make significant contributions in the field of graph neural networks and social network analysis.","label":85}
{"id":"1fcbb5eb-42e0-4cfa-b543-79b44242351c","text":"Some of the key details in this paper are very poorly explained or not even explained at all. The model sounds interesting and there may be something good here, but it should not be published in it's current form. \r\n\r\nSpecific comments:\r\n\r\nThe description of the R_l,pi convolutions in Section 2.1 was unclear. Specifically, I wasn't confident that I understood what the labels pi represented.\r\n\r\nThe description of the SAEN structure in section 2.2 was worded poorly..The description of the R_l,pi convolutions in Section 2.1 was unclear. Specifically, I wasn't confident that I understood what the labels pi represented. More clarification and examples should be provided to enhance the understanding of this convolution operation. Additionally, the description of the SAEN structure in section 2.2 was worded poorly, making it difficult to grasp the overall architecture. It should be rewritten with clearer and more concise language, providing a step-by-step explanation of the operations involved in 'shift', 'aggregate', and 'extract'. Moreover, including visual illustrations or diagrams could greatly aid in comprehending the SAEN structure. Overall, improving the explanations of these key details would greatly enhance the quality and clarity of the paper. Additionally, an extended explanation of the proposed algorithm for domain compression would be beneficial. Although the paper claims significant speedups and memory reduction, further details and experimental results should be provided to validate these claims and illustrate the algorithm's effectiveness.","label":74}
{"id":"bf61ab9a-4ebc-433f-a6ee-fd7375e8e12b","text":"Some of the key details in this paper are very poorly explained or not even explained at all. The model sounds interesting and there may be something good here, but it should not be published in it's current form. \r\n\r\nSpecific comments:\r\n\r\nThe description of the R_l,pi convolutions in Section 2.1 was unclear. Specifically, I wasn't confident that I understood what the labels pi represented.\r\n\r\nThe description of the SAEN structure in section 2.2 was worded poorly. My understanding, based on Equation 1, is that the 'shift' operation is simply a summation of the representations of the member objects, and the 'aggregate' operation then applies a transformation to the summed representation, and the 'extract' operation selects a subset of dimensions from the aggregated representation. However, this explanation was not clear in the paper and could benefit from further elaboration or examples. Additionally, the proposed algorithm for domain compression was mentioned but not explained in detail. It would be helpful to have more information on how symmetries in hierarchical decompositions are leveraged and how exactly the memory usage reduction and speedups are achieved. Without this information, it is difficult to fully evaluate the effectiveness and novelty of the proposed method. Overall, while the concept of the Shift Aggregate Extract Network (SAEN) is intriguing and the empirical evaluation on real-world social network datasets is promising, the lack of clarity and detail in the paper hinders its potential impact. The authors should address these issues and provide more comprehensive explanations to make this paper suitable for publication.","label":97}
{"id":"10309bca-4db2-49ea-889d-536ddc0e125a","text":"The authors present a novel architecture, called Shift Aggregate Extract Network (SAEN), for learning representations on social network data..The SAEN architecture proposed by the authors is an innovative approach for learning representations on social network data. By decomposing input graphs into hierarchies of objects and applying shift, aggregate, and extract operations, vector representations of each object are efficiently learned. This methodology allows for a structured representation that captures both local and global information within the social network data. Moreover, the authors introduce an algorithm for domain compression that exploits symmetries in hierarchical decompositions, resulting in reduced memory usage and significant speed improvements. The empirical evaluation conducted on real-world social network datasets demonstrates the superior performance of the proposed method compared to the current state of the art in representation learning. This research not only contributes to the field of social network analysis but also has potential applications in various domains such as recommendation systems and community detection. However, further discussion on the limitations and potential extensions of the SAEN architecture would enhance the paper. Overall, this study presents a promising advancement in representation learning for social network data and opens up avenues for future research.","label":19}
{"id":"9e219e6e-1d04-4902-a781-fcfefe0dae28","text":"the paper proposed a method mainly for graph classification.. called the Shift Aggregate Extract Network (SAEN). SAEN aims to learn representations on social network data by decomposing input graphs into hierarchies comprised of multiple strata of objects. It achieves this by applying 'shift', 'aggregate', and 'extract' operations on the vector representations of the parts of each object. Additionally, the paper introduces an algorithm for domain compression that utilizes symmetries in hierarchical decompositions to reduce memory usage and yield significant speedups. The authors evaluate their method empirically on various real-world social network datasets, and their results show that SAEN outperforms the current state-of-the-art techniques in terms of graph classification performance. Overall, the paper presents a novel approach that effectively learns representations on social network data and offers promising results in terms of both memory efficiency and classification accuracy.","label":9}
{"id":"b91d73d8-7e99-4e48-95e0-c28bbf75e25e","text":".The Shift Aggregate Extract Network (SAEN) paper proposes an interesting architecture for learning representations on social network data. SAEN decomposes input graphs into hierarchical structures consisting of multiple strata of objects. This decomposition allows for the learning of vector representations for each object through the application of 'shift', 'aggregate', and 'extract' operations on the vector representations of its constituent parts. The authors also present an algorithm for domain compression, which takes advantage of symmetries in hierarchical decompositions to reduce memory usage and achieve significant speed improvements. The empirical evaluation of their method on real-world social network datasets demonstrates its superiority over the current state of the art. Overall, the paper introduces valuable contributions in understanding and exploiting the hierarchical nature of social network data, and its results have important implications for various applications in the field.","label":0}
{"id":"91e2caf5-a7eb-4eaf-a1db-51ff85739d6f","text":"The paper contributes to recent work investigating how neural networks can be used on graph-structured data. As far as I can tell, the proposed approach is the following:\r\n\r\n 1. Construct a hierarchical set of \"objects\" within the graph. Each object consists of multiple \"parts\" from the set of objects in the level below. There are potentially different ways a part can be part of an object (the different \\pi labels), which I would maybe call \"membership types\"..2. Construct a hierarchical set of \"objects\" within the graph. Each object consists of multiple \"parts\" from the set of objects in the level below. There are potentially different ways a part can be part of an object (the different \\pi labels), which I would maybe call \"membership types\". The paper introduces the Shift Aggregate Extract Network (SAEN) architecture, which aims to learn representations on social network data through decomposing input graphs into hierarchies of objects. This approach builds upon recent research that explores the application of neural networks in modeling graph-structured data.\r\n\r\nThe key idea of SAEN is to assign vector representations to each object in the graph by applying three main operations: shift, aggregate, and extract. The shift operation enables connections between objects at different levels of the hierarchy, allowing information to propagate across the network. The aggregate operation combines the vector representations of an object's parts into a single representation for the object itself, capturing both local and global context. Finally, the extract operation extracts meaningful features from the object representation.\r\n\r\nOne interesting contribution of the paper is the proposed algorithm for domain compression, which takes advantage of symmetries in hierarchical decompositions to reduce memory usage and achieve significant speed improvements. This compression technique is especially relevant for large-scale social network datasets, where memory and computational efficiency are critical.\r\n\r\nThe empirical evaluation of SAEN on real-world social network datasets demonstrates its superior performance compared to the current state-of-the-art methods. The experiments show that SAEN not only achieves higher accuracy in various prediction tasks, such as link prediction and node classification, but it also exhibits robustness and scalability when applied to large-scale networks.\r\n\r\nIn summary, the paper presents the Shift Aggregate Extract Network (SAEN) architecture for learning representations on social network data. The proposed approach leverages hierarchical decompositions and incorporates shift, aggregate, and extract operations to capture the structural and contextual information present in the network. The algorithm for domain compression further enhances the efficiency of SAEN with significant memory and speed improvements. Overall, the empirical evaluation demonstrates the superiority of SAEN over existing methods, making it a valuable contribution to the field of graph-structured data analysis.","label":78}
{"id":"98dfd6b5-ccec-40ef-82b6-a35c50cef63e","text":"Some of the key details in this paper are very poorly explained or not even explained at all. The model sounds interesting and there may be something good here, but it should not be published in it's current form. \r\n\r\nSpecific comments:\r\n\r\nThe description of the R_l,pi convolutions in Section 2.1 is not clear and lacks sufficient explanation. It is crucial to provide a detailed description of the R_l,pi convolutions in order for readers to understand the methodology behind the 'shift', 'aggregate', and 'extract' operations. Additionally, the algorithm for domain compression mentioned in the abstract is briefly mentioned but not elaborated on in the paper. It is important to provide a clear explanation of how this compression algorithm works, as it is claimed to significantly reduce memory usage and improve speed. Without a detailed explanation, it is difficult to assess the effectiveness of this method. Furthermore, while the paper mentions empirical evaluation on real-world social network datasets and claims to outperform the current state of the art, there is a lack of information regarding the specific experiments conducted and the results obtained. It would greatly strengthen the paper if the authors provided more detail on the datasets used, the experimental setup, and the performance metrics used for evaluation. Overall, this paper has potential, but it requires significant revisions and additional information to adequately explain and support the proposed approach.","label":49}
{"id":"bbd41ebb-d28c-4957-b355-18b23423f141","text":"While I understand the difficulty of collecting audio data from animals, I think this type of feature engineering does not go in the right direction. I would rather see a model than learns from the raw audio data directly, without the need for an intermediate representation like the Chirplet Transform. Additionally, the paper lacks a thorough comparison with other state-of-the-art methods and does not provide enough details about the experimental setup and results. Overall, I believe more work needs to be done to validate the effectiveness and generalizability of the Fast Chirplet Transform as a preprocessing step for CNN machine listening.","label":33}
{"id":"36b8b2a7-7ea2-4857-b6b1-8b360c61d6c1","text":"This paper studies efficient signal representations to perform bioacoustic classification based on CNNs. Contrary to image classification, where most useful information can be extracted with spatially localized kernels, bioacoustic signatures are more localized in the frequency domain, requiring to rethink the design of convolutional architectures. The authors propose to enforce the lower layers of the architecture with chirplet transforms, which are localized in the time-frequency plane as wavelets, but with time-varying central frequency. They present a Fast Chirplet Transform (FCT) algorithm and evaluate its efficiency on large environmental datasets such as Orca recordings and the LifeClef challenge dataset containing 1000 Birds species. The results are promising, showing that FCT significantly reduces the training duration for birds and vowels classification tasks, with a reduction of -28% and -26% respectively. Additionally, the scores are improved with FCT pretraining, yielding a relative gain of +7.8% in Mean Average Precision for birds classification and +2.3% in vowel accuracy compared to raw audio CNN. These findings demonstrate the effectiveness of using FCT as a preprocessing step to accelerate CNN training and improve classification performance. The authors also discuss the potential applications of FCT in tonotopic deep machine listening and inter-species bioacoustic transfer learning to enhance the representation of animal communication systems. Overall, the paper presents valuable insights into the use of Chirplet kernel and FCT for enhancing CNN machine listening in bioacoustic classification tasks.","label":75}
{"id":"5ae31620-5622-4324-9c75-490227f13ce7","text":"While I understand the difficulty of collecting audio data from animals, I think this type of feature engineering does not go in the right direction. I would rather see a model than learns the feature representation from data. I think that relying on a Chirplet kernel as a pretraining method for CNN is not the most effective approach. Although there may be computational efficiency gains and some improvement in classification performance, it may not generalize well to other datasets or tasks. Instead, I would suggest exploring more advanced techniques such as transfer learning or designing neural network architectures specifically for bioacoustic data. This could lead to more robust and accurate models for animal calls and speech recognition.","label":40}
{"id":"b91cc22f-ddc8-493f-a5c6-f535758aeb50","text":"Pros: \r\n- Introduction of a nice filter banks and its implementation\r\n- Good numerical results\r\n- Refinement of the representation via back propagation, and a demonstration that it speeds up learning\r\n\r\nCons:\r\n- The algorithms (section 3.1) are not necessary, but rather it would be more effective to focus on providing a clear explanation of the Fast Chirplet Transform (FCT) algorithm and its implementation. Additionally, it would be helpful to provide more details on the experimental setup and methodology used for validation, as well as discussing potential limitations or future directions for the research.","label":36}
{"id":"62cf5506-bb4f-429a-911c-fef9a778e9f6","text":"The authors advocate use of chirplets as a basis for modeling audio signals. They introduce a fast chiplet transform for efficient computation. Also introduced is the idea of initializing (pre-training) CNN layers to mimic chirplet transform of audio signal (similar to ideas proposed by Mallet et al. on scattering.framework). In this paper, the authors validate the proposed Fast Chirplet Transform (FCT) on animal calls and speech. They demonstrate the computation efficiency of FCT on large environmental databases, including months of Orca recordings and 1000 bird species from the LifeClef challenge. Furthermore, they evaluate FCT on the vowels subset of the Speech TIMIT dataset and show that it accelerates CNN and improves classification performance. The results indicate a reduction in training duration by 28% for bird classification and 26% for vowels classification, along with relative gains of 7.8% in Mean Average Precision for bird classification and 2.3% in vowel accuracy compared to raw audio CNN. The authors conclude by discussing the potential of tonotopic FCT deep machine listening and inter-species bioacoustic transfer learning to generalize the representation of animal communication systems.","label":50}
{"id":"fd4262ff-b722-4aae-a852-5b722a2d6b55","text":"While I understand the difficulty of collecting audio data from animals, I appreciate the efforts made by the authors in collecting a large environmental database for their study. This database includes months of Orca recordings and 1000 bird species from the LifeClef challenge. The extensive data set adds credibility and reliability to their research findings. Furthermore, the authors demonstrate the computation efficiency of their Fast Chirplet Transform (FCT) algorithm on this large database, showcasing its potential for real-world applications. Overall, the inclusion of such a diverse and extensive dataset strengthens the validity and generalizability of the proposed method.","label":12}
{"id":"a82478a5-b2fb-4baf-8183-daeb389df5f8","text":"This paper studies efficient signal representations to perform bioacoustic classification based on CNNs. Contrary to image classification, where most useful information can be extracted with spatially localized kernels, bioacoustic signatures are more localized in the frequency domain, requiring to rethink the design of convolutional architectures. The authors propose to enforce the lower layers of the architecture with chirplet transforms, which are localized in the time-frequency plane as wavelets, in order to capture the temporal and spectral characteristics of the bioacoustic signals. The paper presents the Fast Chirplet Transform (FCT), which offers an efficient algorithm for computing the chirplet transform. The authors demonstrate the effectiveness of FCT on a large environmental database, including Orca recordings and bird species from the LifeClef challenge. Additionally, the FCT is validated on the vowels subset of the Speech TIMIT dataset. The results reveal that the FCT significantly accelerates the training process of CNNs, reducing the duration by -28% for bird classification and -26% for vowel classification. Furthermore, the scores are improved with FCT pretraining, showing a relative gain of +7.8% in Mean Average Precision for birds and +2.3% in vowel accuracy compared to raw audio CNNs. This paper opens up new possibilities for tonotopic FCT deep machine listening and inter-species bioacoustic transfer learning, enabling more robust and generalized representations for animal communication systems.","label":68}
{"id":"3a0265ba-545d-4e22-8e25-81cf258208df","text":"While I understand the difficulty of collecting audio data from animals, I think this type of feature engineering does not go in the right direction. I would caution against relying solely on pretraining with a Chirplet kernel. While it may provide some advantages in terms of computation efficiency and training duration, it is important to consider the limitations and potential biases introduced by this approach. Additionally, the paper should provide more details on the experimental setup, including the architecture of the CNN used, the selection criteria for the datasets, and the evaluation metrics employed. Furthermore, a comparison with other state-of-the-art methods would strengthen the validation and demonstrate the effectiveness of the proposed Fast Chirplet Transform. Overall, more analysis and discussion are needed to fully assess the impact and generalizability of this approach.","label":27}
{"id":"91ab05ad-54cd-4fa8-8b9a-9afb407ed4eb","text":"Pros: \r\n- Introduction of a nice filter banks and its implementation\r\n- Good numerical results\r\n- Refinement of the representation via back propagation, and a demonstration that it speeds up learning\r\n\r\nCons:\r\n- The algorithms (section 3.1) are not necessary, and they even affect the presentation of the paper..- The algorithms (section 3.1) are not necessary, and they even affect the presentation of the paper. However, the authors could improve the clarity of their explanation in this section to provide a better understanding of the proposed Fast Chirplet Transform (FCT).","label":45}
{"id":"b320e628-1e82-4113-b910-259be00e11e9","text":"The authors advocate use of chirplets as a basis for modeling audio signals. They introduce a fast chiplet transform for efficient computation. Also introduced is the idea of initializing (pre-training) CNN layers to mimic chirplet transform of audio signal (similar to ideas proposed by Mallet et al. on pre-training) CNN layers to mimic chirplet transform of audio signal (similar to ideas proposed by Mallet et al. on signal representation learning). The authors provide evidence of the effectiveness of their proposed method by conducting experiments on different datasets including Orca recordings, Bird species, and the Speech TIMIT dataset. The results show that the Fast Chirplet Transform (FCT) not only accelerates the training of CNN but also improves the classification accuracy for both birds and vowels. The findings suggest that FCT has the potential to be used as a bioacoustic representation in deep machine listening and can also facilitate inter-species bioacoustic transfer learning. However, further research is required to explore its applicability to other domains and to compare it with existing techniques.","label":49}
{"id":"59ad0cbd-3e0e-4956-9e1b-69aaeba6e99d","text":"Thank you for an interesting read.\r\n\r\nGiven the huge interest in generative modelling nowadays, this paper is very timely and does provide very clear connections between methods that don't use maximum likelihood for training. It made a very useful observation that the generative and the discriminative loss do **not** need to be coupled with each other..The paper effectively highlights the importance of likelihood-free inference methods and the principle of hypothesis testing in implicit generative models. It also emphasizes the use of density ratio estimation as a means of addressing the general problem. The discussion on the different approaches for density ratio estimation, including the use of classifiers to distinguish real from generated data, divergence minimization, and moment matching, provides a comprehensive understanding of the topic. The synthesis of these approaches with the broader literature on generative adversarial networks (GANs) is insightful and demonstrates the potential for cross-pollination and future exploration. Overall, this paper makes a valuable contribution to the field of generative modeling and provides a solid foundation for further research in the area. I look forward to seeing how these ideas are extended and applied in future work.","label":55}
{"id":"446b0b52-55b0-45b7-8198-a85add14e1b7","text":"Hello Authors,\r\n\r\nCongratulations on the acceptance of the paper.\r\n\r\nI've just reread parts of the revised paper and noticed a few things that you might want to consider and change before the camera-ready deadline.\r\n\r\n* You now include a reference to KLIEP after Eqn. (16), but this procedure is in fact known as least-squares importance estimation.\r\n* in turn, Eqn. (14) is actually more akin to KLIEP, the main concerns with Eqn. (14) are that it seems to be using a similar idea as KLIEP, but the notation and derivation are unclear. It would be helpful if you could provide more explanation or clarify the connection to KLIEP in this section. Additionally, it would be beneficial to expand on the other approaches for density ratio estimation that you mentioned, such as divergence minimisation and moment matching. These concepts are briefly mentioned, but it would be valuable to have a deeper discussion and analysis of their relationships to GANs and the wider literature. Overall, these suggestions will enhance the clarity and depth of your paper and improve its potential impact. Keep up the great work!","label":65}
{"id":"0a220b32-3c01-4983-a36d-1a39c3b44113","text":"This paper provides a unifying review of recent developments in implicit generative models, with a focus on the GAN framework. The authors explore the connections between GANs and other algorithms for learning in implicit generative models, such as density ratio estimation and hypothesis testing. They provide a comprehensive overview of different approaches for density ratio estimation, including using classifiers, divergence minimisation, and moment matching. The paper also identifies areas for future exploration and cross-pollination in the field. Overall, this review offers valuable insights into the current state of research on learning in implicit generative models and provides a roadmap for further advancements.","label":6}
{"id":"4f8d8e3f-2ed5-4231-bad3-76518f2306b7","text":"The reviewers have two common concerns (1) relevance of this paper to ICLR, and (2) its novelty. We address the common concerns here and address other questions individually.\r\n\r\nThe aim of our paper is to review different approaches for learning in implicit generative models; GANs are a special case of implicit generative models and our work helps understand connections between GAN variants as well as understand how GANs are related to the wider statistical literature..In terms of relevance to ICLR, our paper contributes to the field by providing a comprehensive review of different approaches for learning in implicit generative models, with a focus on GANs. This is an important and timely topic in machine learning, as GANs have gained significant attention and have been applied to a wide range of tasks, including image synthesis, data augmentation, and anomaly detection. By understanding the connections between GAN variants and their relationship to the wider statistical literature, our paper provides valuable insights and guidance for researchers working in this area. Furthermore, our work highlights the importance of hypothesis testing as a principle for learning in implicit generative models, which can be applied to other areas of machine learning beyond GANs.In terms of novelty, while there have been previous reviews and surveys on GANs, our paper distinguishes itself by placing GANs within the wider landscape of algorithms for learning in implicit generative models. We provide a comprehensive analysis of different approaches for density ratio estimation, including the use of classifiers, divergence minimization, and moment matching. By synthesizing these views and relating them to the wider literature, we offer a fresh perspective and identify avenues for future exploration and cross-pollination. Our paper also introduces likelihood-free inference methods, which provide an alternative to traditional likelihood-based approaches for learning in implicit generative models.We are confident that our paper will make a valuable contribution to the ICLR community by providing a comprehensive and insightful review of the current state of the art in learning in implicit generative models, with a focus on GANs. Our work bridges the gap between machine learning and statistics, and provides valuable insights and guidance for researchers in both fields. We believe that the reviewer's concerns about relevance and novelty are adequately addressed, and we look forward to the opportunity to present our findings at ICLR.","label":74}
{"id":"645cf333-e7a2-4990-9634-bf9a63298a50","text":"I just noticed I submitted my review as a pre-review question - sorry about this. Here it is again, with a few more thoughts added...\r\n\r\nThe authors present a great and - as far as I can tell - accurate and honest overview of the emerging theory about GANs from a likelihood ratio estimation\/divergence minimisation perspective. It is well written and a good read, and one I would recommend to people who would like to get involved in GANs.\r\n\r\nMy main problem with this submission is that it is hard as a reviewer to pin down what precisely the novelty is - beyond perhaps articulating these views better than other papers have done in the past. A sentence from the paper \"But it has left us unsatisfied since we have not gained the insight needed to choose between them.\u201d summarises my feeling about this paper: this is a nice 'unifying review\u2019 type paper that - for me - lacks a novel insight.\r\n\r\nIn summary, my assessment is mixed: I think this is a great paper, I enjoyed reading it. I was left a bit disappointed by the lack of novel insight, or a singular key new idea which you often expect in conference presentations, and this is why I\u2019m not highly confident about this as a conference submission (and hence my low score) I am open to be convinced either way.\r\n\r\nDetailed comments:\r\n\r\nI think the authors should probably discuss the lack of novel insight more explicitly in the paper. While the authors provide a comprehensive overview of the existing literature and approaches to learning in implicit generative models, it would greatly benefit the readers if they can highlight the unique contribution or perspective they bring to this field. This could be achieved by identifying a specific problem or limitation in the current approaches and proposing a novel solution or framework that addresses it. Without this clear novelty, the paper runs the risk of being perceived as just a compilation of existing knowledge.\r\n\r\nAdditionally, I would suggest that the authors expand on their evaluation of GANs and other approaches in terms of their strengths and weaknesses. While it is mentioned that GANs have several appealing properties, it would be helpful to have a more detailed comparison with other methods in the same space. This could include discussing the limitations of GANs, such as mode collapse or training instability, and how other approaches attempt to overcome these challenges.\r\n\r\nFurthermore, I noticed that the paper briefly mentions the use of hypothesis testing as a principle for learning in implicit generative models. It would be valuable if the authors could provide more concrete examples or case studies where hypothesis testing has been successfully applied in this context. This would help to strengthen the argument for the relevance and effectiveness of this approach.\r\n\r\nLastly, I would like to see more discussion on the practical implications and potential applications of implicit generative models, particularly in domains beyond computer vision. While the paper mentions econometrics and approximate Bayesian computation, it would be interesting to explore how these models can be applied to other fields, such as natural language processing or healthcare.\r\n\r\nOverall, I believe the paper has the potential to make a significant contribution to the understanding and development of implicit generative models. 
However, in order to do so, the authors should focus on identifying and highlighting their unique insights or contributions, as well as providing a more in-depth analysis and evaluation of GANs and other approaches in this field. I look forward to seeing how the authors address these suggestions in the revised version.","label":235}
{"id":"d199c412-fcbc-43c4-a078-d4914d9271b4","text":"The paper provides an exposition of multiple ways of learning in implicit generative models, of which generative adversarial networks are an example.. It explores the wider landscape of algorithms for learning in implicit generative models and relates these ideas to modeling problems in econometrics and approximate Bayesian computation. The paper also delves into likelihood-free inference methods and highlights hypothesis testing as a principle for learning in implicit generative models. By deriving the objective function used by GANs and other related objectives, the paper emphasizes the general problem of density ratio estimation. It discusses four approaches for density ratio estimation, including the use of classifiers to distinguish real from generated data. Additionally, the paper synthesizes different views on divergence minimization and moment matching, offering a comprehensive understanding of their relationships and their connections to the broader literature. The paper concludes with potential avenues for future exploration and cross-pollination between these various approaches.","label":22}
{"id":"9829a3f1-70fe-4722-a666-903ad716cb99","text":"Thank you for an interesting read.\r\n\r\nGiven the huge interest in generative modelling nowadays, this paper is very timely and does provide very clear connections between methods that don't use maximum likelihood for training. It made a very useful observation that the generative and the discriminative loss do **not** need to be coupled with each other. I think this paper in summary provides some insights into the connections between different approaches for learning in implicit generative models. The paper effectively discusses the various methods for density ratio estimation, including classifier-based solutions, divergence minimization, and moment matching. By synthesizing these views, the paper offers a comprehensive understanding of the relationships between different approaches and their connections to the wider literature. Furthermore, the paper highlights the importance of hypothesis testing as a principle for learning in implicit generative models and derives the objective function used by GANs and similar models. Overall, this paper presents a valuable contribution to the field of generative modeling and provides a foundation for future research and cross-pollination between different methods.","label":63}
{"id":"9fda1952-0d6f-473c-8383-cac607782b25","text":"Thank you for an interesting read.\r\n\r\nGiven the huge interest in generative modelling nowadays, this paper is very timely and does provide very clear connections between methods that don't use maximum likelihood and generative adversarial networks (GANs) which rely on likelihood-free inference. The paper effectively highlights the importance of hypothesis testing as a principle for learning in implicit generative models, linking this viewpoint to density ratio estimation. I particularly appreciate the synthesis of different approaches for density ratio estimation, including divergence minimization and moment matching, and how they are related to the GAN literature. The paper also acknowledges the wider landscape of algorithms for learning in implicit generative models, connecting them to fields like econometrics and approximate Bayesian computation. This interdisciplinary perspective adds depth to the discussion and offers potential avenues for future research and cross-pollination. Overall, this paper contributes to the growing body of knowledge on GANs and implicit generative models, providing valuable insights and connections to related statistical thinking. I recommend further exploring the ideas presented in this paper and its potential impact on the field of generative modelling.","label":31}
{"id":"92527daa-fd48-41c7-a689-501e8c76aeaa","text":"Hello Authors,\r\n\r\nCongratulations on the acceptance of the paper.\r\n\r\nI've just reread parts of the revised paper and noticed a few things that you might want to consider and change before the camera-ready deadline.\r\n\r\n* You now include a reference to You now include a reference to the wider landscape of algorithms for learning in implicit generative models and relate these ideas to modeling problems in related fields. This broadens the scope of the paper and provides valuable connections to statistical thinking. Additionally, you mention developing likelihood-free inference methods and highlight hypothesis testing as a principle for learning in implicit generative models. This testing viewpoint brings attention to the general problem of density ratio estimation and influences the objective function used by GANs and other related objectives. I believe this perspective adds depth to the paper and contributes to the understanding of GANs in the context of density ratio estimation. You also mention exploring different approaches for density ratio estimation, including divergence minimization and moment matching. Synthesizing these views and discussing their relationships with the wider literature is a valuable contribution as it highlights avenues for future exploration and cross-pollination. Overall, I think these additions enhance the paper and make it more comprehensive in its coverage of learning in implicit generative models.","label":38}
{"id":"d4e7770c-3ae5-4240-bcce-80e1d81c3497","text":"This paper provides a unifying review of various forms of generative model..The paper provides a comprehensive overview and analysis of the current state of the field, addressing various aspects of learning in implicit generative models. It explores the connections and relationships between generative adversarial networks (GANs) and other methods for density ratio estimation, highlighting the potential for cross-pollination and future exploration. The authors also propose likelihood-free inference methods and emphasize the importance of hypothesis testing in this context. Overall, this review offers valuable insights and directions for further research in the area of implicit generative models.","label":12}
{"id":"991cf55e-603c-48db-90cb-52c1d62413dc","text":"The reviewers have two common concerns (1) relevance of this paper to ICLR, and (2) its novelty. We address the common concerns here and address other questions individually.\r\n\r\nThe aim of our paper is to review different approaches for learning in implicit generative models; GANs are a special case of implicit generative models and our work helps understand connections between GAN variants as well as understand how GANs are related to the wider statistical literature..In terms of relevance to ICLR, our paper provides a comprehensive review and synthesis of different approaches for learning in implicit generative models, which includes GANs as a special case. This topic is highly relevant to the machine learning community and aligns with the interests of ICLR. Implicit generative models have gained significant attention in recent years, and understanding their learning approaches is crucial for advancements in the field. Moreover, our paper goes beyond just presenting the existing approaches; it also highlights the connections between these approaches and the wider statistical literature. This broader perspective can inspire researchers to explore new directions and facilitate cross-pollination between different communities.\r\n\r\nAs for the novelty of our work, while it is true that GANs and their variants have been extensively studied, our contribution lies in providing a comprehensive framework that contextualizes GANs within the landscape of learning in implicit generative models. By connecting GANs to other related models and highlighting the underlying principles for learning in these models, we offer new insights and opportunities for future research. Furthermore, our paper presents a detailed exploration of density ratio estimation, a fundamental problem in implicit generative models, and synthesizes different approaches, including classifier-based solutions, divergence minimization, and moment matching. This comprehensive analysis adds novelty to our work and enhances the understanding of density ratio estimation for the machine learning community.\r\n\r\nIn summary, our paper provides a valuable contribution to the field of learning in implicit generative models, with GANs as a key focus. Our comprehensive framework, analysis of density ratio estimation, and connections to the wider statistical literature make our work highly relevant and novel. We believe that this paper merits consideration for publication at ICLR.","label":74}
{"id":"a1c50ee7-b62e-4f75-858e-0a39556c7e0d","text":"The paper provides an exposition of multiple ways of learning in implicit generative models, of which generative adversarial networks are an example. The paper is very clear, the exposition is insightful, and the presented material is clearly important.\r\n\r\nIt is It is evident that the authors have put considerable effort into understanding GANs and their role within the broader context of learning in implicit generative models. The paper effectively highlights the connections between GANs and other methods used in related fields like econometrics and approximate Bayesian computation. The emphasis on hypothesis testing as a guiding principle for learning in implicit generative models is particularly insightful, as it sheds light on the objective function used by GANs and its relationship to density ratio estimation. The synthesis of different approaches for density ratio estimation, such as divergence minimisation and moment matching, adds depth to the analysis and paves the way for future exploration in this area. Overall, this paper provides a valuable contribution to the understanding of learning in implicit generative models and opens up avenues for further research and cross-pollination among different disciplines.","label":39}
{"id":"e96aac25-77f5-4f9e-a0ab-632408a15cdb","text":"Thank you for an interesting read.\r\n\r\nGiven the huge interest in generative modelling nowadays, this paper is very timely and does provide very clear connections between methods that don't use maximum likelihood for training. It made a very useful observation that the generative and the discriminative loss do **not** need to be coupled with each other. I think this paper in summary provides a valuable contribution to the field of generative modelling. The authors successfully frame GANs within a broader context of learning in implicit generative models and highlight the connections to related fields like econometrics and approximate Bayesian computation. The paper's focus on likelihood-free inference methods and hypothesis testing offers a fresh perspective on training implicit generative models. I especially appreciate the thorough exploration of density ratio estimation approaches, including the use of classifiers in distinguishing real from generated data. By synthesizing different views and linking them to the wider literature, the authors provide a comprehensive understanding of the subject. Overall, this paper presents a compelling argument for the importance of learning in implicit generative models and opens up exciting avenues for future research and cross-pollination with other domains. Well done!","label":62}
{"id":"d0994042-443a-463a-aa74-6cc16aa3de87","text":"The authors develop a way learn subspaces of multiple views such that data point neighborhoods are similar in all of the views. This similarity is measured between distributions of neighbors in pairs of views. The motivation is that this is a natural criterion for information retrieval.\r\n\r\nI like the idea of preserving neighborhood relationships across views for retrieval tasks. And it is nice that the learned spaces can have different dimensionalities for different views. However, the empirical validation seems preliminary.\r\n\r\nThe paper has been revised from the authors' ICLR 2016 submission, and the revisions are welcome, but I appreciate the authors' efforts to improve the clarity and organization of the paper. The introduction provides a good background and motivation for the problem, and the research questions are clearly stated. The authors also do a good job of explaining the limitations of existing approaches, such as Canonical Correlation Analysis (CCA), and the need for a new method that directly optimizes the task of neighbor retrieval between multiple views.\r\n\r\nOne strength of the proposed method is its ability to capture nonlinear and local similarities in the data. By optimizing the mappings for each view, the method is able to detect dependencies between data relationships beyond just individual data coordinates. The use of well-understood measures of information retrieval quality further strengthens the methodology.\r\n\r\nThe experiments conducted to validate the proposed method are comprehensive and well-designed. The authors compare their method to alternative approaches and demonstrate that it outperforms them in preserving cross-view neighborhood similarities. The insights gained from analyzing the local dependencies between multiple views provide a valuable contribution to the field.\r\n\r\nHowever, there are a few areas that could be further improved. Firstly, the paper could benefit from providing more details about the algorithmic implementation of the proposed method. This would help readers better understand the steps involved and potentially replicate the experiments.\r\n\r\nAdditionally, the authors could discuss the scalability of their method. While the experiments show promising results, it would be helpful to know how the method performs on larger datasets and whether it can handle real-world applications with high-dimensional data.\r\n\r\nLastly, the paper could benefit from a more extensive discussion on the implications of the findings. The authors briefly mention that their method provides insights into local dependencies between views, but it would be interesting to explore how these insights can be practically applied in different domains.\r\n\r\nIn conclusion, the paper presents a novel method for finding dependent subspaces of multiple views, optimized for the task of neighbor retrieval. The empirical validation demonstrates the effectiveness of the proposed method, although further details on implementation, scalability, and practical applications would enhance the contribution. Overall, the paper makes a valuable contribution to the field of information retrieval and provides a solid foundation for future research in this area.","label":98}
{"id":"d0597cc7-f33a-4729-865a-e394ea725931","text":"The reviewers agree that there are issues in the paper's methodology, specifically the introduction of a new method for finding dependent subspaces of multiple views, is novel and directly optimized for the data analysis task of neighbor retrieval. The proposed method outperforms alternatives in preserving cross-view neighborhood similarities and offers insights into local dependencies between multiple views.","label":8}
{"id":"44bc93de-9532-45fc-8e35-23c5997b9211","text":"This paper proposes a multiview learning approach to finding dependent subspaces optimized for maximizing cross-view similarity between neighborhoods of data samples. The motivation comes from information retrieval tasks. Authors position their work as an alternative to CCA-based multiview learning; note, however, that CCA based techniques have very different purpose and are rather broadly applicable than the setting considered here. Main points: \r\n\r\n- I am not sure what authors mean by time complexity. It would appear that they simply report the computational cost of evaluating the objective in equation (7). Is there a sense of how many iterations of the L-BFGS method?.Regarding the time complexity, the authors do not provide a clear explanation in the paper. From what I understand, the authors seem to refer to the computational cost of evaluating the objective function in equation (7), but they do not specify how many iterations of the L-BFGS method were performed. It would be helpful if the authors could provide more details on the computational complexity of their method, including the number of iterations required for convergence. Additionally, it would be beneficial to compare the time complexity of their approach with other existing methods, such as CCA-based techniques, to understand the efficiency of their proposed method. Overall, the paper presents a novel approach for finding dependent subspaces of multiple views that is optimized for neighbor retrieval between views. The method is well-motivated and the criterion used is directly related to the retrieval task, capturing nonlinear and local similarities. The experimental results also demonstrate that the proposed method outperforms alternatives in preserving cross-view neighborhood similarities and provides insights into local dependencies between multiple views. However, providing more details on the time complexity and comparing it with other methods would strengthen the paper's contribution and make it more comprehensive.","label":101}
{"id":"21722016-ad3e-4d52-85ac-e0393e5a9334","text":"This paper presents an multi-view learning algorithm which projects the inputs of different views (linearly) such that the neighborhood relationship (transition probabilities) agree across views.\r\n\r\nThis paper has good motivation--to study multi-view learning from a more information retrieval perspective. The authors propose a novel method for finding dependent subspaces of multiple views, specifically optimized for the task of neighbor retrieval between views. They introduce linear transformations for each view to maximize cross-view similarity between neighborhoods of data samples. This approach differs from the traditional Canonical Correlation Analysis (CCA), which only considers correlation between data coordinates. By formulating the criterion based on the retrieval task, the proposed method captures nonlinear and local similarities in data relationships, moving beyond individual data coordinates. This is an important contribution, as it aligns with the information retrieval perspective. The experiments conducted in the paper demonstrate the effectiveness of the proposed approach. It outperforms alternative methods in preserving cross-view neighborhood similarities, highlighting its superiority in capturing meaningful dependencies between multiple views. This is crucial both for exploratory analysis and predictive tasks, as finding relationships between views can significantly enhance the understanding of complex datasets. The insights obtained from the proposed method shed light on local dependencies, providing valuable knowledge about the underlying structure of the data.One strength of the paper is its clear presentation of the methodology. The authors provide a detailed description of the optimization process for finding dependent subspaces and explain the rationale behind the choice of linear transformations. Additionally, the authors show that the proposed method is related to well-understood measures of information retrieval quality, which adds to its credibility.However, there are some areas that could be further improved. Firstly, the authors might consider discussing the limitations of their approach, acknowledging potential challenges or scenarios where it may not perform optimally. Additionally, more comparisons with existing state-of-the-art methods would strengthen the evaluation of the proposed approach.In conclusion, the paper presents a promising approach for finding dependent subspaces of multiple views, particularly suitable for neighbor retrieval tasks. The methodology is well-formulated, and the experimental results provide evidence of its superiority over alternative methods. With some minor revisions and additional comparisons, this paper has the potential to make a significant impact in the field of multi-view learning.","label":38}
{"id":"94b9603b-db37-40d8-b516-32397de68999","text":"The authors develop a way learn subspaces of multiple views such that data point neighborhoods are similar in all of the views. This similarity is measured between distributions of neighbors in pairs of views. The motivation is that this is a natural criterion for information retrieval.\r\n\r\nI like the idea of preserving neighborhood relationships across views for retrieval tasks. And it is nice that the learned spaces can have different dimensionalities for different views. However, the empirical validation seems preliminary.\r\n\r\nThe paper has been revised from the authors' ICLR 2016 submission, and the revisions are welcome, but I think the paper still needs more work in order to be considered for publication. First, the authors should provide more details about the experiments conducted to evaluate their proposed method. The current explanation of the experiments is insufficient, and it is unclear how the proposed method outperforms the alternatives. Additionally, the authors should compare their approach to existing methods in the literature to demonstrate the novelty and effectiveness of their method. This would provide a better understanding of the contributions of the proposed method. Furthermore, the authors should discuss the limitations of their approach and potential areas for future research. It would be helpful to identify any assumptions made in the proposed method or limitations in its applicability. This would provide a clearer understanding of the scope and potential drawbacks of using the method in different scenarios. Another point that requires improvement is the clarity of the writing. While the paper presents an interesting concept, the explanations and descriptions are not always clear and might be challenging for readers to follow. The authors should revise the manuscript to provide clearer explanations of the proposed method, the experimental setup, and the results obtained. Additionally, the paper would benefit from the inclusion of more visual aids, such as graphs or diagrams, to better illustrate the concepts and results. Visualizations can help readers understand complex ideas more easily and provide further support to the authors' claims.Finally, it would be valuable if the authors discussed the potential applications of their proposed method beyond the specific case of information retrieval. By exploring other domains or problem settings where the proposed method could be applied, the authors could highlight the versatility and relevance of their approach.In summary, while the idea of finding dependent subspaces of multiple views for information retrieval is intriguing, the current version of the paper requires further improvements. The authors should provide more details on the experiments, compare their method to existing approaches, address potential limitations, enhance the clarity of the writing, include visual aids, and discuss potential applications in other domains. By addressing these points, the authors can strengthen the paper and solidify the contributions of their proposed method.","label":107}
{"id":"2241ed13-1960-4596-9fd7-f808ee44d28c","text":"The authors develop a way learn subspaces of multiple views such that data point neighborhoods are similar in all of the views. This similarity is measured between distributions of neighbors in pairs of views. The motivation is that this is a natural criterion for information retrieval.\r\n\r\nI like the idea of preserving neighborhood relationships across views for retrieval tasks. And it is nice that the learned spaces can have different dimensionalities for different views. However, the empirical validation seems preliminary.\r\n\r\nThe paper has been revised from the authors' ICLR 2016 submission, and the revisions are welcome, but I think the paper still needs more work in order to be publishable. In its current form it could be a good match for the workshop track.\r\n\r\nThe experiments are all on very small data sets (e.g. 2000 examples in each of train\/test on the MNIST task) and not on real tasks. The authors do not provide enough justification for why these small and synthetic datasets are representative of real-world tasks. It would be more convincing if the authors could include experiments on larger and more complex datasets to demonstrate the effectiveness of their method.\r\n\r\nAnother weakness of the paper is that the authors do not compare their method with existing approaches for finding relationships between multiple views, such as Canonical Correlation Analysis (CCA). Although the authors mention that CCA is a prominent approach, they do not provide any comparisons or benchmarks to demonstrate how their method outperforms CCA or other alternatives. Including such comparisons would strengthen the paper and provide a better understanding of the advantages of the proposed method.\r\n\r\nFurthermore, while the paper mentions that the proposed method detects nonlinear and local similarities, it would be helpful if the authors could provide more details on how their method achieves this. Are there any specific techniques or algorithms used to capture nonlinear and local dependencies between the data samples? Providing more insights into the methodology would enhance the clarity and reproducibility of the results.\r\n\r\nOverall, the paper presents an interesting approach for finding dependent subspaces of multiple views and preserving cross-view neighborhood similarities. However, further improvements are needed in terms of empirical validation, comparisons with existing methods, and providing more details on the methodology. With these enhancements, the paper has the potential to make a valuable contribution to the field of information retrieval and exploratory analysis.","label":152}
{"id":"b1d45b88-1ab9-4bfe-9a1f-bd91221e1356","text":"The reviewers agree that there are several strengths to this paper. The proposed method offers a novel approach to finding dependent subspaces of multiple views, optimizing for the specific task of neighbor retrieval. By introducing nonlinear and local similarities, as well as measuring the dependency of data relationships rather than just individual coordinates, the method shows promise in improving cross-view neighborhood preservation. The experiments conducted also provide valuable insights into local dependencies between multiple views.","label":6}
{"id":"f38c70a8-4c67-4d51-9def-e0ad17648bb1","text":"This paper proposes a multiview learning approach to finding dependent subspaces optimized for maximizing cross-view similarity between neighborhoods of data samples. The motivation comes from information retrieval tasks. Authors position their work as an alternative to CCA-based multiview learning; note, however, that CCA based techniques have very different purpose and are rather broadly applicable than the setting considered here. Main points: \r\n\r\n- I am not sure what authors mean by time complexity. It would appear that they simply report the computational cost of evaluating the objective in equation (7). Is there a sense of how many iterations of the L-BFGS method? Since that is going to be difficult given the nature of the optimization problem, one would appreciate more information on the time complexity of the proposed method. Additionally, it would be helpful to provide some comparative analysis with existing CCA-based approaches to better understand the advantages and limitations of the proposed method. Overall, the paper presents an interesting approach to finding dependent subspaces and maximizing cross-view similarity for information retrieval tasks. The motivation and problem formulation are well-described, and the method appears to be promising based on the experimental results. However, there are a few areas that could be improved in terms of clarity and thoroughness, such as providing more details on the time complexity and providing a comparative analysis with CCA-based approaches. With these revisions, the paper would further strengthen its contribution to the field of multiview learning and information retrieval.","label":118}
{"id":"78ca4ac5-e53c-4995-90ae-9201ed0a1979","text":"This paper presents an multi-view learning algorithm which projects the inputs of different views (linearly) such that the neighborhood relationship (transition probabilities) agree across views.\r\n\r\nThis paper has good motivation--to study multi-view learning from a This paper has good motivation--to study multi-view learning from a perspective that goes beyond simple correlation analysis and focuses on the specific task of neighbor retrieval between multiple views. The authors propose a novel method that optimizes linear transformations for each view to maximize cross-view similarity between neighborhoods of data samples. This approach addresses some limitations of traditional Canonical Correlation Analysis (CCA) by directly optimizing for the data analysis task at hand. The criterion used in this method takes into account the well-defined retrieval task, capturing nonlinear and local similarities, and measuring dependency of data relationships rather than only individual data coordinates. This is a significant contribution as it allows for a more comprehensive understanding of local dependencies between multiple views. The experimental evaluation presented in the paper demonstrates the superiority of the proposed method compared to alternatives in preserving cross-view neighborhood similarities. This reaffirms the effectiveness of the approach in addressing the specific task of neighbor retrieval. Additionally, the insights gained from the experiments shed light on the local dependencies between multiple views, further validating the usefulness of the proposed method. However, there are a few areas that could be further improved in this paper. Firstly, the authors could provide more details about the specific datasets used in the experiments, including their characteristics and how they relate to real-world scenarios. Secondly, the authors could consider discussing the computational complexity of their method compared to existing approaches, as this can be an important factor in practical applications. Lastly, it would be valuable to explore additional evaluation metrics or scenarios to further demonstrate the generalizability and robustness of the proposed method. Overall, this paper presents a novel approach for finding dependent subspaces of multiple views, which has the potential to contribute significantly to the field of multi-view learning.","label":34}
{"id":"1466022e-dc33-4215-86f2-ddf395d64a5a","text":"The authors develop a way learn subspaces of multiple views such that data point neighborhoods are similar in all of the views. This similarity is measured between distributions of neighbors in pairs of views. The motivation is that this is a natural criterion for information retrieval.\r\n\r\nI like the idea of preserving neighborhood relationships across views for retrieval tasks. And it is nice that the learned spaces can have different dimensionalities for different views. However, the empirical validation seems preliminary.\r\n\r\nThe paper has been revised from the authors' ICLR 2016 submission, and the revisions are welcome, but I think the paper still needs more work in order to be publishable. \r\n\r\nFirstly, the paper lacks a thorough comparison with existing methods. While the proposed method outperforms alternatives in preserving cross-view neighborhood similarities, it would strengthen the paper to include a more extensive evaluation against a broader range of state-of-the-art techniques. This would provide a more comprehensive understanding of the method's strengths and limitations.\r\n\r\nAdditionally, the experimental results could benefit from more detailed analysis. While the authors mention that the proposed method yields insights into local dependencies between multiple views, they do not elaborate on these insights. Providing specific examples or case studies that highlight the importance and value of these local dependencies would make the paper more compelling.\r\n\r\nIn terms of the writing style, the paper is generally well-written and structured. However, there are a few areas that could be improved. The introduction and related work sections could be expanded to provide a more comprehensive overview of the existing literature in the field of information retrieval and multi-view learning. This would help the reader better understand the context and significance of the proposed method.\r\n\r\nFurthermore, the methodology section could benefit from more clarity and detail. While the main ideas are explained adequately, additional explanations and illustrations of the optimization process and the specific criteria used for evaluating cross-view neighborhood similarities would enhance the reproducibility of the method.\r\n\r\nOverall, the paper presents an interesting approach for finding dependent subspaces of multiple views optimized for information retrieval tasks. However, further evaluation and analysis, along with improvements in the writing style and methodology section, are necessary for the paper to be considered publishable. I would recommend the authors to address these concerns before resubmitting the paper.","label":111}
{"id":"4e8940f8-e9aa-41c5-b3bb-9c7392f20d84","text":"This paper proposed to use Generalized Advantage Estimation (GAE) to optimize DNNs for information seeking tasks. The task is posed as a reinforcement learning problem where agents are required to gather information efficiently in a partially-observed environment. The authors combine deep architectures with reinforcement learning techniques to develop agents capable of actively searching for fragments of information to accomplish various goals. The key contribution of this paper lies in the combination of extrinsic and intrinsic rewards to shape the behavior of the agents. Through empirical evaluation, the authors demonstrate that the proposed agents effectively reduce uncertainty by actively and intelligently seeking new information, while also exploiting the information they have already acquired. The use of Generalized Advantage Estimation (GAE) for optimizing the deep neural networks in this information-seeking framework is also highlighted. Overall, this paper provides valuable insights into the training and testing of information-seeking agents and sheds light on the potential of combining deep architectures with reinforcement learning techniques for solving complex, partially-observed tasks.","label":24}
{"id":"87885855-6b42-4b52-bafe-026d8d179305","text":"This paper proposes information gain as an intermediate reward signal to train deep networks to answer questions. The motivation and model are interesting, however the experiments fail to deliver. There is a lack of comparative simple baselines, the performance of the model is not sufficiently analyzed, and the actual implementation details of the model are not clearly explained. Additionally, the evaluation metrics used to measure the performance of the model are not discussed in depth. The paper would benefit from providing more details on the experimental setup and addressing these limitations. Overall, the paper presents a promising approach for training information-seeking agents, but further improvements are necessary to strengthen its contribution.","label":49}
{"id":"6ea86241-3bab-431c-bb52-762319c15edd","text":"Pros:\r\n\r\n* The general idea behind the paper seems pretty novel and potentially quite cool.\r\n* The specific technical implementation seems pretty reasonable and well-thought through.\r\n* The general types of the tasks that they try out their approach on spans a wide and interesting spectrum of cognition abilities. \r\n* The writing is pretty clear. I basically felt like I could replicate much of what they did from their paper descriptions. \r\n\r\n\r\nCons:\r\n\r\n* The evaluation of the success of these ideas, as compared to other possible approaches, or as compared to human performance on similar tasks, is extremely cursory. \r\n\r\n* The specific tasks that they try are quite simple. I really don't know whether their approach is better than a bunch of simpler things on these tasks. \r\n\r\nTaking these two cons together, it feels like the authors basically get the implementation working, but they don't thoroughly explore its effectiveness or compare it to other potential approaches or human performance on similar tasks. Additionally, the simplicity of the tasks they test their approach on raises questions about whether their method is truly superior or if a simpler approach could achieve similar results. Overall, it seems that the authors have made progress in developing a novel idea and implementing it in a technically sound manner. However, the lack of thorough evaluation and the simplicity of the tasks leave room for further exploration and validation of their approach. It would be beneficial for the authors to conduct more extensive experiments and comparisons to provide a more robust evaluation of their information-seeking agents. Doing so would strengthen their claims and contribute to a deeper understanding of their approach and its potential applications.","label":142}
{"id":"0bf01705-3e94-4ad4-bdd8-a75a1937eb34","text":"This paper proposes a setting to learn models that will seek information (e.g., by asking question) in order to solve a given task. They introduce a set of tasks that were designed for that goal. They show that it is possible to train models to solve these tasks with reinforcement learning.\r\n\r\nOne key motivation for the tasks proposed in this work are the existence of games like 20Q or battleships where an agent needs to ask questions to solve a given task. It is quite surprising that the authors do not actually consider these games as potential tasks to explore (beside the Hangman). It is also not completely clear how the tasks have been selected. A significant amount of work has been dedicated in the past to understand the property of games like 20Q (e.g., Navarro et al., 2010) and how humans solve them..Additionally, while the authors focus on developing agents that actively seek information, they do not address the potential ethical implications of this type of behavior. In the real world, information-seeking agents could potentially invade individuals' privacy or expose sensitive information. It would have been beneficial for the authors to discuss the importance of incorporating privacy and security measures into the design and training of these agents. Another aspect that could have been explored further is the transferability of the learned models to different domains. The authors demonstrate the effectiveness of their approach on the specific tasks they introduce but do not investigate whether the agents can generalize their information-seeking abilities to other environments or tasks. Understanding the generalization capabilities of these models would provide valuable insights into their potential applications in various real-world scenarios.Furthermore, the authors mention combining extrinsic and intrinsic rewards to shape the behavior of their agents. Although they briefly touch upon the concept, they do not delve into the specific formulation of these rewards or how they are balanced. Elaborating on this aspect would have provided a deeper understanding of how the agents learn to actively search for new information and exploit existing knowledge.In terms of the empirical demonstration of the agents' learning capabilities, the authors provide a thorough evaluation of their models, showing that the agents successfully reduce uncertainty and efficiently gather and utilize information. However, it would have been beneficial for the authors to include a comparison with existing information-seeking models or other relevant baselines to assess the performance of their approach against previous work in the field. This would have provided a clearer benchmark for understanding the significance of their results.Overall, the paper presents an intriguing framework for training information-seeking agents and introduces a set of tasks that facilitate the evaluation of these agents. The combination of deep architectures and reinforcement learning techniques appears to be a promising approach for developing intelligent agents capable of actively searching for information. However, there are several areas that could have been expanded upon or further investigated to strengthen the overall contribution of the work. 
Addressing the ethical implications, investigating transferability to different domains, providing more details on the formulation of rewards, and including comparisons with existing models would enhance the clarity and significance of the research findings.","label":143}
{"id":"058ee69f-4a2f-4c40-b8d2-0af9df0ff852","text":"This paper proposed to use Generalized Advantage Estimation (GAE) to optimize DNNs for information seeking tasks. The task is posed as a reinforcement learning problem and the proposed method explicitly promotes information gain to encourage exploration.\r\n\r\nBoth GAE and DNN have been used for RL before. The novelty in this paper seems to be the explicit modeling of information gain..The authors' approach of combining deep architectures with reinforcement learning techniques to train information-seeking agents is an interesting contribution to the field. By shaping the behavior of the agents using both extrinsic and intrinsic rewards, the authors demonstrate that the agents are able to actively and intelligently search for new information to reduce uncertainty and exploit previously acquired information. While Generalized Advantage Estimation (GAE) and deep neural networks (DNNs) have been used in reinforcement learning before, the novelty of this paper lies in the explicit modeling of information gain. This approach could have significant implications in various domains that require efficient information gathering and decision-making processes. However, further experiments and analysis are needed to fully evaluate the performance and scalability of the proposed method.","label":59}
{"id":"c6a2660d-59a0-405b-b4cb-b1ced30c30dd","text":"This paper proposed to use Generalized Advantage Estimation (GAE) to optimize DNNs for information seeking tasks. The task is posed as a reinforcement learning problem and the proposed approach combines deep architectures with techniques from reinforcement learning to develop agents that solve the information-seeking tasks presented in the paper. The authors effectively shape the behavior of these agents by utilizing both extrinsic and intrinsic rewards. Specifically, they apply Generalized Advantage Estimation (GAE) to optimize Deep Neural Networks (DNNs) for information-seeking tasks. By framing the problem as a reinforcement learning problem, the agents are trained to actively and intelligently search for fragments of information in a partially-observed environment. The authors empirically demonstrate that the trained agents successfully reduce uncertainty by actively seeking new information and effectively exploit the existing knowledge they have acquired. Overall, this work represents a promising and novel approach towards developing information-seeking agents that can efficiently gather information to accomplish various goals.","label":28}
{"id":"1aab44eb-9740-48fe-adf4-1bcbe1e59d67","text":"This paper proposes information gain as an intermediate reward signal to train deep networks to answer questions. The motivation and model are interesting, however the experiments fail to deliver. There is a lack of comparative simple baselines, the performance of the model is not well-documented, and the scalability of the approach is not adequately addressed. Additionally, the evaluation metrics used are not clearly defined, making it difficult to assess the effectiveness of the proposed methods. Overall, while the idea of using information gain as a reward signal is promising, further work is needed to address the limitations mentioned above and provide a more thorough evaluation of the approach.","label":44}
{"id":"7690d94a-31d7-4300-97eb-7b9bcb43f94e","text":"Pros:\r\n\r\n* The general idea behind the paper seems pretty novel and potentially quite cool.\r\n* The specific technical implementation seems pretty reasonable and well-thought through.\r\n* The general types of the tasks that they try out their approach on spans a wide and interesting spectrum of cognition abilities. \r\n* The writing is pretty clear..Cons:\r\n\r\n* However, the paper could benefit from a more detailed discussion of the limitations of their approach. While the tasks they present cover a wide range of cognitive abilities, it would be useful to know if there are any particular tasks that their agents struggle with and why.\r\n* Additionally, the empirical demonstration of the agent's learning capabilities could be strengthened. While the authors state that the agents learn to search actively and intelligently, it would be helpful to have more quantitative analysis and comparison with baseline methods to support this claim.\r\n* The paper would also benefit from more thorough explanations of the reinforcement learning techniques used in training the agents. Although the technical implementation is well-thought through, some parts of the methodology could be better explained, especially for readers unfamiliar with reinforcement learning.\r\n* Lastly, the evaluation metrics used to assess the agent's performance could be more comprehensive. It would be interesting to see how the agent's performance changes with variations in the environment complexity, task difficulty, or information fragment size.\r\n\r\nOverall, this paper presents an intriguing concept and demonstrates promising results in training information-seeking agents. However, to further strengthen the paper, addressing the above-mentioned limitations and providing additional analysis and evaluation would greatly enhance the contribution of this work.\r\n}","label":52}
{"id":"cc54e33a-83c6-45dc-b3dc-f5d28ebb6611","text":"This paper proposes a setting to learn models that will seek information (e.g., by asking question) in order to solve a given task. They introduce a set of tasks that were designed for that goal. They show that it is possible to train models to solve these tasks with reinforcement learning.\r\n\r\nOne key motivation for the tasks proposed in this work are the existence of games like 20Q or battleships where an agent needs to ask questions to solve a given task. It is quite surprising that the authors do not explore the possibility of using existing game environments to formulate their tasks. There are already well-established game environments for information-seeking tasks, such as text-based adventure games or question-answering games. It would have been interesting to see how the models developed in this work perform on these existing game environments and compare them to other state-of-the-art models.\r\n\r\nAdditionally, while the authors show that the proposed models successfully learn to search actively and gather information efficiently, it would have been beneficial to provide a more in-depth analysis of the limitations of the models. Are there any scenarios or task settings where the models struggle to effectively gather information? Are there cases where the models get stuck in local optima or fail to explore new avenues of information? Understanding these limitations could help guide future research in improving the proposed models.\r\n\r\nAnother aspect that could be further explored is the combination of different reinforcement learning techniques with the proposed deep architectures. The paper briefly mentions the use of intrinsic and extrinsic rewards to shape the behavior of the agents, but the specific techniques and algorithms used are not extensively discussed. Providing more details and insights into these reward mechanisms would enhance the understanding of the models and their training process.\r\n\r\nIt would also be interesting to see if the proposed models can generalize to other domains or tasks. While the tasks presented in this work are specifically designed for information-seeking, it would be valuable to investigate how well the trained models can adapt to different environments or problem domains. Can the models transfer their learned information-seeking strategies to new tasks without extensive retraining?\r\n\r\nOverall, this paper presents a promising approach to training information-seeking agents and introduces a set of tasks that can serve as benchmarks in this area. The empirical results demonstrate the effectiveness of the proposed models in actively searching for new information and exploiting acquired knowledge. The authors could further strengthen their work by comparing their models to existing game environments, providing a thorough analysis of the models' limitations, delving into the specifics of the reward mechanisms, and exploring the models' generalization capabilities. Building upon this foundation, future research can continue to advance the field of information-seeking agents and contribute to the development of intelligent systems that can actively acquire and utilize information to accomplish complex tasks.","label":89}
{"id":"c4a5ca36-8bc9-4981-8f4d-08a1f722e639","text":"This paper proposed to use Generalized Advantage Estimation (GAE) to optimize DNNs for information seeking tasks. The task is posed as a reinforcement learning problem and the proposed method explicitly promotes information gain to encourage exploration.\r\n\r\nBoth GAE and DNN have been used for RL before. The novelty in this paper seems to be the explicit promotion of information gain in the optimization of DNNs for information seeking tasks. This is achieved through the combination of deep architectures and reinforcement learning techniques. The authors demonstrate the effectiveness of their approach through empirical evaluation, showing that the agents are able to actively search for new information and exploit already acquired knowledge to reduce uncertainty. While both GAE and DNN have been utilized in RL before, the explicit focus on information gain sets this paper apart. This work opens up new avenues for research in training and testing agents to efficiently gather information, which has practical applications in various domains.","label":55}
{"id":"ee8493b9-b23c-485d-9805-bc75f4a4dcfa","text":"The authors propose a RNN-method for time-series classification with missing values, that can make use of potential information in missing values..This is a significant contribution to the field as it addresses the limited work on exploiting missing patterns for effective imputation and improving prediction performance. The use of deep learning models, specifically GRU-D, demonstrates the authors' innovative approach to incorporating missing patterns into the model architecture. The experiments conducted on real-world clinical datasets and synthetic datasets showcase the state-of-the-art performance of the proposed models. Furthermore, the insights provided in this paper contribute to a better understanding and utilization of missing values in time series analysis, making it a valuable resource for researchers in various domains.","label":21}
{"id":"3e2396b1-f0f8-422e-92f2-16733ce8f278","text":"This paper presents a modification of GRU-RNNs to handle missing data explicitly, allowing them to effectively incorporate missing patterns into the model architecture. By introducing two representations of missing patterns, masking and time interval, the GRU-D model not only captures long-term temporal dependencies in time series but also utilizes missing values for improved prediction results. The experimental results on real-world clinical and synthetic datasets demonstrate the state-of-the-art performance of our proposed models and the potential for better understanding and utilization of missing values in time series analysis.","label":14}
{"id":"047523e3-062e-4b8c-bf73-db0b9245222d","text":"This paper presents a modified gated RNN caled GRU-D that deals with time series which display a lot of missing values in their input. They work on two fronts. The first deals with the missing inputs directly by using a learned convex combination of the previous available value (forward imputation) and the mean value (mean imputation). The second includes dampening the recurrent layer not unlike a second reset gate, but parametrized according to the time elapsed since the last available value of each attributes.\r\n\r\nPositives\r\n------------\r\n- Clear definition of the task (handling missing values for classification of time series)\r\n- Many interesting baselines to test the new model against.\r\n- The model presented deals with the missing values in a novel, ML-type way (learn new dampening parameters).\r\n- The extensive tests done on the datasets is probably the greatest asset of this paper.\r\n\r\nNegatives\r\n-------------\r\n- The paper could use some double checking for typos.\r\n- The Section A.2.3 really belongs in the main article as it deals with important related works. Swap it with the imprecise diagrams of the model if you need space.\r\n- No mention of any methods from the statistics litterature.\r\n\r\nHere are the two main points of this review that informs my decision:\r\n\r\n1. The results, while promising, are below expectations..The results, while promising, are below expectations. The paper claims to achieve state-of-the-art performance on both real-world clinical datasets (MIMIC-III, PhysioNet) and synthetic datasets. However, upon closer inspection of the results, it is clear that the performance improvements achieved by the GRU-D model are relatively modest compared to the baselines. The improvement in classification accuracy over the best baseline model is marginal, indicating that the proposed approach may not be significantly better than existing methods for handling missing values in time series data.\r\n\r\nAdditionally, the authors do not provide a thorough analysis of the limitations of their approach. While they mention that their method may struggle with long intervals of missing data, they do not provide any empirical evidence to support this claim. Furthermore, the experiments are primarily conducted on healthcare datasets, and it is unclear how well the proposed approach would generalize to other domains.\r\n\r\nAnother concern is the lack of comparison with existing statistical methods for handling missing values in time series data. The authors claim that there is limited work in this area, but fail to acknowledge the vast literature on imputation methods in the statistics community. It would be valuable to compare the performance of the GRU-D model against some of these established statistical methods to gain a better understanding of its efficacy.\r\n\r\nLastly, the paper could benefit from a more detailed explanation of the model architecture and the motivation behind the design choices. While the authors briefly mention the use of a convex combination for imputation and the time-based dampening parameter, a more thorough explanation would enhance the reader's understanding of the model.\r\n\r\nIn conclusion, while the proposed GRU-D model presents an interesting approach for handling missing values in time series data, the results are not as compelling as claimed and lack a comprehensive evaluation. 
The paper would benefit from addressing the aforementioned issues and providing a thorough comparison with existing statistical methods. Overall, it is a valuable contribution to the field but requires further refinement to fully establish its advantages over existing approaches.","label":204}
{"id":"61501911-2021-4aee-90ed-50783b4f5106","text":"This paper proposed a way to deal with supervised multivariate time series tasks involving missing values. The high level idea is still using the recurrent neural network (specifically, GRU in this paper) to do sequence supervised learning, e.g., classification, but modifications have been made to the input and hidden layers of RNNs to tackle the missing value problem. \r\n\r\npros: \r\n1) the insight of utilizing missing value is critical. the observation of the correlation between missing values and target labels, and incorporating the missing patterns into the model architecture is a novel approach. This can potentially improve prediction performance in time series analysis. \r\n\r\ncons: \r\n1) The paper lacks thorough experimental validation. While the authors mention experiments on real-world clinical datasets and synthetic datasets, more details on the experimental setup and results would enhance the credibility of their claims. \r\n2) It would also be beneficial to compare the proposed GRU-D model with other existing methods for handling missing values in time series analysis. Overall, this paper presents an interesting approach, but further experimentation and comparative analysis are needed to fully evaluate its effectiveness.","label":71}
{"id":"4968e526-9cb7-4c85-a712-7de000baa667","text":"The authors propose a RNN-method for time-series classification with missing values, that can make use of potential information in missing values. It is based on Gated Recurrent Units (GRU), a state-of-the-art recurrent neural network. The proposed method, called GRU-D, incorporates two representations of missing patterns, namely masking and time interval, into the deep model architecture. This allows GRU-D to capture long-term temporal dependencies in time series while utilizing the missing patterns for improved prediction performance. The authors validate the effectiveness of GRU-D through extensive experiments on both real-world clinical datasets (MIMIC-III, PhysioNet) and synthetic datasets. The results demonstrate that GRU-D achieves state-of-the-art performance in time series classification tasks and provides valuable insights into the utilization of missing values in time series analysis. Overall, this paper presents a novel and promising approach for addressing missing values in multivariate time series data and contributes to advancing the field of time series analysis.","label":24}
{"id":"a33db263-5e97-48af-ad99-fbdf73b9b7f3","text":"Thank you for a very interesting work! \r\n\r\n\r\nI have a couple of questions:\r\n\r\n1) Lipton at al. (2016) achieve the best result with zero filling t it seems like your approach of incorporating missing patterns into the deep model architecture in GRU-D can potentially provide better prediction results. Have you compared GRU-D with other imputation techniques, such as zero filling or mean imputation, to evaluate its effectiveness? Additionally, it would be helpful to provide more details on how the masking and time interval representations are incorporated in GRU-D. Overall, your experiments with real-world clinical datasets and synthetic datasets demonstrate the potential of your models in achieving state-of-the-art performance. Great work!","label":24}
{"id":"81025255-0928-472a-aa0c-e6a571962414","text":"The authors propose a RNN-method for time-series classification with missing values, that can make use of potential information in missing values. It is based on a simple linear imputation of missing values with learnable parameters. Furthermore, time-intervals between missing values are computed and used to scale the RNN computation downstream..The authors have made a significant contribution to the field by developing GRU-D, a novel deep learning model for time series classification with missing values. By incorporating two representations of missing patterns (masking and time interval), GRU-D effectively captures long-term temporal dependencies in time series data while also exploiting the informative missingness. The experiments conducted on real-world clinical and synthetic datasets demonstrate the state-of-the-art performance of GRU-D, providing valuable insights into the analysis and utilization of missing values in time series analysis.","label":50}
{"id":"6400d5fb-20c1-47e0-8c29-7dee1cf09355","text":".The paper presents a novel deep learning model called GRU-D for handling missing values in multivariate time series. It incorporates two representations of missing patterns, masking and time interval, into a deep model architecture based on Gated Recurrent Units (GRU). The experiments conducted on both real-world clinical datasets and synthetic datasets demonstrate that GRU-D achieves state-of-the-art performance in time series classification tasks. The paper also provides valuable insights into the understanding and utilization of missing values in time series analysis. Overall, the paper is well-written and makes a significant contribution to the field of time series analysis with missing values.","label":0}
{"id":"03df6315-d6aa-4348-a9ea-3abf7efafa36","text":"This paper presents a modified gated RNN caled GRU-D that deals with time series which display a lot of missing values in their input. They work on two fronts. The first deals with the missing inputs directly by using a learned convex combination of the previous available value (forward imputation) and the mean value (mean imputation). The second includes dampening the recurrent layer not unlike a second reset gate, but parametrized according to the time elapsed since the last available value of each attributes.\r\n\r\nPositives\r\n------------\r\n- Clear definition of the task (handling missing values for classification of time series)\r\n- Many interesting baselines to test the new model against.\r\n- The model presented deals with the missing values in a novel, ML-type way (learn new dampening parameters).\r\n- The extensive tests done on the datasets is probably the greatest asset of this paper.\r\n\r\nNegatives\r\n-------------\r\n- The paper could use some double checking for typos.\r\n- The Section A.2.3 really belongs in the main article as it deals with important related works. Swap it with the imprecise diagrams of the model if you need space.\r\n- No mention of any methods from the statistics litterature.\r\n\r\nHere are the two main points of this review that informs my decision:\r\n\r\n1. The results, while promising, are below expectations. The paper hasn\u2019t been able to convince me that GRU-simple (without intervals) isn\u2019t just as well-suited for the task of handling missing inputs as GRU-D. In the main paper, GRU-simple is presented as the main baseline. Yet, it includes a lot of extraneous parameters (the intervals) that, according to Table 5, probably hurts the model more than it helps it. Having a third of it\u2019s parameters being of dubious value, it brings down the overall performance of GRU-D. This raises the question of whether the additional complexity introduced by the interval-based approach is necessary. The authors should provide a more detailed analysis and justification for the inclusion of intervals in the model, showcasing scenarios or datasets where GRU-D outperforms GRU-simple. Without this additional evidence, it would be difficult to understand the true value and significance of GRU-D.\r\n\r\nAdditionally, the paper could benefit from a deeper discussion of the limitations and potential drawbacks of the proposed approach. For instance, how robust is the model to different levels of missingness in the input? Are there any cases where the model may struggle? Addressing these questions would further strengthen the paper and provide a more comprehensive understanding of the model's capabilities and limitations.\r\n\r\nLastly, it would be valuable to discuss the potential implications and real-world applications of the proposed GRU-D model. How can this approach be effectively utilized in practical domains such as health care or geoscience? Providing concrete examples or case studies would help readers understand the practical implications and motivate further research in this direction.\r\n\r\nIn conclusion, the paper presents an interesting approach, GRU-D, for handling missing values in multivariate time series data. The authors have made a significant contribution by incorporating missing patterns into the deep learning model architecture. The extensive experiments conducted on real-world clinical datasets and synthetic datasets demonstrate the effectiveness of the proposed approach. 
However, to strengthen the paper, the authors should provide a more thorough analysis of the benefits and limitations of the GRU-D model, as well as discuss potential real-world applications. With these improvements, the paper has the potential to make a substantial impact in the field of time series analysis and contribute to the development of more accurate and robust models for handling missing values.","label":276}
{"id":"1b4bb117-be4f-49a8-a26c-ab583b550b2b","text":"This paper proposed a way to deal with supervised multivariate time series tasks involving missing values. The high level idea is still using the recurrent neural network (specifically, GRU in this paper) to do sequence supervised learning, e.g., classification, but modifications have been made to the input and hidden layers of RNNs to tackle the missing value problem. \r\n\r\npros: The paper's approach, called GRU-D, incorporates two representations of missing patterns, masking and time interval, into the deep learning model architecture. By doing so, it not only captures the long-term temporal dependencies in time series data, but also uses missing patterns to improve prediction results. The experiments conducted on both real-world clinical datasets (MIMIC-III, PhysioNet) and synthetic datasets demonstrate that the proposed model achieves state-of-the-art performance. This is a significant contribution, as there has been limited work in exploiting missing patterns for effective imputation and prediction performance improvement in time series analysis. The results provide valuable insights for future research in this domain.","label":59}
{"id":"2cf43e21-48ed-43c2-9f2c-4d33174910d0","text":"The authors propose a RNN-method for time-series classification with missing values, that effectively incorporates missing patterns into the deep model architecture, allowing it to capture long-term temporal dependencies and improve prediction results. The proposed method, GRU-D, utilizes two representations of missing patterns, masking and time interval, within the Gated Recurrent Units (GRU) framework. By doing so, it not only achieves state-of-the-art performance in time series classification tasks on real-world clinical datasets like MIMIC-III and PhysioNet, but also provides valuable insights for understanding and leveraging missing values in time series analysis. This paper fills a gap by addressing the challenge of informative missingness and shows promising results in this relatively unexplored area. Overall, the proposed GRU-D model is a significant contribution to the field of multivariate time series analysis with missing values.","label":12}
{"id":"56d8446f-e0ad-436d-ba76-1e54bdfe0a36","text":"Thank you for a very interesting work! \r\n\r\n\r\nI have a couple of questions:\r\n\r\n1) Lipton at al. (2016) achieve the best result with zero filling and indicators. As I understood 1) Lipton at al. (2016) achieve the best result with zero filling and indicators. As I understood, GRU-D also utilizes the missing patterns for imputation and prediction, but it would be helpful to compare the performance of GRU-D with the approach proposed by Lipton et al. Did you consider this comparison in your experiments? Could you provide more details about the experimental setup and the performance metrics used? Additionally, it would be interesting to know if GRU-D outperforms other existing imputation methods for time series data with missing values. Overall, the paper presents a promising approach for dealing with missing values in multivariate time series, and I look forward to reading more about the experimental results.","label":29}
{"id":"88e06203-9c95-4f0c-adce-bc2ec5db9d23","text":"This paper proposes a model for the task of argumentation mining (labeling the set of relationships between statements expressed as sentence-sized spans in a short text). The model combines a pointer network component that identifies links between statements and a classifier that predicts the roles of these statements. The resulting model works well: It outperforms state-of-the-art models on two separate evaluation corpora. The authors propose a modification of the Pointer Network architecture to handle the sequential nature of argument components and enforce the tree structure present in argument relations. The joint model developed in this work not only predicts links between argument components but also learns the type of argument component. This is achieved by optimizing for both tasks and incorporating a fully-connected layer prior to the recurrent neural network input. Experimental results demonstrate the significance of these components in achieving high performance. The proposed model achieves state-of-the-art results on two separate evaluation corpora, showcasing its effectiveness in uncovering the argument structure in argumentative text. However, it would be helpful to have more details on the evaluation metrics used and how the model compares to existing approaches in terms of precision, recall, and F1 score. Overall, this paper presents an innovative neural network-based approach to argumentation mining and makes significant contributions to the field.","label":55}
{"id":"84747904-221c-4c6a-a01b-70337fcca384","text":"The paper presents an interesting application of pointer networks to the argumentation mining task, and the reviewers found it generally solid. The reviewers generally found the proposed model to be effective in extracting links between argument components and classifying types of argument components. The authors justified their choice of using a Pointer Network architecture, highlighting its advantages in capturing the sequential nature and tree structure of arguments. The joint model presented in the paper, which combines component classification and link prediction, achieved state-of-the-art results on two separate evaluation corpora. The experimental results also demonstrated the importance of optimizing for both tasks and incorporating a fully-connected layer prior to the recurrent neural network input. Overall, the paper makes a valuable contribution to the field of argumentation mining and provides insights for future research directions.","label":24}
{"id":"2de1cab4-346a-4ab1-a962-68cfa9683190","text":"We would like to thank the reviewers for their comments, and specifically, reviewers #2 and #3, who acknowledge that the proposed model is interesting and works well, outperforming strong baselines. To summarize, the main contribution of our work is to propose a novel joint model based on pointer network architecture which achieves state-of-the-art results on argumentation mining task with a large gap. \r\n\r\nThere are three main concerns raised by the reviewers. The first and the main concern is the novelty of the model. We believe that in part this concern is due to a misunderstanding that occurred because we mislabeled our proposed joint model as \u201cPN\u201d in the results table (see our discussion below). We think this led some of the reviewers to believe that the paper merely pointer network model to a specific task. \r\n\r\nThe second concern is about the overall contribution of the paper to representation learning. We argue that it is precisely the joint representation learned by our model in the encoding phase, as well as the fact that our model supports separate source and target representations for a given text span, that allows us to substantially outperform standard recurrent models. \r\n\r\nThe third question raised by the reviewers concerns the meaningful comparison to other methods for recovering relations, such as stack-based models for syntactic parsing. We believe that this comparison is not appropriate in our case. As a discourse parsing task, argumentation mining requires the flexibility of recovering relations that are quite distinct from syntactic parsing, in that they allow both projective and non-projective structures, multi-root parse fragments, and components with no incoming or no outgoing links. \r\n\r\nWe give more specific responses to reviewer comments below.\r\n\r\n\u201cPointer network has been proposed before.\u201d \r\n\r\nAs is evident in Table 1, a direct application of a pointer network (PN) does not achieve state-of-the-art on link prediction. Neither is it suitable for AC type prediction. In fact, a direct application of PN performs substantially worse on link prediction than the joint model, which achieves state-of-the-art on both tasks in the persuasive essay corpus, as well as on link prediction in the microtext corpus. \r\n\r\n\u201cI am concerned, though, that the paper doesn't make a substantial novel contribution to representation learning.\u201d \r\n\r\nWe argue that the better performance of the joint PN model is purely due We argue that the better performance of the joint PN model is purely due to the joint representation learned by our model in the encoding phase. Unlike standard recurrent models, our model supports separate source and target representations for a given text span, enabling it to capture the hierarchical structure of argument components. This ability to encode and represent the argument structure is a novel contribution to representation learning in the context of argumentation mining.\r\n\r\nRegarding the comparison to stack-based models for syntactic parsing, we acknowledge that argumentation mining requires different parsing capabilities. The goal of our work is to uncover the argument structure in argumentative text, which involves recovering relations that are distinct from syntactic parsing. These relations can include non-projective structures, multi-root parse fragments, and components with no incoming or outgoing links. 
Stack-based models are designed for syntactic parsing and may not be suitable for this specific task.\r\n\r\nIn conclusion, we would like to address the concerns raised by the reviewers and emphasize the contributions of our work. Our proposed joint model based on the pointer network architecture achieves state-of-the-art results on the argumentation mining task and outperforms strong baselines by a large margin. The joint representation learned by our model in the encoding phase and the ability to capture the hierarchical structure of argument components are key factors contributing to its superior performance. We believe that these contributions make our work novel and valuable to the field of argumentation mining.\r\n\r\nWe will make the necessary revisions and additions to the paper to address the concerns and clarify any misunderstandings. We appreciate the reviewers' feedback and constructive criticism, and we believe that their comments will help improve the quality and impact of our work. Thank you.","label":391}
{"id":"fd37aee1-b56b-48ea-a9e4-0891eb8b3098","text":"This paper addresses automated argumentation mining by proposing a neural network-based approach that focuses on extracting links between argument components and classifying types of argument components. They use a modified version of a Pointer Network architecture which considers the sequential nature of argument components and enforces the tree structure of argument relations. The authors also present a joint model that simultaneously learns the type of argument component and predicts links between them. Their approach achieves state-of-the-art results on two evaluation corpora, highlighting the importance of optimizing for both tasks and adding a fully-connected layer prior to the recurrent neural network input.","label":4}
{"id":"b9c6e738-6a94-4a0f-a3d4-1bfe788453b2","text":"This paper addresses the problem of argument mining, which consists of finding argument types and predicting the relationships between the arguments. The authors proposed a pointer network structure to recover the argument relations. They also propose modifications on pointer network to perform joint training on both type and link prediction tasks. Overall the model is reasonable, but I am not sure if ICLR is the best venue for this work.\r\n\r\nMy first concern of the paper is on the novelty of the model. Pointer network has been proposed before. The proposed multi-task learning method is interesting, but the authors only verified it on one task. This makes me feel that maybe the submission is more for a NLP conference rather than an AI\/ML conference. Additionally, the authors claim state-of-the-art results on two separate evaluation corpora, but they do not provide sufficient details or comparison with existing methods to support this claim. It would be helpful to see a more thorough experimental evaluation, including comparisons with relevant baselines and an analysis of the strengths and limitations of the proposed approach. Furthermore, the paper would benefit from a more comprehensive discussion of related work in the field of argumentation mining. While the authors briefly mention some previous work on argument structure and classification, there are likely other relevant approaches and techniques that should be addressed. Finally, the writing in the paper can be improved. There are several grammatical errors and awkward phrasings that make it difficult to understand certain parts of the paper. Overall, while the proposed model has some interesting aspects, there are certain areas that need to be addressed and improved to make the paper more suitable for ICLR. I would suggest the authors focus on providing a more detailed experimental evaluation, addressing the concerns mentioned above, and revising the writing to make it more clear and concise. With these improvements, the paper could be a valuable contribution to the field of argumentation mining.","label":120}
{"id":"698f2a74-1339-4231-bc27-6601e9ef1985","text":"This paper proposes a model for the task of argumentation mining (labeling the set of relationships between statements expressed as sentence-sized spans in a short text). The model combines a pointer network component that identifies links between statements and a classifier that predicts the roles of these statements..This paper proposes a model for the task of argumentation mining, which aims to uncover the argument structure present in argumentative text by understanding the linkage between different individual components of the overall argument. The existing consensus in this field suggests that the argument components form a hierarchy of persuasion, resembling a tree structure. To address this problem, the authors propose a neural network-based approach using a modification of a Pointer Network architecture. The key advantage of this architecture is that it considers the sequential nature of argument components and enforces certain properties of the argument relations tree structure. Additionally, the authors extend the original Pointer Network model by constructing a joint model that simultaneously learns the type of argument component and predicts the links between them. The experimental results demonstrate that the proposed model achieves state-of-the-art performance on two evaluation corpora. Moreover, the findings indicate that optimizing for both tasks and incorporating a fully-connected layer prior to the recurrent neural network input are essential for obtaining high performance. Overall, this paper presents a novel and effective approach to argumentation mining, contributing to the advancement of automated analysis of argumentative texts.","label":48}
{"id":"eeec719b-b3fc-4170-94a0-e47d545040c6","text":"This paper proposes a model for the task of argumentation mining (labeling the set of relationships between statements expressed as sentence-sized spans in a short text). The model combines a pointer network component that identifies links between statements and a classifier that predicts the roles of these statements. The resulting model works well: It outperforms strong baselines, even on datasets with fewer than 100 training examples.\r\n\r\nI don't see any major technical issues with this paper, and the results are strong. I am concerned, though, that the paper doesn't make a substantial novel contribution to representation learning..However, the novelty of this work lies in its application to argumentation mining. The authors propose a modification of the Pointer Network architecture that takes into account the sequential nature of argument components and enforces the tree structure present in argument relations. By simultaneously learning the type of argument component and predicting links between components, the proposed model achieves state-of-the-art results on two evaluation corpora. Additionally, the authors demonstrate the importance of optimizing for both tasks and adding a fully-connected layer prior to the recurrent neural network input for achieving high performance. While the paper may not contribute significantly to representation learning techniques, it makes a valuable contribution to the field of argumentation mining by introducing a neural network-based approach and providing new insights into uncovering the argument structure in textual data.","label":96}
{"id":"c989168e-22bb-4401-9820-6a04254fec6e","text":"The paper presents an interesting application of pointer networks to the argumentation mining task, and the reviewers found it generally solid. The authors' approach of using a modified Pointer Network architecture to extract links between argument components and classify types of argument components is novel and shows promise. The results presented in the paper demonstrate that their proposed model achieves state-of-the-art performance on two separate evaluation corpora. The addition of a fully-connected layer prior to the recurrent neural network input is revealed to be crucial for achieving high performance, which highlights the importance of architectural modifications. Overall, the paper contributes to the field of argumentation mining by providing a neural network-based approach that effectively handles the sequential nature and hierarchical structure of argument components.","label":22}
{"id":"a21df9da-4caa-467c-b32e-a47aa471b1d1","text":"We would like to thank the reviewers for their comments, and specifically, reviewers #2 and #3, who acknowledge that the proposed model is interesting and works well, outperforming strong baselines. To summarize, the main contribution of our work is to propose a novel joint model based on pointer network architecture which achieves state-of-the-art results on argumentation mining task with a large gap. \r\n\r\nThere are three main concerns raised by the reviewers. The first and the main concern is the novelty of the model. We believe that in part this concern is due to a misunderstanding that occurred because we mislabeled our proposed joint model as \u201cPN\u201d in the results table (see our submission). We apologize for the confusion and will correct this mislabeling in the revised version of the paper. The proposed joint model, which combines a pointer network architecture with a fully-connected layer and is trained to simultaneously learn argument component types and predict argument links, is indeed a novel contribution in the field of argumentation mining. As reviewers #2 and #3 rightly point out, our model outperforms strong baselines and achieves state-of-the-art results on two separate evaluation corpora.\r\n\r\nThe second concern raised by the reviewers is the lack of clarity in describing the model architecture and training process. We acknowledge that the description of the model in the original paper may have been insufficiently detailed, leading to some confusion among the reviewers. We will address this concern in the revised version by providing a more thorough and explicit explanation of the model architecture, training procedure, and hyperparameter settings. We will also include visualizations or diagrams to aid in understanding the model's inner workings. Additionally, we will provide more comprehensive comparisons with existing approaches to clearly demonstrate the advantages and novelty of our model.\r\n\r\nLastly, reviewers #1, #2, and #3 express concerns about the evaluation metrics used in the experiments. Specifically, they question the reliance on exact match accuracy as the main evaluation metric and suggest considering alternative measures like F1 score or precision and recall. We appreciate the reviewers' feedback and agree that a single evaluation metric may not fully capture the performance of the model. In the revised version, we will broaden the evaluation by including additional metrics such as F1 score, precision, recall, and possibly others commonly used in the argumentation mining literature. This will provide a more comprehensive assessment of the proposed model's effectiveness.\r\n\r\nIn conclusion, we are grateful for the reviewers' valuable feedback and suggestions. We will carefully address all the concerns raised in the revised version of the paper, including clarifying the model architecture, providing more comprehensive comparisons, and expanding the evaluation metrics. We believe that incorporating these improvements will significantly enhance the quality and clarity of the paper. Once again, we would like to express our appreciation to the reviewers for their thoughtful comments and constructive criticism.","label":115}
{"id":"d58cbc29-369d-4f23-8104-24100bcbfb72","text":"This paper addresses automated argumentation mining by proposing a neural network-based approach to extract links between argument components and classify argument types. The paper also introduces a modified Pointer Network architecture that considers the sequential nature of argument components and enforces properties of the argument tree structure. The proposed joint model achieves state-of-the-art results on evaluation corpora, showcasing the significance of optimizing for both tasks and adding a fully-connected layer prior to the recurrent neural network input for high performance.","label":6}
{"id":"748e3f62-8a21-4108-895d-10a12c21a13a","text":"This paper addresses the problem of argument mining, which consists of finding argument types and predicting the relationships between the arguments. The authors proposed a pointer network structure to recover the argument relations. They also propose modifications on pointer network to perform joint training on both type and link prediction tasks. Overall the model provides a comprehensive approach to argumentation mining by addressing the challenges of identifying argument types and predicting the relationships between argument components. The use of a pointer network architecture is a novel and effective choice for this task, as it takes into account the sequential nature of argument components and enforces the tree structure present in argument relations. The authors further enhance the model's capabilities by constructing a joint model that simultaneously learns argument types and predicts argument links. This joint training approach contributes to the overall performance of the model and leads to state-of-the-art results on two separate evaluation corpora.\r\n\r\nThe experimental results reported in the paper demonstrate the effectiveness of the proposed model. By optimizing for both argument type classification and link prediction tasks, the model achieves high performance. Additionally, the authors find that introducing a fully-connected layer before the recurrent neural network input is crucial for achieving these results. This shows the importance of effectively capturing the hidden representations of argument components.\r\n\r\nThe contributions made by this work go beyond the development of a novel neural network-based approach for argumentation mining. The authors also provide valuable insights into the structure and hierarchy of arguments, shedding light on how different individual components are linked within the overall argument. This understanding can have implications in various domains, including legal, political, and academic discourse analysis.\r\n\r\nOverall, the paper is well-written and well-structured. The authors provide a clear motivation for their work and thoroughly explain the methodology and experimental setup. The results are carefully analyzed, and the implications and limitations of the proposed model are discussed. However, it would be helpful if the authors could provide more detailed explanations of the specific modifications made to the pointer network architecture to adapt it to the argumentation mining task. This would enhance the reproducibility of the work and facilitate future research in this domain.\r\n\r\nIn conclusion, the paper presents a significant contribution to the field of argumentation mining with its innovative approach and state-of-the-art results. The proposed model's combination of pointer network architecture and joint training strategy showcases the effectiveness of neural networks in uncovering the argument structure and relations in argumentative text. This work opens up new avenues for research and has the potential to impact multiple applications related to argument analysis and understanding.","label":54}
{"id":"82aa9ae0-4266-4faf-bdfd-5095ea845b96","text":"This paper proposes a model for the task of argumentation mining (labeling the set of relationships between statements expressed as sentence-sized spans in a short text). The model combines a pointer network component that identifies links between statements and a classifier that predicts the roles of these statements. The resulting model works well: It outperforms strong baselines, even on datasets with fewer than 100 training examples.\r\n\r\nI don't see any major technical issues with this paper, and the results show that the proposed model is effective and achieves state-of-the-art results on two separate evaluation corpora. The authors also demonstrate the importance of optimizing for both tasks, as well as adding a fully-connected layer prior to the recurrent neural network input, for achieving high performance. One potential limitation of the paper is that it does not explore the scalability of the model when trained on larger datasets. It would be interesting to see how the proposed approach performs when trained on a larger corpus of argumentative text. Additionally, it would also be valuable to see if the model can generalize well to different domains or if it is domain-specific. Overall, this paper provides a significant contribution to the field of argumentation mining by introducing a neural network-based approach that effectively identifies argument components and their relationships. The experimental results and the thorough analysis of the proposed model make this paper a valuable resource for researchers in the field of NLP and argumentation mining.","label":78}
{"id":"fafa744f-feeb-4017-8193-2b7710cc9ecd","text":"This paper addresses video captioning with a TEM-HAM architecture, where a HAM module attends over attended outputs of the TEM module when generating the description. This gives a kind of 2-level attention. The model is evaluated on the Charades and MSVD datasets.\r\n\r\n1. Quality\/Clarify: I found this paper to be poorly written and relatively hard to understand. As far as I can tell the TEM module of Section 3.1 is a straight-forward attention frame encoder of Bahdanau et al. 2015 or Xu et al. 2015. The decoder of Section 3.3 is a standard LSTM with log likelihood. The HAM module of Section 3.2 is the novel module but is not very well described. It looks to be an attention LSTM where the attention is over the TEM LSTM outputs, but the attention weights are additionally conditioned on the decoder state. There are a lot of small problems with the description, such as notational discrepancy in using \\textbf in equations and then not using it in the text. Also, I spent a long time trying to understand what f_m is..Overall, I have some concerns about the clarity and quality of the paper. The writing could be improved to better explain the proposed TEM-HAM architecture and the HAM module in particular. It seems like the TEM module is based on previous attention frame encoders, but it is not explicitly clarified in the paper. Additionally, the HAM module is not well described, and it is difficult to understand how the attention weights are conditioned on the decoder state. The use of notation is inconsistent, which can be confusing for readers. For example, the paper uses \textbf notation in equations but not in the text. There is also a lack of explanation for the variable f_m, leaving readers to speculate its meaning. These issues should be addressed to enhance the clarity and understanding of the proposed model. Moreover, I would suggest providing more details about the experimental setup, such as the hyperparameters used and the training\/validation\/test splits. This information is crucial for reproducibility and allows for a more comprehensive evaluation of the proposed architecture. Despite these concerns, the results presented in the paper are impressive, showing that the proposed model outperforms previous methods on the Charades and MSVD datasets, leading to new state-of-the-art results in video description. However, it would be helpful to provide more insights into the reasons behind the improvement in performance. To summarize, while the paper presents a promising approach for video captioning with a memory-based attention model, further improvements in writing clarity and model description, as well as more thorough experimental details, would greatly enhance the overall quality of the paper.","label":178}
{"id":"598726f6-79ad-43a7-8e0b-af80c8dd783e","text":".In this paper, the authors propose a novel memory-based attention model for video description. They highlight that existing attention models for tasks like image captioning are limited in their ability to capture complex higher-order interactions in videos. To overcome this limitation, the authors introduce a model that incorporates memories of past attention to effectively reason about where to attend to in the current time step. The proposed architecture outperforms previous methods on the MSVD and Charades datasets, setting a new state of the art in video description. Overall, this work presents a valuable contribution to the field of attention modeling for videos.","label":0}
{"id":"af930409-b013-45b9-ab38-147db726713a","text":"The authors propose a \"hierarchical\" attention model for video captioning. They introduce a model composed of three parts: the temporal modeler (TEM) that takes as input the video sequence and outputs a sequential representation of the video to the HAM; the hierarchical attention\/memory mechanism (HAM) implements a soft-attention mechanism over the sequential video representation; and finally a decoder that generates a caption. \r\n\r\nRelated to the second series of questions above, it seems as though the authors have chosen to refer to their use of an LSTM to model the temporal dynamics of the video sequence. However, it is not explicitly mentioned how the LSTM is used in the model and how it interacts with the attention mechanism. This should be clarified in the paper. Additionally, the authors mention that their proposed architecture outperforms all previously proposed methods, but the evaluation metrics and comparison with existing approaches are not provided. It would be beneficial to include a detailed analysis of the experimental results, including comparisons with baseline models and explanation of the observed performance improvements. Moreover, the MSVD and Charades datasets are mentioned as the evaluation datasets, but no information is provided about them. It is important to provide details about the datasets, such as the number of videos, the number of captions per video, and the distribution of concepts in the datasets. This information is crucial for understanding the generality and significance of the proposed approach. Overall, the paper addresses an important problem in video description\/captioning and presents a novel memory-based attention model. However, there are certain areas that need to be clarified and expanded, particularly in terms of the LSTM integration, evaluation methodology, and dataset descriptions. With these improvements, the paper has the potential to make a valuable contribution to the field.","label":87}
{"id":"95faba4e-d707-440a-aa04-e81849f8d0fe","text":"The paper proposes an attention-based approach for video description. The approach uses three LSTMs and two attention mechanisms to sequentially predict words from a sequence of frames.\r\nIn the LSTM-encoder of the frames (TEM), the first attention approach predicts a spatial attention per frame, and computes the weighted average. The second LSTM (HAM) predicts an attention over the hidden states of the encoder LSTM.\r\nThe third LSTM which run temporally in parallel to the second LSTM generates the sentence, one word at a time.\r\n\r\n\r\nStrength:\r\n===============\r\n\r\n-\tThe paper works on a relevant and interesting problem.\r\n-\tUsing 2 layers of attention in the proposed way have to my knowledge not been used before for video description. The exact architecture is thus novel (but the work claims much more without sufficient attribution, see blow)\r\n-\tThe experiments are evaluated on two datasets, MSVD and Charades, showing performance on the level of related work for MSVD and improvements for Charades.\r\n\r\nWeaknesses:\r\n===============\r\n\r\n1.\tClaims about the contribution\/novelty of the proposed method are not adequately supported. The paper claims that the proposed architecture outperforms all previously proposed methods and leads to new state-of-the-art results in video description without providing sufficient evidence or comparison to other methods. It would be beneficial for the paper to include a thorough comparison with existing state-of-the-art approaches in video description and discuss the limitations or drawbacks of those methods. This would provide a more comprehensive evaluation of the proposed model's performance and help readers understand the extent of its novelty and contributions. Additionally, the paper lacks clarity in describing the motivation behind the proposed memory-based attention model. While the introduction highlights the limitations of existing models in capturing higher-order interactions in video description, it is not clear how the proposed memory-based attention model addresses these limitations. The paper should provide a more detailed explanation of how the memory components enable the model to consider the entire sequence of video frames and reason about local attention more effectively. This would help readers better understand the theoretical foundation of the proposed method and its advantages over previous approaches. Furthermore, the paper could benefit from a more thorough discussion of the limitations and potential future directions of the proposed model. While the results on the MSVD and Charades datasets are promising, it would be valuable to explore the performance of the proposed model on additional benchmark datasets to demonstrate its generalizability. Overall, the paper presents an interesting approach to video description using memory-augmented attention modelling, but it requires further clarification and evaluation to establish its novelty and effectiveness compared to existing methods.","label":153}
{"id":"490cf57b-ba94-494a-abc7-7c9728d75623","text":"\r\nThis paper addresses video captioning with a TEM-HAM architecture, where a HAM module attends over attended outputs of the TEM module when generating the description. This gives a kind of 2-level attention. The model is evaluated on the Charades and MSVD datasets.\r\n\r\n1. Quality\/Clarify: I found this paper to be poorly written and relatively hard to understand. As far as I can tell the TEM module of Section 3.1 is a straight-forward attention frame encoder of Bahdanau et al. 2015 or Xu et al. 2015. The decoder of Section 3.3 is a standard LSTM with log likelihood..However, the novel contribution of this paper lies in the introduction of the HAM module, which utilizes memory-based attention to improve the video description task. The HAM module is designed to reason about where to attend to in the current time step by utilizing memories of past attention, similar to the central executive system in human cognition. This allows the model to effectively reason about local attention and consider the entire sequence of video frames when generating each word in the description. The authors provide a clear motivation for introducing the HAM module by highlighting the limitations of existing attention models in modeling the complex relationships between different parts of the video and the concepts being depicted.\r\n\r\nTo evaluate the proposed architecture, the authors conducted experiments on the challenging and popular MSVD and Charades datasets. The results demonstrate that the proposed model outperforms all previously proposed methods and achieves state-of-the-art performance in video description. This is a significant contribution as it shows the effectiveness of memory-augmented attention modeling for video understanding tasks.\r\n\r\nHowever, while the results are impressive, it would have been beneficial to include a more detailed analysis and discussion of the specific improvements achieved by the proposed architecture. The paper mentions that the proposed model outperforms previous methods, but it does not provide a thorough comparison of the quantitative results. Additionally, it would have been insightful to provide visualizations or examples of the generated video descriptions to showcase the qualitative improvements achieved by the proposed architecture.\r\n\r\nIn terms of the writing style and clarity, the paper can be improved. The description of the TEM module and decoder module lacks clarity and seems to be assumed knowledge. Providing a more detailed explanation of these components would be helpful for readers who are not familiar with the underlying models. The paper can also benefit from organizing the content in a more structured manner to enhance readability.\r\n\r\nIn conclusion, this paper presents a novel memory-augmented attention model for video description that addresses the limitations of existing attention models. The proposed model effectively reasons about local attention and considers the entire sequence of video frames when generating descriptions. The evaluation results show that the proposed architecture achieves state-of-the-art performance on the MSVD and Charades datasets. However, the paper would benefit from a more detailed analysis of the quantitative results and improved clarity in explaining the underlying models.","label":96}
{"id":"ef2c21a7-8049-4fc8-a61a-7b2f4b8e9ec0","text":"Really interesting paper. I have a few questions below.\r\n\r\nabout TEM:\r\nPage 4 you extracted 1 conv map of size LxD for each N frames, which means you have LxD * N features at the end, although each frame has been processed individually, the features from different frames are not related to each other. Can you explain why you chose this approach instead of considering the relationship between frames? Additionally, it would be helpful to provide some insight into the computational complexity of this operation and how it scales with the number of frames. Overall, I find the proposal of a memory-based attention model for video description intriguing. The incorporation of memories of past attention aligns well with the complexities involved in video understanding tasks. I am particularly interested in the evaluation results on the MSVD and Charades datasets, as these are widely used benchmarks in the field. It would be valuable if the paper could provide more details on the evaluation metrics used and how the proposed architecture compares to the state-of-the-art methods. Looking forward to the authors' response and a more extensive discussion on these aspects.","label":23}
{"id":"22861d92-856e-4682-8659-9b743e16f86d","text":"Dear authors, thank you for submitting the interesting paper. Good to see the incorporation of memory in the context of video description generation..Your proposed memory-based attention model for video description is an interesting approach to address the limitations of existing models. This novel concept of utilizing memories of past attention resembles the central executive system in human cognition, enabling your model to effectively reason about local attention and consider the entire sequence of video frames while generating each word. The evaluation results on the challenging MSVD and Charades datasets are impressive, as your architecture outperforms all previously proposed methods and achieves a new state-of-the-art performance in video description. This demonstrates the effectiveness of your memory-augmented attention modeling approach. However, to further enhance the quality of your paper, I recommend providing more details on the specific memory mechanism employed and how it addresses the higher-order interactions in video description tasks. Additionally, it would be helpful to discuss potential limitations of your model and how these could be addressed in future work. Overall, your paper presents a valuable contribution to the field by introducing a memory-based attention model for videos, and I look forward to seeing further developments in this area.","label":23}
{"id":"8d6c78a6-acc9-40ae-b35f-d9de4fcb5da4","text":"It seems that the cited paper proposes a novel memory-based attention model for video description, utilizing memories of past attention to reason about where to attend in the current time step. This approach allows for effective reasoning about local attention and consideration of the entire video sequence, leading to state-of-the-art results on the MSVD and Charades datasets.","label":6}
{"id":"c8459213-b3e7-45c0-bc37-ee396dabeabd","text":"This paper addresses video captioning with a TEM-HAM architecture, where a HAM module attends over attended outputs of the TEM module when generating the description. This gives a kind of 2-level attention. The model is evaluated on the Charades and MSVD datasets.\r\n\r\n1. Quality\/Clarify: I found this paper to be poorly written and relatively hard to understand. As far as I can tell the TEM module of Section 3.1 is a straight-forward attention frame encoder of Bahdanau et al. 2015 or Xu et al. 2015. The decoder of Section 3.3 is a standard LSTM with log likelihood. The HAM module of Section 3.2 is the novel module but is not adequately explained or motivated in the paper. The authors mention that the HAM module attends over the attended outputs of the TEM module, but the specific mechanism or architecture of this module is not clearly described. It would be helpful if the authors provide more details and examples to illustrate the functionality of the HAM module. Additionally, the paper lacks clarity in explaining how the proposed memory-based attention model effectively reasons about local attention and considers the entire sequence of video frames. Although the paper claims that the proposed architecture outperforms all previously proposed methods and achieves state-of-the-art results, the evaluation section fails to provide a comprehensive comparison with other models or detailed analysis of the experimental results. It would be more convincing if the authors include more experimental results and comparisons with existing approaches to demonstrate the effectiveness of their proposed approach. Overall, the paper addresses an important problem in video description\/captioning by proposing a novel memory-based attention model. However, the clarity of the paper can be improved by providing more detailed explanations of the TEM and HAM modules, as well as better illustrating the reasoning process of the memory-based attention model. Additionally, the evaluation section can be strengthened by including more comprehensive comparisons and detailed analysis of the experimental results.","label":109}
{"id":"38b11a87-7427-4c5d-919d-242c6dc3e719","text":"The authors propose a \"hierarchical\" attention model for video captioning. They introduce a model composed of three parts: the temporal modeler (TEM) that takes as input the video sequence and outputs a sequential representation of the video to the HAM; the hierarchical attention\/memory mechanism (HAM) implements a soft-attention mechanism over the sequential video representation; and finally a decoder that generates a caption. \r\n\r\nRelated to the second series of questions above, it seems as though the authors have chosen to refer to their use of an LSTM (or equivalent RNN) as the output of the Bahdanau et al (2015) attention mechanism as a hierarchical memory mechanism. I am actually sympathetic to this terminology in the sense that the recent popularity of memory-based models seems to neglect the memory implicit in the LSTM state vector, but that said, this seems to seriously misrepresent the significance fo the contribution of this paper. \r\n\r\nI appreciate the ablation study presented in Table 1. Not enough researchers bother with this kind of.I appreciate the ablation study presented in Table 1. Not enough researchers bother with this kind of analysis, and it adds credibility to the authors' claims. However, I have a few concerns. Firstly, the abstract mentions that the proposed architecture outperforms all previously proposed methods and achieves state-of-the-art results. While this sounds promising, the review lacks specific details about the comparative performance metrics and statistical significance. It would be helpful if the authors could provide more information on this aspect. Additionally, I am curious about the computational complexity of the proposed model. Given the high dimensionality and temporal nature of video data, it would be valuable to include a discussion on the efficiency and scalability of the model. Lastly, the paper could benefit from a clearer explanation of how the hierarchical attention\/memory mechanism is different from existing attention mechanisms used in video captioning. Overall, the paper presents an interesting approach and addresses an important problem in the field, but further clarification and comparative analysis would strengthen the contribution of this work.","label":167}
{"id":"8d2b039c-97c8-486b-a16f-66e909c5aa12","text":"The paper proposes an attention-based approach for video description. The approach uses three LSTMs and two attention mechanisms to sequentially predict words from a sequence of frames.\r\nIn the LSTM-encoder of the frames (TEM), the first attention approach predicts a spatial attention per frame, and computes the weighted average. The second LSTM (HAM) predicts an attention over the hidden states of the encoder LSTM.\r\nThe third LSTM which run temporally in parallel to the second LSTM generates the sentence, one word at a time.\r\n\r\n\r\nStrength:\r\n===============\r\n\r\n-\tThe paper works on a relevant and interesting problem.\r\n-\tUsing 2 layers of attention in the proposed way have to my knowledge not been used before for video description. The exact architecture is thus novel (but the work claims much more without sufficient attribution, see blow)\r\n-\tThe experiments are evaluated on two datasets, MSVD and Charades, showing performance on the level of related work for MSVD and improvements for Charades.\r\n\r\nWeaknesses:\r\n===============\r\n\r\n1.\tClaims about the contribution\/novelty of the model seem not to hold: \r\n1.1.\tOne of the main contributions is the Hierarchical Attention\/Memory (HAM):\r\n1.1.1.\tIt is not clear to me how the presented model (Eq 6-8), are significantly different from the presented model in Xu et al \/ Yao et al. While Xu et al. attends over spatial image locations and Yao et al. attend over frames, this model attends over encoded video representations h_v^i. A slight difference might be that Xu et al. use the same LSTM to generate, while this model uses an additional LSTM for the decoding.\r\n1.1.2.\tThe paper states in section 3.2 \u201cwe propose f_m to memorize the previous attention\u201d, however H_m^{t\u2019-1} only consist of the last hidden state. Furthermore, the model f_m does not have access to the \u201cattention\u201d \\alpha. This was also discussed in comments by others, but remains unclear.\r\n1.1.3.\tIn the discussion of comments the authors claim that \u201cattention not only is a function a current time step but also a function of all previous attentions and network states.\u201d: While it is true that there is a dependency but that is true also for any LSTM, however the model does not have access to the previous network states as H_g^{t\u2019-1} only consist of the last hidden state, as well as H_m^{t\u2019-1} [at least that is what the formulas say and what Figure 1 suggests]. \r\n1.1.4.\tThe authors claim to have multi-layer attention in HAM, however it remains unclear where the multi-layer comes from.\r\n1.2.\tThe paper states that in section 3.1. \u201c[CNN] features tend to discard the low level information useful in modeling the motion in the video (Ballas et al., 2016).\u201d This suggests that the approach which follows attacks this problem. However, it cannot model motion as attention \\rho between frames is not available when predicting the next frame. Also, it is not clear how the model can capture anything \u201clow level\u201d as it operates on rather high level VGG features. This claim needs to be further clarified and supported with evidence. \r\n\r\n2. The paper lacks a clear motivation for the proposed model. While the authors mention the limitations of existing attention models for video description, they do not clearly explain why a memory-based attention model is necessary or how it addresses these limitations. 
Providing a more in-depth discussion on the motivation behind the proposed model would strengthen the paper.\r\n\r\n3. The evaluation section could be further improved. The paper only presents results on two datasets, MSVD and Charades, without comparing the proposed method to other state-of-the-art methods. Including comparisons with other existing methods would help to better understand the performance of the proposed model and its contributions.\r\n\r\n4. The paper would benefit from more detailed explanations of the model architecture. While the overall structure is described, there are some missing details and ambiguities that make it difficult to fully understand the proposed approach. Providing more clarity and detail in the model description would improve the reproducibility of the work.\r\n\r\n5. The paper would benefit from a more comprehensive related work section. While some related work is briefly mentioned, a more thorough survey of existing approaches and how the proposed model differs from them would help to position the work in the broader research context.\r\n\r\nIn summary, the paper proposes a memory-augmented attention model for video description. While the idea of utilizing memories of past attention is interesting, the paper has some weaknesses that need to be addressed. Clarifying the novelty of the proposed model, providing a clearer motivation, improving the evaluation section, providing more detailed explanations of the model architecture, and enhancing the related work section would strengthen the paper and make it more compelling. Overall, the paper has the potential to contribute to the field of video description, but further revisions are needed to address the mentioned weaknesses.","label":467}
{"id":"8b31c52e-9853-458c-84eb-4caabeec62a7","text":"\r\nThis paper addresses video captioning with a TEM-HAM architecture, where a HAM module attends over attended outputs of the TEM module when generating the description. This gives a kind of 2-level attention. The model is evaluated on the Charades and MSVD datasets.\r\n\r\n1. Quality\/Clarify: I found this paper to be poorly written and relatively hard to understand. As far as I can tell the TEM module of Section 3.1 is a straight-forward attention frame encoder of Bahdanau et al. 2015 or Xu et al. 2015. The decoder of Section 3.3 is a standard LSTM with log likelihood. The HAM module of Section 3.2 is the novel module but is not very well described. It looks to be an attention LSTM where the interactions involved in problems such as video description\/captioning, where the relationship between parts of the video and the concepts being depicted is complex. The authors propose a novel memory-based attention model for video description, which utilizes memories of past attention to reason about where to attend to in the current time step. This approach is inspired by the central executive system proposed in human cognition. By incorporating memory-based attention into their model, the authors aim to improve the ability of the model to reason about local attention and consider the entire sequence of video frames while generating each word.However, I found the description of the HAM module in Section 3.2 to be lacking. It is not clearly explained how the HAM module works and how it differs from a standard attention LSTM. Providing more details and specific examples would greatly enhance the clarity of the paper.Additionally, the paper lacks sufficient comparison with existing methods in the field. While the authors claim that their proposed architecture outperforms all previously proposed methods and leads to new state-of-the-art results in video description, no quantitative analysis or comparison is provided to support this claim. Including a thorough evaluation of the proposed architecture against other state-of-the-art methods on the Charades and MSVD datasets would greatly strengthen the paper.In terms of writing style, the paper needs further editing for clarity and coherence. There are some grammatical errors and awkward phrasing throughout the document. Simplifying the language and organizing the sections more logically would improve overall readability.Overall, the idea of incorporating memory-based attention into video description is promising and has the potential to improve the performance of current models. However, the paper needs to provide more detailed explanations of the proposed architecture and perform a more comprehensive evaluation to validate its superiority over existing methods. With improvements in the clarity of the presentation and additional analysis, this work has the potential to make a significant contribution to the field of video captioning.","label":121}
{"id":"c64dd720-895e-44c7-8d53-12c1489495a8","text":"Really interesting paper. I have a few questions below.\r\n\r\nabout I find the proposed memory-based attention model for video description to be particularly compelling. The use of memories of past attention to inform the current attention mechanism seems analogous to the central executive system in human cognition, which is a promising approach. Additionally, the evaluation on the MSVD and Charades datasets shows that the proposed architecture achieves state-of-the-art results in video description. However, I would like to see more details on the specific memory mechanisms employed and how they contribute to the model's performance. Overall, this paper presents a significant contribution to the field of attention modeling in videos.","label":9}
{"id":"a63fe486-4fcb-4b09-af80-71425aa6201b","text":"Dear authors, thank you for submitting the interesting paper. Good to see the incorporation of memory in the context of video description generation. Although in general I like the paper, I have some feedback and questions. Most of my feedback corresponds to the discussion of the HAM at page 5 on the top and the experimental details.\r\n\r\n1. During training, how many video frames 0..N are fed into the training, how many video frames 0..N are fed into the model at each time step? It would be helpful to know the size and duration of the video clips used during training. Additionally, the authors mention that the proposed architecture outperforms all previously proposed methods, but there is no comparison or discussion of these methods in the paper. It would be beneficial to include a thorough analysis of the related work and provide a comparison to highlight the improvements of the proposed approach. Furthermore, the experimental details section lacks information about the hyperparameters used, such as learning rate, batch size, and optimizer. Including these details would make the experiments more reproducible. Overall, the paper shows promise in incorporating memory for video description, but more clarity and thoroughness in the experimental setup and comparison to existing methods would strengthen the contribution.","label":68}
{"id":"03042996-fc5e-4c81-b9ea-3fa1480831ac","text":"It seems that the cited paper proposes a novel memory-based attention model for video description, utilizing memories of past attention to reason about where to attend to in the current time step. The proposed architecture outperforms all previously proposed methods and achieves state-of-the-art results in video description tasks.","label":6}
{"id":"b0e343db-7340-470e-8f3b-c810175da880","text":"I reviewed the manuscript as of December 6th.\r\n\r\nThe authors perform a systematic investigation of various retraining methods for making a classification network robust to adversarial examples. The authors achieve lower error rates using their RAD and IAEC methods perform better then previously introduced distillation methods for retraining networks to be robust to adversarial examples..The authors provide a comprehensive evaluation of defensive methods for deep neural networks (DNNs) against multiple adversarial evasion models. The paper focuses on comparing different retraining strategies to improve the robustness of DNNs against adversarial examples. The authors introduce the RAD and IAEC methods, which outperform previously introduced distillation methods in terms of lowering error rates and improving network robustness. The experimental results demonstrate that the adversarial retraining framework effectively defends against most adversarial examples without introducing additional vulnerabilities or performance penalties. The authors also analyze the transferability of adversarial examples and highlight that the adversarial retraining framework can defend against black-box attacks without prior knowledge of the adversary models. Overall, the paper is well-written, presents clear research objectives, and provides valuable insights into the defense mechanisms against adversarial attacks on DNNs. However, I would suggest further discussion and analysis of the limitations and potential extensions of the proposed methods in future work.","label":54}
{"id":"876eafcd-567b-42ac-94d0-4b2c09bbd7b1","text":"The paper investigates several retraining approached based upon adversarial data. While the experimental evaluation looks reasonable, the actual contribution of this paper is quite small.. However, it is a significant step towards understanding the effectiveness of defensive strategies against various adversarial evasion models. The comparison of the adversarial retraining framework with state-of-the-art robust deep neural networks offers valuable insights into their robustness and vulnerability in terms of the distortion required to mislead the learner. The experimental results demonstrate that the adversarial retraining framework can effectively defend against most adversarial examples without compromising the original model's performance. Overall, this paper contributes to the ongoing research on the robustness of deep neural networks against adversarial attacks and provides practical implications for developing more secure machine learning models.","label":25}
{"id":"6e5cae8e-d8ee-41b3-b057-bb4e162dc544","text":"We thank all reviewers and readers for the helpful comments! We have updated the paper with the suggested revisions. In our updated paper, we provide a comprehensive evaluation of defensive methods for deep neural networks against multiple adversarial evasion models. We compare the robustness and vulnerability of various strategies, including distillation, autoencoder stacked with classifier (AEC), and our improved version, IAEC. Our experimental results demonstrate that the adversarial retraining framework outperforms these state-of-the-art techniques, defending against most adversarial examples without compromising the model's original performance. We also analyze the transferability of these defenses, showing that our proposed approach can effectively defend against black-box attacks without prior knowledge of the adversary models. We have incorporated these findings and addressed the reviewers' suggestions in the updated version of our paper.","label":17}
{"id":"4a783498-ad60-4b90-8e4a-0c772ca130ed","text":"This paper performs a series of experiments to systematically evaluate the robustness of several defense methods, including RAD, AEC and its improved version etc.. It provides interesting observations. Overall, RAD and distillation have the best performances, but none of the methods can really resist the 'additional' attack from cg or adam. Since it is an experimental paper, my main concern is about its clarity. See the comments below The paper's evaluation of defensive strategies against adversarial evasion models is both thorough and informative. The experiments conducted to assess the robustness of methods like RAD, AEC, and its improved version, provide valuable insights. Overall, the results suggest that RAD and distillation exhibit the best performances, which is a significant finding. However, it is worth noting that none of the methods proved to be completely resistant to additional attacks from cg or adam. These observations highlight the limitations of the current defense mechanisms. While the experimental nature of the paper is commendable, clarity is a concern. Some sections could benefit from further explanation or additional detail. Providing more context on the methodologies utilized would greatly enhance the comprehensibility of the findings. Additionally, demonstrating how these defensive strategies can be implemented in real-world applications would make the paper more practical and actionable. Despite these minor concerns, the paper presents compelling evidence for the effectiveness of the adversarial retraining framework in defending against adversarial examples. The experiments show consistent defense and minimal impact on the original model's performance. Overall, this paper makes a significant contribution to the field of robust deep neural networks and provides a solid foundation for future research in this area.","label":68}
{"id":"d5d04e6a-45b2-409b-8260-6c6b41226e84","text":"The paper compares several defense mechanisms against adversarial attacks: retraining, two kinds of autoencoders and distillation with the conclusion that the retraining methodology proposed by Li et al. works best of those approaches.\r\n\r\nThe paper documents a series of experiments on making models robust against adversarial examples. The methods proposed here are not all too original, RAD was proposed by Li et al, distillation was proposed in 2017 by Papernot et al., and autoencoders have been widely used for defense against adversarial examples. However, the paper makes a valuable contribution by comparing these different defensive methods against various adversarial evasion models. The experimental results demonstrate that the adversarial retraining framework is particularly effective in defending against adversarial examples while avoiding additional vulnerabilities or performance degradation. This finding is significant as it provides insights into the cross-model efficiency and robustness of different defense strategies. Overall, the paper contributes to the field of adversarial machine learning by evaluating and comparing the effectiveness of various defensive methods, ultimately highlighting the superiority of the adversarial retraining framework proposed by Li et al.","label":66}
{"id":"379149cc-f69b-4370-955d-70a7fe8e5f9d","text":"I reviewed the manuscript as of December 6th.\r\n\r\nThe authors perform a systematic investigation of various retraining methods for making a classification model more robust against adversarial attacks. The paper addresses the problem of adversaries injecting small perturbations to input data, which can significantly decrease the performance of deep neural networks. The authors compare different defensive strategies against various adversary models and analyze the cross-model efficiency for these robust learners. They conclude that the adversarial retraining framework, along with other state-of-the-art robust deep neural networks, can defend most of the adversarial examples notably and consistently without deteriorating the performance of the original model. The experimental results presented in the paper validate the effectiveness of the adversarial retraining framework and its ability to defend against adversarial attacks without requiring prior knowledge of the adversary models. However, some additional details and clarifications are needed to better understand the methodology and results. For example, it would be useful to provide a more detailed description of the dataset used for evaluation and the specific metrics used to measure the robustness and vulnerability of the different models. Overall, this paper makes a valuable contribution to the field of robust deep neural networks and provides insights into the effectiveness of different defensive methods against adversarial attacks.","label":21}
{"id":"4c63829c-60b2-4b1c-9ef3-6d74f5740f61","text":"I reviewed the manuscript as of December 6th.\r\n\r\nThe authors perform a systematic investigation of various retraining methods for making a classification network robust to adversarial examples. The authors achieve lower error rates using their RAD and IAEC methods perform better then previously introduced distillation methods for retraining networks to be robust to adversarial examples. This method suggests a promising direction for building a defense for adversarial examples.\r\n\r\nMajor Comments:\r\nI find the authors' empirical comparison and analysis of different defensive strategies against various adversary models is thorough and provides valuable insights into the cross-model efficiency of these robust learners. Additionally, their experimental results clearly demonstrate the effectiveness of the adversarial retraining framework in defending against adversarial examples without compromising the original model's performance or introducing additional vulnerabilities. The authors' approach of comparing the general adversarial retraining framework with state-of-the-art deep neural networks, such as distillation, AEC, and IAEC, adds credibility to their findings. Overall, this paper presents a comprehensive evaluation of defensive methods for DNNs against multiple adversarial evasion models and makes a significant contribution to the field of adversarial machine learning. However, I suggest the authors provide more details on the evaluation metrics used and discuss the potential limitations of their approach. Additionally, it would be beneficial to include a section discussing future research directions in this area. With these additions, the manuscript will be even stronger and more informative.","label":69}
{"id":"c1fefa3c-b248-4ce8-ad8b-ace2173454b2","text":"The paper investigates several retraining approached based upon adversarial data. While the experimental evaluation looks reasonable, the actual contribution of this paper is quite small. The approaches being evaluated, for the most part, are already proposed in the literature, with the one exception being the \"improved autoencoder stacked with classifier\" (IAEC), which is really just a slight modification of an existing approach. The paper lacks novelty and does not provide significant new insights into the field of defending against adversarial attacks. Additionally, the paper fails to provide a thorough analysis of the cross-model efficiency of the defensive strategies. Overall, while the experimental results show some promising defense against adversarial examples, the lack of original contributions and thorough analysis limits the impact of this paper.","label":55}
{"id":"d8968d94-e428-4070-8e32-c016a6318d9d","text":"We thank all reviewers and readers for the helpful comments! We have updated the paper with the following changes:\r\n\r\n1) Provide more details about the experimental settings in section 4.1.\r\n2) Add a section \"Related Work\" (section 2) and move the detailed discussion of related works in Introduction to this section for better organization of the paper.\r\n3) Clarify the definition of 'adversarial examples' in the abstract to provide more context for readers who may not be familiar with the term.\r\n4) Consider providing more explanation and analysis for the experimental results, particularly in terms of the robustness and vulnerability of the different defensive strategies.\r\nOverall, these updates will enhance the clarity and comprehensiveness of the paper. We appreciate the authors' prompt response to the reviewers' comments and believe that these revisions will greatly improve the overall quality of the paper.","label":48}
{"id":"661c1aec-38f7-4e3d-bff7-c420b250c6b7","text":"This paper performs a series of experiments to systematically evaluate the robustness of several defense methods, including RAD, AEC and its improved version etc.. It provides interesting observations. Overall, RAD and distillation have proven to be effective in defending adversarial examples with minimal impact on the original model's performance. The authors compare these defense methods with the state-of-the-art robust deep neural networks and find that the adversarial retraining framework stands out by defending most of the adversarial examples notably and consistently without adding additional vulnerabilities or performance penalties. The experiments conducted provide valuable insights into the cross-model efficiency of these defense strategies. Additionally, the paper highlights the transferability of adversarial examples in attacking black-box models based on finite queries, emphasizing the need for robust defense mechanisms. However, the review lacks a discussion on the limitations of the study and potential future directions. It would be beneficial to explore the scalability and applicability of the adversarial retraining framework to larger and more complex deep neural networks. Furthermore, an analysis of the computational costs associated with different defense methods could provide a more comprehensive understanding of their practical feasibility. Overall, this paper makes a significant contribution in evaluating defensive methods for DNNs against multiple adversarial evasion models and presents compelling evidence supporting the effectiveness of the adversarial retraining framework.","label":33}
{"id":"6cda9d7d-bea5-4efa-957d-f6ebbc79148d","text":"The paper compares several defense mechanisms against adversarial attacks: retraining, two kinds of autoencoders and distillation with the conclusion that the retraining methodology proposed by Li et al. works best of those approaches.\r\n\r\nThe paper documents a series of experiments on making models robust against adversarial examples..Specifically, the authors investigate the transferability of adversarial examples and analyze the cross-model efficiency for various robust learners. They also compare the performance of the adversarial retraining framework with state-of-the-art methods such as distillation, autoencoder stacked with classifier (AEC), and an improved version called IAEC. The experimental results demonstrate that the adversarial retraining framework is highly effective in defending against adversarial examples without introducing additional vulnerabilities or performance penalties. Overall, the paper provides valuable insights into the evaluation and comparison of defensive methods for deep neural networks against multiple adversarial evasion models.","label":46}
{"id":"3e6747c2-815b-4775-bfd3-c9405207382b","text":"I reviewed the manuscript as of December 6th.\r\n\r\nThe authors perform a systematic investigation of various retraining methods for making a classification network robust to adversarial examples. The authors achieve lower error rates using their RAD and IAEC methods perform better then previously introduced distillation methods for retraining networks to be robust to adversarial examples. This method suggests a promising This method suggests a promising direction for defending against adversarial examples in deep neural networks. The authors provide a comprehensive comparison of different defensive strategies and demonstrate the effectiveness of the adversarial retraining framework. The experimental results show that this framework successfully defends against most adversarial examples without introducing additional vulnerabilities or performance penalties. Furthermore, the authors compare their approach with state-of-the-art robust deep neural networks, such as distillation, autoencoder stacked with classifier (AEC), and their improved version, IAEC. The results indicate that the adversarial retraining framework outperforms these methods in terms of robustness and the distortion required to mislead the learner. Overall, the paper presents a well-executed empirical evaluation of defensive methods for DNNs against multiple adversarial evasion models. The findings contribute to the understanding of adversarial attacks and mitigation techniques in deep learning and have practical implications for improving the security and reliability of DNNs.","label":59}
{"id":"17fbbc03-d0d5-470a-9634-d91fd6b24ac8","text":"This paper considers the problem of decoding diverge solutions from neural sequence models. It basically adds an additional term to the log-likelihood of standard neural sequence models, and this additional term will encourage the solutions to be diverse. In addition to solve the inference, this paper uses a modified beam search.\r\n\r\nOn the plus side, there is not much work on producing diverse solutions in RNN\/LSTM models. This paper represents one of the few works on this topic. And this paper is well-written and easy to follow.\r\n\r\nThe novel of this paper is relatively small. There has been a lot of prior work on producing diverse models in the area of probailistic graphical models. Most of them introduce an additional term in the objective function to encourage diversity..However, this paper addresses the problem specifically in the context of neural sequence models, which is a significant contribution. By introducing Diverse Beam Search (DBS), the authors propose a method to decode diverse outputs by optimizing a diversity-augmented objective. This method not only improves diversity but also achieves better top 1 solutions by balancing exploration and exploitation in the search space. The authors further demonstrate the efficacy of their method across various tasks including image captioning, machine translation, conversation, and visual question generation. They compare the performance of DBS with standard beam search and previously proposed techniques for diverse decoding from neural sequence models, consistently showing that their method outperforms the others. This empirical evaluation, along with the quantitative metrics and qualitative human studies, adds credibility to the claims made in the paper. While there have been prior works on diversity in probabilistic graphical models, the adaptation to neural sequence models is non-trivial and valuable. Overall, this paper contributes to the field by providing a solution to the problem of producing diverse solutions in neural sequence models, which has important implications for capturing the inherent ambiguity in complex AI tasks.","label":126}
{"id":"ed0a0ba6-4e60-4985-b039-8be8a228c971","text":"Unfortunately, even after the reviewers adjusted their scores, this paper remains very close to the decision boundary. It presents a a novel approach to decoding diverse solutions from neural sequence models. The paper addresses the limitations of traditional beam search by introducing Diverse Beam Search (DBS), which aims to generate diverse output sequences by optimizing a diversity-augmented objective. The authors demonstrate that DBS not only improves diversity but also achieves better top 1 solutions by carefully balancing exploration and exploitation in the search space. Additionally, they showcase the effectiveness of DBS in various AI tasks, including image captioning, machine translation, conversation, and visual question generation, through both quantitative metrics and qualitative human studies. Overall, this paper presents a promising solution to the problem of capturing ambiguity and enhancing diversity in neural sequence models.","label":20}
{"id":"eb476b63-c008-4ec1-9d15-2c5b169554f3","text":".The paper 'Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models' introduces an innovative approach to address the limitation of beam search (BS) in producing sequences that lack diversity. The authors propose Diverse Beam Search (DBS), a method that aims to decode a list of diverse output sequences by optimizing a diversity-augmented objective. The evaluation of DBS shows promising results in terms of improved diversity and better top 1 solutions. It is noteworthy that these improvements are achieved without significant computational or memory overhead. The paper demonstrates the applicability of DBS in various tasks including image captioning, machine translation, conversation, and visual question generation. Both quantitative metrics and qualitative human studies consistently demonstrate the superiority of DBS over BS and other diverse decoding techniques. Overall, this paper presents a valuable contribution to the field of neural sequence models by enhancing the quality and diversity of decoded output sequences through DBS.","label":0}
{"id":"9818ab72-5f7d-42b2-93ca-ed044e4c707e","text":"\r\nThe paper addresses an important problem - namely on how to improve diversity in responses. It is applaudable that the authors show results on several tasks showing the applicability across different problems. \r\n\r\nIn my view there are two weaknesses at this point\r\n\r\n1) the improvements (for essentially all tasks) the improvements (for essentially all tasks) are not substantial enough to justify the introduction of a new decoding algorithm. While the authors claim that DBS finds better top 1 solutions, the concrete evidence and comparison to existing techniques are lacking. Additionally, it would be helpful to see a more detailed analysis of the computational and memory overhead of DBS compared to beam search. Overall, the paper has potential but needs further experimental validation and analysis.","label":48}
{"id":"7298f5b9-cabc-4973-9af0-bb657affcde0","text":"\r\n\r\n[ Summary ]\r\n\r\nThis paper presents a new modified beam search algorithm that promotes diverse beam candidates. It is a well known problem \u2014with both RNNs and also non-neural language models\u2014 that beam search tends to generate beam candidates that are very similar with each other, which can cause two separate but related problems: (1) search error: beam search may not be able to discover a globally optimal solution as they can easily fall out of the beam early on, (2) simple, common, non-diverse output: the resulting output text tends to be generic and common.\r\n\r\nThis paper aims to address the second problem (2) by modifying the search objective function itself so that there is a distinct term that scores diversity among the beam candidates. In other words, the goal of the presented algorithm is not to reduce the search error of the original objective function. In contrast, stack decoding and future cost estimation, common practices in this field, tend to focus on reducing search error. Instead, the proposed Diverse Beam Search algorithm introduces a diversity-augmented objective function that aims to generate a diverse set of output sequences. The paper provides a detailed explanation of how the diversity term is incorporated into the objective function and how it encourages the exploration of different search paths. Experimental results on various tasks such as image captioning, machine translation, conversation, and visual question generation demonstrate the effectiveness of the proposed method. The authors show that Diverse Beam Search not only improves the diversity of generated outputs but also outperforms the traditional Beam Search algorithm and other diverse decoding techniques. The results are evaluated using both standard quantitative metrics and qualitative human studies, which further validate the superiority of the proposed approach. Overall, this paper addresses an important limitation of Beam Search and presents a novel and effective solution to the problem of generating diverse solutions from neural sequence models.","label":155}
{"id":"12f4d02a-a09a-4ff8-8e4d-ea7e699fe0b2","text":"This paper considers the problem of decoding diverge solutions from neural sequence models. It basically adds an additional term to the log-likelihood of standard neural sequence models, and this additional term will encourage the solutions to be diverse. In addition to solve the inference, this paper uses a modified beam search.\r\n\r\nOn the plus side, there is not much work on producing diverse solutions in RNN\/LSTM models. This paper represents one of the few works on this topic. And this paper is well-written and easy to follow.\r\n\r\nThe novel of this paper is relatively small. There has been a lot of prior work on producing diverse models in the area of probailistic graphical models. Most of them introduce an additional term in the objective function to encourage diversity..However, the proposed Diverse Beam Search (DBS) in this paper applies a similar concept to neural sequence models, which is a novel contribution within this specific domain. While it is true that there have been prior works on producing diverse models in the area of probabilistic graphical models, the application of these techniques to neural sequence models is nontrivial and requires careful consideration. This paper presents a comprehensive analysis of the challenges and proposes an effective solution in the form of DBS. The authors demonstrate the superiority of DBS over traditional beam search by achieving not only improved diversity in the output sequences but also better top 1 solutions. This is a significant finding as it shows that DBS strikes a balance between exploration and exploitation of the search space. Furthermore, the experimental evaluation conducted by the authors on various tasks such as image captioning, machine translation, conversation, and visual question generation highlights the broad applicability of DBS. The results obtained using both quantitative metrics and qualitative human studies consistently show that DBS outperforms beam search and previously proposed techniques for diverse decoding in neural sequence models. Overall, this paper makes a valuable contribution to the field by introducing and exploring the concept of DBS, which has the potential to enhance the performance of neural sequence models in various domains.","label":126}
{"id":"0e8e7be5-46b0-4445-bdd7-e5cac2202b21","text":"This paper considers the problem of decoding diverge solutions from neural sequence models. It basically adds an additional term to the log-likelihood of standard neural sequence models, and this additional term will encourage the solutions to be diverse. In addition to solve the inference, this paper uses a modified beam search.\r\n\r\nOn the plus side, there is not much work on producing diverse solutions in RNN\/LSTM models. This paper represents one of the few works on this topic. And this paper is well-written and easy to follow.\r\n\r\nThe novel of this paper is relatively small. There has been a lot of prior work on producing diverse models in the area of probailistic graphical models. Most of them introduce an additional term in the objective function to encourage diversity. From that perspective, the solution proposed in this paper is not that different from previous work. Of course, one can argue that most previous work focues on probabilistic graphical models, while this paper focuses on neural sequence models, which is a different domain. However, it would have been beneficial for the authors to provide a more comprehensive discussion and comparison with existing methods in the area of probabilistic graphical models. Additionally, the paper claims that the proposed Diverse Beam Search (DBS) not only improves diversity but also finds better top 1 solutions by controlling the exploration and exploitation of the search space. While this claim is supported by experimental results on image captioning, machine translation, conversation, and visual question generation tasks, more analysis and insights into the underlying reasons for this improvement would have strengthened the paper. Furthermore, it would have been interesting to see how DBS compares to other recently proposed techniques for diverse decoding from neural sequence models. Overall, Diverse Beam Search is a valuable contribution to the field, addressing the inherent ambiguity in complex AI tasks and improving the diversity of output sequences from neural sequence models. However, further exploration and comparison with existing methods are needed to fully understand its strengths and limitations.","label":157}
{"id":"93e52e7d-4653-4c77-afd8-1864d6a59759","text":"Unfortunately, even after the reviewers adjusted their scores, this paper remains very close to the decision boundary..It would have been helpful if the reviewers provided more specific feedback on why the paper remains close to the decision boundary. It is important to understand the aspects of the paper that are lacking or could be improved upon to make a more informed decision. Additionally, it would be valuable to have a clear comparison with other similar approaches in terms of computational efficiency and memory usage. Overall, more details and clarity would greatly enhance the review process.","label":17}
{"id":"65e5b5bf-a750-4bff-b880-1ef0bd66b4e4","text":"Following suggestions from Reviewer 2, we have revised our submission to include comparison comparison of Diverse Beam Search (DBS) with other techniques for diverse decoding from neural sequence models. Our results show that DBS consistently outperforms Beam Search (BS) and previously proposed methods in terms of diversity, finding better top 1 solutions, and also achieving these improvements with minimal computational or memory overhead compared to BS. Moreover, we have demonstrated the broad applicability of DBS by presenting results on image captioning, machine translation, conversation, and visual question generation using both standard quantitative metrics and qualitative human studies. These experiments further support the effectiveness and superiority of DBS over existing techniques for decoding diverse outputs from neural sequence models.","label":13}
{"id":"9225c8db-5859-418e-bedc-d2c009a515e9","text":"\r\nThe paper addresses an important problem - namely on how to improve diversity in responses..The authors propose a solution called Diverse Beam Search (DBS) as an alternative to traditional beam search (BS) in neural sequence models. DBS optimizes a diversity-augmented objective to decode a list of diverse outputs, which not only improves diversity but also identifies better top 1 solutions by effectively exploring and exploiting the search space. Additionally, the authors show that DBS achieves these gains with minimal computational or memory overhead compared to BS. They present results on various tasks such as image captioning, machine translation, conversation, and visual question generation, demonstrating the wide applicability and consistent improvement of DBS over existing techniques for diverse decoding from neural sequence models.","label":15}
{"id":"a4d951fb-e51e-47ae-9956-e7711c1ce607","text":"\r\n\r\n[ Summary ]\r\n\r\nThis paper presents a new modified beam search algorithm that promotes diverse beam candidates. It is a well known problem \u2014with both RNNs and also non-neural language models\u2014 that beam search tends to generate beam candidates that are very similar with each other, which can cause two separate but related problems: (1) search error: beam search may not be able to discover a globally optimal solution as they can easily fall out of the beam early on, (2) simple, common, non-diverse output: the resulting output text tends to be generic and common.\r\n\r\nThis paper aims to address the second problem (2) by proposing a novel approach called Diverse Beam Search (DBS). The authors argue that by optimizing a diversity-augmented objective, DBS is able to decode a list of diverse outputs while still maintaining high-quality solutions. The method achieves this by exploring the search space in a more balanced manner, allowing for both exploration and exploitation. The authors demonstrate the effectiveness of DBS on various tasks including image captioning, machine translation, conversation, and visual question generation. They compare DBS with traditional beam search and show that DBS consistently outperforms it in terms of diversity and quality of solutions. The results are evaluated using both standard quantitative metrics and qualitative human studies. Additionally, the authors highlight that DBS achieves these improvements with minimal computational and memory overhead compared to beam search. This makes DBS a practical and efficient solution for decoding diverse outputs from neural sequence models. Overall, this paper provides a valuable contribution to the field of natural language processing by addressing the lack of diversity in beam search and presenting an effective alternative. The method has been thoroughly evaluated and the results demonstrate its superiority over existing techniques. The paper could benefit from further discussions on potential limitations and future directions for research.","label":102}
{"id":"b59645c7-4d83-4aac-b303-d156146dd4db","text":"This paper considers the problem of decoding diverge solutions from neural sequence models. It basically adds an additional term to the log-likelihood of standard neural sequence models, and this additional term will encourage the solutions to be diverse. In addition to solve the inference, this paper uses a modified beam search.\r\n\r\nOn the plus side, there is not much work on producing diverse solutions in RNN\/LSTM models. This paper represents one of the few works on this topic. And this paper is well-written and easy to follow.\r\n\r\nThe novel of this paper is relatively small. There has been a lot of prior work on producing diverse models in the area of probailistic graphical models. Most of them introduce an additional term in the objective function to encourage diversity. From that perspective, the solution proposed in this paper is not that different from previous work..However, this paper makes an important contribution by addressing the problem of decoding diverse solutions specifically in the context of neural sequence models. The authors recognize that while there may be existing techniques for producing diversity in probailistic graphical models, applying these methods directly to neural sequence models may not be straightforward due to the unique characteristics of these models. By proposing Diverse Beam Search (DBS), the authors adapt the beam search algorithm to optimize a diversity-augmented objective. This approach allows for the exploration of a wider range of candidate solutions and significantly improves the diversity of the generated sequences. Moreover, the authors show that DBS not only enhances diversity but also improves the quality of the top 1 solutions. The experimental results on image captioning, machine translation, conversation, and visual question generation tasks demonstrate the effectiveness and broad applicability of the proposed method. Overall, while the idea of incorporating diversity into sequence models may not be entirely new, the specific formulation and modification of beam search for achieving diversity in the context of neural sequence models makes this paper a valuable contribution to the field.","label":142}
{"id":"c0f95e15-7f40-4ee8-9afc-ac73bcf44b92","text":"The proposed method is simple and elegant; it builds upon the huge success of gradient based optimization for deep non-linear function approximators and combines it with established (linear) many-view CCA methods. A major contribution of this paper is the derivation of the gradients with respect to the non-linear encoding networks which project the different views into a common space. The derivation seems correct. In general this approach seems very interesting and I could imagine that it might be applicable to many other similarly structured problems.\r\nThe paper is well-written and the authors effectively communicate the motivation, methodology, and results of their work. The introduction provides a clear overview of the problem and the limitations of existing approaches. The authors explain how DGCCA fills a gap in the literature by combining the power of non-linear representation learning with the ability to incorporate information from multiple independent sources. The derivation of the gradients with respect to the non-linear encoding networks is well-explained and adds to the overall clarity of the paper. The experimental section is also well-executed, with the authors applying DGCCA to two distinct datasets and evaluating its performance on three different downstream tasks. The results clearly demonstrate the superiority of DGCCA over existing methods for phonetic transcription and hashtag recommendation. It is worth noting that DGCCA performs no worse than standard linear many-view techniques, thus highlighting its versatility and effectiveness. Overall, this paper makes a significant contribution to the field of multi-view representation learning and deep non-linear transformation. The combination of deep learning with CCA is a powerful approach that has the potential to be applied to a wide range of problems. I would recommend this paper for publication in its current form.","label":87}
{"id":"181c7e56-f9ea-49e6-969e-e2468953b713","text":"This is largely a clearly written paper that proposes a nonlinear generalization of a generalized CCA approach for multi-view learning. In terms In terms of novelty, the paper presents a significant advance in the field as it combines the flexibility of deep representation learning with the statistical power of incorporating information from multiple independent sources. The authors provide a clear and concise explanation of the DGCCA formulation and an efficient stochastic optimization algorithm for solving it. The empirical evaluation on two distinct datasets for three different tasks demonstrates the superiority of DGCCA over existing methods in phonetic transcription and hashtag recommendation. Moreover, the results show that DGCCA performs at least as well as standard linear many-view techniques. Overall, this paper makes a valuable and rigorous contribution to the field of multi-view representation learning.","label":22}
{"id":"60bf867b-0e22-466a-a6f0-1605ce5df605","text":"Thank you for your helpful comments. We just uploaded a revised draft, incorporating the reviewers' suggestions, and hopefully addressing many of their concerns. Below are the things to note:\r\n\r\n- The linear GCCA solution for G and U is included in Appendix A, along with a full gradient derivation: \"... the rows of G are the top r (orthonormal) eigenvectors of M, and $U_j = C_{jj}^{\u22121} Y_j G^T$\" (reviewer 2)\r\n\r\n- In the last paragraph of the Optimization subsection (page 4), we include big-Oh notation for the gradient update time complexity. We leverage the GCCA solution presented in [R1] to scale DGCCA to large datasets. (reviewer 2)\r\n\r\n- We qualify the pronouncement of being \"the first nonlinear multiview learning technique\" with the adjective \"CCA-style\". Although our work focuses on extending CCA-based multiview methods, We qualify the pronouncement of being \"the first nonlinear multiview learning technique\" with the adjective \"CCA-style\". Although our work focuses on extending CCA-based multiview methods, we acknowledge that there have been other nonlinear multiview techniques developed in the past. However, our contribution lies in combining the flexibility of nonlinear (deep) representation learning with the statistical power of incorporating information from many independent sources or views. This unique combination sets DGCCA apart from existing methods and makes it a valuable tool for various applications.\r\n\r\nWe would also like to address the concerns raised by reviewer 1 regarding the computational complexity of DGCCA. While it is true that optimizing deep neural networks can be computationally expensive, we have implemented an efficient stochastic optimization algorithm for solving the DGCCA formulation. The algorithm leverages the GCCA solution presented in [R1] to scale DGCCA to large datasets, thus mitigating the computational challenges. Our experiments on two distinct datasets for three downstream tasks demonstrate the feasibility and effectiveness of DGCCA in real-world scenarios.\r\n\r\nFurthermore, we appreciate the feedback from reviewer 2 regarding the inclusion of the linear GCCA solution for G and U in Appendix A. We have now included the derivation and formulation in the appendix to provide a comprehensive understanding of the methodology.\r\n\r\nIn summary, we believe that the revised draft addresses the concerns raised by the reviewers and provides a more complete and accurate account of our research. We are confident in the novelty, effectiveness, and relevance of our proposed method, and we look forward to further discussions and insights from the reviewers and the scientific community.\r\n","label":134}
{"id":"7473a95a-ba0b-4715-adf2-d8606a506fb6","text":"The authors propose a method that extends the non-linear two-view representation learning methods, and the linear multiview techniques, and combines information from multiple sources into a new non-linear representation learning techniques. \r\n\r\nIn general, the method is well described and seems to lead to benefits in different experiments of phonetic transcription of hashtag recommendation. Even if the method is mostly a extension of classical tools (the scheme learns a (deep) network for each view essentially), the combination of the different sources of information seems to be effective for the studied datasets. \r\n\r\nIt would It would be helpful if the authors provided more details on the architecture and hyperparameters of the deep networks used in the DGCCA formulation. Additionally, it would be beneficial to include more information on how the stochastic optimization algorithm was implemented and its convergence properties. While the paper claims that DGCCA outperforms existing methods in phonetic transcription and hashtag recommendation tasks, it would be useful to have a more thorough comparison with these methods in terms of computational efficiency and interpretability of the learned representations. Furthermore, it would be interesting to explore the performance of DGCCA on larger and more diverse datasets to assess its scalability and generalizability. Overall, the paper presents a promising method for nonlinear multiview representation learning, but further experimental validation and comparison with existing techniques would strengthen the conclusions and insights provided.","label":92}
{"id":"18b0682f-35c4-4e17-8c69-b20ba8df612e","text":"This paper proposes a deep extension of generalized CCA. The main contribution of the paper is deriving the gradient update for the GCCA objective.\r\n\r\nI disagree with the claim that \u201cthis is the first Multiview representation learning technique that combines the flexibility of nonlinear (deep) representation learning with the statistical power of incorporating information from many independent sources, or views. The authors present the DGCCA formulation, along with an efficient stochastic optimization algorithm for solving it. They also demonstrate the effectiveness of DGCCA on two distinct datasets for three downstream tasks: phonetic transcription from acoustic and articulatory measurements, and recommending hashtags and friends on a dataset of Twitter users. The results show that DGCCA representations outperform existing methods in phonetic transcription and hashtag recommendation, while performing similarly to standard linear many-view techniques in general. However, I disagree with the claim made in the abstract that DGCCA is the first multiview representation learning technique to combine the flexibility of nonlinear representation learning with the incorporation of information from multiple views. Overall, this paper presents a novel deep extension of generalized CCA and provides valuable insights into the capabilities and limitations of DGCCA.","label":41}
{"id":"6f94a5f3-3aa2-4012-b1fa-00809414d9ee","text":"The proposed method is simple and elegant; it builds upon the huge success of gradient based optimization for deep non-linear function approximators and combines it with established (linear) many-view CCA methods. A major contribution of this paper is the derivation of the gradients with respect to the non-linear encoding . This allows for efficient learning of the deep generalized canonical correlation analysis (DGCCA) representations. By incorporating information from many independent sources or views, DGCCA provides a powerful and flexible approach for multiview representation learning. The authors demonstrate the effectiveness of DGCCA on two distinct datasets for three downstream tasks: phonetic transcription from acoustic and articulatory measurements, and recommending hashtags and friends on a dataset of Twitter users. The experimental results show that DGCCA representations outperform existing methods for phonetic transcription and hashtag recommendation, while also performing on par with standard linear many-view techniques. This highlights the potential of DGCCA in various domains where multiple views of data are available. Overall, this paper presents a novel and efficient approach for nonlinear multiview representation learning by combining deep neural networks with generalized canonical correlation analysis. The method is well-motivated, and the experimental results provide strong evidence of its effectiveness. The paper is well-written, with clear explanations of the method and thorough evaluations. It makes a significant contribution to the field and is suitable for publication in a peer-reviewed journal or conference proceedings.","label":49}
{"id":"200e0f06-900d-40b9-9226-4a8ed9dd5d55","text":"The proposed method is simple and elegant; it builds upon the huge success of gradient based optimization for deep non-linear function approximators and combines it with established (linear) many-view CCA methods. A major contribution of this paper is the derivation of the gradients with respect to the non-linear encoding networks which project the different views into a common space. The derivation seems correct. In general this approach seems very interesting and I could imagine that it might be applicable to many other similarly structured problems.\r\nThe paper is well written; but it could be enhanced with an explicit description of the complete algorithm which also highlights how the joint embeddings G and U are updated. \r\n \r\nI don\u2019t have prior experience with CCA-style many-view techniques and I don\u2019t have prior experience with CCA-style many-view techniques and therefore cannot fully assess the novelty of DGCCA in comparison to other existing methods. However, I believe the experimental results presented in the paper are impressive and demonstrate the effectiveness of the DGCCA approach. The authors compare DGCCA with existing methods on two different datasets and show that it outperforms the alternatives in phonetic transcription and hashtag recommendation tasks. The fact that DGCCA performs no worse than standard linear many-view techniques suggests that it can provide valuable insights even in cases where linearity assumptions hold. Overall, I find the paper to be well-written and the proposed method to be interesting and promising. I would recommend the authors to provide some insights into the computational complexity of DGCCA and discuss any possible limitations or assumptions made by the method. Moreover, it would be helpful if the authors discuss potential applications of DGCCA beyond the presented tasks. Overall, I think this paper makes a significant contribution to the field of multi-view representation learning and provides a solid foundation for future research.","label":125}
{"id":"b0f4b7bb-0849-4d68-8020-9f2739b6022e","text":"This is largely a clearly written paper that proposes a nonlinear generalization of a generalized CCA approach for multi-view learning. In terms of technical novelty, the generalization follows rather straightforwardly. Reviewers have expressed the need to clarify relationship and provide comparisons between DGCCA and other existing methods for multi-view learning. The paper should further elaborate on why DGCCA outperforms existing methods in phonetic transcription and hashtag recommendation tasks. Additionally, it would be beneficial to provide more insights on the efficiency of the stochastic optimization algorithm proposed for solving the DGCCA formulation. Overall, further clarification and comparisons would enhance the technical novelty and contribution of the paper.","label":41}
{"id":"ed9d5aef-06ab-4912-aae7-b7c47e7a5e10","text":"Thank you for your helpful comments. We just uploaded a revised draft, incorporating the reviewers' suggestions, and hopefully addressing many of their concerns. Below are the things to note:\r\n\r\n- The linear GCCA solution for G and U is included in Appendix A, along with a full gradient derivation: \"... the rows of G are the top r (orthonormal) eigenvectors of M, and $U_j = C_{jj}^{\u22121} Y_j G^T$\" (reviewer 2)\r\n\r\n- In the last paragraph of the Optimization subsection (page 4), we include big-Oh notation for the gradient update time complexity. We leverage the GCCA solution presented in [R1] to scale DGCCA to large datasets. (reviewer 2)\r\n\r\n- We qualify the pronouncement of being \"the first nonlinear multiview learning technique\" with.We qualify the pronouncement of being \"the first nonlinear multiview learning technique\" with the acknowledgement that there exist methods for nonlinear two-view representation learning (Deep CCA). However, DGCCA is the first CCA-style multiview representation learning technique that combines the flexibility of nonlinear (deep) representation learning with the statistical power of incorporating information from many independent sources, or views. We apologize for any confusion caused by our initial statement and have made the necessary clarification in the revised draft. Thank you for bringing this to our attention and for your valuable feedback.","label":121}
{"id":"dfbe6ad6-ec93-424d-879c-f25d04ee8486","text":"The authors propose a method that extends the non-linear two-view representation learning methods, and the linear multiview techniques, and combines information from multiple sources into a new non-linear representation learning techniques. \r\n\r\nIn general, the method is well described and seems to lead to benefits in different experiments of phonetic transcription of hashtag recommendation..However, there are a few areas that can be improved upon. Firstly, while the method is well described, there could be more clarity in explaining the specific architecture and implementation details of the Deep Generalized Canonical Correlation Analysis (DGCCA) model. Further elaboration on the motivation behind choosing the particular optimization algorithm used for solving the DGCCA formulation would also be helpful. Additionally, it would be valuable to see a more comprehensive comparison with existing methods for phonetic transcription and hashtag recommendation. The paper mentions that DGCCA outperforms existing methods in these tasks, but the extent of the improvement and the statistical significance of the results are not provided. Furthermore, it would be interesting to understand the interpretability of the DGCCA representations and how they capture the underlying relationships between the views of the data. Overall, this paper presents a promising approach for non-linear multiview representation learning with the inclusion of information from multiple sources. With some further improvements and additional experimental evaluation, DGCCA has the potential to become a valuable tool in various domains that involve analyzing data from diverse perspectives.","label":53}
{"id":"c0a298f9-a781-4f47-a7e3-acf42e282fdb","text":".The paper presents an interesting method called Deep Generalized Canonical Correlation Analysis (DGCCA) for learning nonlinear transformations of multiple views of data. The authors highlight that DGCCA is the first CCA-style multiview representation learning technique that combines the flexibility of deep nonlinear representation learning with the statistical power of incorporating information from many independent sources. They provide a clear formulation of DGCCA and propose an efficient stochastic optimization algorithm for solving it. The authors conduct experiments on two distinct datasets and evaluate DGCCA on three downstream tasks, namely phonetic transcription from acoustic and articulatory measurements, and recommending hashtags and friends on a dataset of Twitter users. The results demonstrate that DGCCA outperforms existing methods in terms of phonetic transcription and hashtag recommendation, while performing no worse than standard linear many-view techniques in general. The findings highlight the effectiveness of DGCCA in maximizing the informativeness of nonlinear transformations of multiple views. Overall, the paper offers valuable insights into the benefits of incorporating deep nonlinear representation learning in multiview representation learning, and it presents a promising technique for various applications.","label":0}
{"id":"2708a13c-af5f-49f1-8e3a-b9006c243717","text":"The proposed method is simple and elegant; it builds upon the huge success of gradient based optimization for deep non-linear function approximators and combines it with established (linear) many-view CCA methods. A major contribution of this paper is the derivation of the gradients with respect to the non-linear encoding networks which project the different views into a common space. The derivation seems correct..\r\n\r\nFurthermore, the authors provide an efficient stochastic optimization algorithm for solving the DGCCA formulation. This is crucial for practical applications where datasets could be large and computationally expensive to handle. By leveraging the power of deep learning techniques and incorporating information from multiple independent sources, DGCCA demonstrates its superiority in various downstream tasks.\r\n\r\nThe authors validate the performance of DGCCA by conducting experiments on two distinct datasets. The first dataset involves phonetic transcription from acoustic and articulatory measurements, while the second dataset focuses on recommending hashtags and friends for Twitter users. In both cases, DGCCA outperforms existing methods for phonetic transcription and hashtag recommendation. It is worth noting that DGCCA also performs no worse than standard linear many-view techniques, which shows that it can handle linear scenarios effectively as well.\r\n\r\nOverall, the paper presents a significant contribution in deep multiview representation learning. The methodology is sound and the experiments are representative. There is certainly a clear need in the field for such a technique that combines non-linear representation learning with the statistical power of incorporating information from multiple independent sources. The paper is well-written and provides sufficient details for reproducibility. I would recommend this paper for publication after minor revisions and clarification on a few technical points.","label":63}
{"id":"eea0601e-4aab-4bfb-a89c-424ded51e6ce","text":"In this paper, the authors explicitly design geometrical structure into a CNN by combining it with a Scattering network. This aids stability and limited-data performance. The paper is well written, the contribution of combining Scattering and CNNs is novel and the results seem promising. I feel that such work was a missing piece in the Scattering literature to make it useful for practical applications.\r\n\r\nI wish the authors would have investigated the effect of the stable bottom layers with respect to adversarial examples. This can be done in a relatively straightforward way with software like cleverhans [1] or deep fool [2]..Overall, I find the idea of combining scattering and CNNs to be a valuable contribution to the field. The authors demonstrate the benefits of incorporating stable bottom layers in their hybrid network, which provides increased performance on small datasets and guarantees stability to geometric transformations and noise. The paper is well-written and the results are promising, showing that deep hybrid scattering networks perform as well as deep networks learned from scratch on standard benchmarks. The authors also analyze the third layer of their architecture and show that the learned operators explicitly build invariances to geometrical variabilities such as local rotation and translation. It would have been interesting if the authors had investigated the effect of the stable bottom layers on adversarial examples, as this could provide further insights into the robustness of their approach. Overall, I believe this paper makes a valuable contribution to the field and I look forward to seeing further research and applications of hybrid scattering networks.","label":100}
{"id":"352b9311-907f-4fd6-a030-fdd15de098ad","text":"The program committee appreciates the authors' response to concerns raised in the reviews. The paper presents an interesting approach of combining scattering and convolutional neural networks to achieve state-of-the-art results on standard benchmarks. The proposed hybrid architecture shows promising results in terms of generalization on small datasets and providing invariances to geometric variabilities. The release of the ScatWave software for fast computations is also commendable. However, it would be beneficial if the authors could provide more detailed experimental results and analysis to support their claims. Overall, this paper makes a valuable contribution and we recommend its acceptance.","label":13}
{"id":"fd51092f-ac47-4938-aae6-eaacead86bbc","text":"Dear reviewers,\r\n\r\nHere are two elements that I have added thanks to your constructive and helpful suggestions:\r\n\r\n- I have added a note in the Appendix B that quantifies precisely the additive perturbations of a Deep network. It proves the unstability of the hybrid deepnetwork is always smaller than the unstability of the cascaded deepnetwork and discusses the equality case. Besides, no straightforward softwares were available in Lua. Since I have also made several revisions to clarify and improve the overall flow of the paper. In response to reviewer comments, I have added a more detailed explanation of the scattering transform and its advantages over traditional deep networks. Additionally, I have included an analysis of the third layer of our hybrid architecture, demonstrating how it explicitly builds invariances to geometrical variabilities. These findings provide further evidence of the effectiveness and robustness of our approach. Furthermore, I have released a new software, ScatWave, which integrates the scattering network in Torch and utilizes GPUs for faster computations. This software is freely available for the research community to use and evaluate. Finally, to validate our model, we have conducted experiments on popular benchmark datasets including CIFAR10, CIFAR100, and STL10, and have achieved nearly state-of-the-art results. Overall, I believe that the results of our study support the effectiveness and viability of hybrid scattering networks in achieving strong performance on small datasets while maintaining stability to geometric transformations and noise. Thank you for your valuable feedback, and I look forward to addressing any further questions or concerns you may have.","label":67}
{"id":"da41f1b1-4cea-4694-9265-54268ae9d722","text":"In this paper, the authors explicitly design geometrical structure into a CNN by combining it with a Scattering network. This aids stability and limited-data performance. The paper is well written, the contribution of combining Scattering and CNNs is novel and the results seem promising. I feel that such work was a missing piece in the Scattering literature to make it useful for practical applications.\r\n\r\nI wish the authors would have investigated the effect of the stable bottom layers with respect to adversarial examples. This can be done in a relatively straightforward way with software like cleverhans [1] or deep fool [2]. It would be very interesting if the first layer's stability in the hybrid architectures increases robustness significantly, as this would provide further evidence that combining Scattering and CNNs can lead to improved performance in challenging scenarios. Additionally, it would have been helpful if the authors had compared their proposed hybrid network with other state-of-the-art approaches, such as fully supervised deep networks or transfer learning methods. This would give a clearer understanding of the relative strengths and weaknesses of their approach. However, overall, the paper presents a valuable contribution to the field and the authors should be commended for their innovative work. With some additional experiments and comparisons, this research has the potential to greatly impact the practical applications of Scattering networks. I look forward to seeing future work expanding on these findings and exploring the potential of hybrid architectures in other domains as well.","label":120}
{"id":"6d394f3f-8ce9-4c6b-ad3b-6a157db57afa","text":"The paper investigates a hybrid network consisting of a scattering network followed by a convolutional network. By using scattering layers, the number of parameters is reduced, and the first layers are guaranteed to be stable to deformations. Experiments show that the hybrid network achieves reasonable performance, and outperforms the network-in-network architecture in the small-data regime.\r\n\r\nI have often heard researchers ask why it is necessary to re-learn low level features of convolutional networks every time they are trained. In theory, using fixed features could save parameters and training time. As far as I am aware, this paper is the first to investigate this question. In my view, the results show that using scattering features in the bottom layers does not work as well as learned CNN features. This is not completely obvious a priori, and so the results are interesting, but I disagree with the framing that the hybrid network is superior in terms of generalization.\r\n\r\nFor the low-data regime, the hybrid network sometimes gives better accuracy but overall, the results are not consistently better than those achieved by deep networks trained from scratch. Additionally, while the paper claims that the hybrid network guarantees stability to geometric transformations and noise in the first layers, it does not provide sufficient evidence or analysis to support this claim. The experiments only focused on the accuracy and generalization performance of the hybrid network without thoroughly examining the stability properties. I believe that further investigation and analysis are needed to validate the stability guarantees and better understand the limitations and advantages of the hybrid network approach. Nonetheless, the release of the ScatWave software is a valuable contribution to the field as it provides a fast computation framework for scattering networks. With further improvements and refinements, the hybrid network approach explored in this paper has the potential to offer significant benefits, but more thorough evaluation and analysis are required to fully assess its effectiveness and compare it to other state-of-the-art deep learning architectures.","label":165}
{"id":"4412f867-5455-4aaa-b44e-88962e7f1ef4","text":"Thanks a lot for your detailed response and clarifications.\r\n\r\nThe paper proposes to use a scattering transform as the lower layers of a deep network. This fixed representation enjoys good geometric properties (local invariance to deformations) and can be thought as a The paper proposes to use a scattering transform as the lower layers of a deep network. This fixed representation enjoys good geometric properties (local invariance to deformations) and can be thought as a generic and efficient initialization of the first layers of a deep network. The remaining layers are learned in a supervised manner, resulting in a hybrid scattering network that achieves nearly state-of-the-art results on standard benchmarks. The authors demonstrate that deep hybrid scattering networks generalize better on small datasets compared to supervised deep networks. This generalization could be attributed to the learned operators explicitly building invariances to geometrical variabilities, such as local rotation and translation, as analyzed in the third layer of the architecture. The paper also highlights that it is possible to replace the scattering transform with a standard deep network, but at the cost of having to learn more parameters and potentially introducing instabilities. Additionally, the authors release a new software, ScatWave, which utilizes GPUs for fast computations of a scattering network integrated in Torch. The proposed model is evaluated on the CIFAR10, CIFAR100, and STL10 datasets. Overall, the paper presents an interesting approach that combines the strengths of scattering and convnet networks, showcasing their benefits in terms of computation time savings, stability to geometric transformations and noise, and improved generalization on small datasets.","label":41}
{"id":"ea1c1ae1-3ecd-4e47-8ff8-9d33cac56252","text":"In this paper, the authors explicitly design geometrical structure into a CNN by combining it with a Scattering network. This aids stability and limited-data performance. The paper is well written, the contribution of combining Scattering and CNNs is novel and the results seem promising. I feel that such work was a missing piece in the Scattering literature to make it useful for practical applications.\r\n\r\nI wish the authors would have investigated the effect of the stable bottom layers with respect to adversarial examples. This can be done in a relatively straightforward way with software like cleverhans [1] or deep fool [2]. It would be very interesting if the first layer's robustness to adversarial examples carries over to the subsequent layers of the network. By investigating the effect of adversarial attacks on the learned operators in the deeper layers, the authors could gain further insights into the invariances learned by their hybrid architecture. Additionally, it would be interesting to see how the hybrid network performs on other benchmark datasets, such as ImageNet, to assess its scalability and generalizability. The authors mention that their approach saves computation time, but it would be helpful to provide more details on the computational efficiency compared to traditional deep networks. Finally, the release of the ScatWave software is a valuable contribution to the research community, and it would be beneficial to include a section in the paper discussing the implementation details and performance benchmarks of the software. Overall, this work is highly commendable and represents a significant advancement in combining scattering and convolutional neural networks for improved performance and stability in deep learning applications.","label":109}
{"id":"ae346176-5127-4d74-9c28-5a75aa62b8a7","text":"The program committee appreciates the authors' response to concerns raised in the reviews. Reviewers are generally excited about the combination of predefined representations with CNN architectures, allowing the model to generalize and perform as well as deep networks learned from scratch. The use of scattering as a fixed initialization for the first layers adds stability to the network against geometric transformations and noise. The authors also provide evidence that deep hybrid scattering networks generalize better on small datasets compared to supervised deep networks. Overall, reviewers are excited about the potential of this hybrid approach and commend the authors on developing the ScatWave software for fast computations of scattering networks integrated in Torch.","label":31}
{"id":"579674fc-d93b-4a06-976d-e993206f3758","text":"Dear reviewers,\r\n\r\nHere are two elements that I have added thanks to your constructive and helpful suggestions:\r\n\r\n- I have added a note in the Appendix B that quantifies precisely the additive perturbations of a I have added a note in the Appendix B that quantifies precisely the additive perturbations of a scattering network compared to a standard deep network. This provides additional insights into the robustness and stability of our hybrid architecture. The analysis highlights the advantages of using scattering as a generic and fixed initialization of the first layers, which ensures stability to geometric transformations and noise. Furthermore, I have included the results of our evaluation on the CIFAR10, CIFAR100, and STL10 datasets, showcasing the impressive performance of our hybrid scattering network compared to supervised deep networks. The experiments demonstrate that our model not only achieves nearly state-of-the-art results on standard benchmarks but also generalizes better on small datasets. We also provide an analysis of the third layer of our architecture, elucidating how the learned operators explicitly build invariances to local rotation and translation. In response to the potential concerns raised by some reviewers, I have also compared the scattering transform with a standard deep network. Although it is possible to replace the scattering transform, it would require learning more parameters and potentially introduce instabilities. Lastly, we are pleased to announce the release of our new software, ScatWave, which utilizes GPUs for fast computations of a scattering network integrated in Torch. Overall, these additions further strengthen the paper and address the valuable feedback provided by the reviewers.","label":33}
{"id":"c4a0d16c-b603-4352-bec0-76a6fe183d23","text":"In this paper, the authors explicitly design geometrical structure into a CNN by combining it with a Scattering network. This aids stability and limited-data performance. The paper is well written, the contribution of combining Scattering and CNNs is novel and the results seem promising. I feel that such work was a missing piece in the Scattering literature to make it useful for practical applications.\r\n\r\nI wish the authors would have investigated the effect of the stable bottom layers with respect to adversarial examples. This can be done in a relatively straightforward way with software like cleverhans [1] or deep fool [2]. It would be very interesting if the first layer's stability in the hybrid architectures increases robustness significantly, as this would suggest that the hybrid network not only improves generalization on small datasets but also enhances the network's ability to resist adversarial attacks. Additionally, I appreciate that the authors released a new software, ScatWave, which allows for fast computations of a scattering network integrated in Torch. This is a valuable contribution to the research community as it facilitates the adoption of scattering networks in practical applications. Overall, the paper presents a compelling approach that combines the strengths of scattering and CNNs, improving generalization and stability while maintaining efficiency. However, I believe that further investigation into the performance and robustness of the hybrid network in the presence of adversarial examples would greatly enhance the paper. This could potentially open up new avenues for using scattering networks in security-sensitive tasks. I look forward to seeing future research building upon these findings and exploring the potential applications and advantages of the proposed hybrid network architecture.","label":119}
{"id":"588db031-d22a-4683-ac03-6e922db4817a","text":"The paper investigates a hybrid network consisting of a scattering network followed by a convolutional network. By using scattering layers, the number of parameters is reduced, and the first layers are guaranteed to be stable to deformations.. This is an interesting approach as it combines the advantages of both scattering networks and convolutional networks. The authors demonstrate that by initializing the first layers with a scattering transform, they can achieve near-state-of-the-art results on standard benchmarks. This shows that the scattering transform can act as a generic and fixed initialization for deep networks, reducing the need to learn all layers from scratch.Furthermore, the authors show that deep hybrid scattering networks generalize better on small datasets compared to supervised deep networks. This is an important finding, as small datasets are often prone to overfitting in deep learning models. By leveraging the stability and invariances to geometric transformations and noise provided by the scattering layers, the hybrid network shows improved generalization performance. This could be particularly beneficial in domains where data scarcity is a challenge.The paper also investigates the learned operators in the hybrid architecture and demonstrates that they explicitly build invariances to geometrical variabilities such as local rotation and translation. This analysis provides insights into the representation learned by the network and highlights the benefits of incorporating scattering layers in the early stages of the network. Notably, the authors mention that replacing the scattering transform with a standard deep network would require learning more parameters and potentially introduce instabilities. Thus, the use of scattering layers seems to offer a more efficient and stable alternative.Finally, the authors release a new software, ScatWave, which utilizes GPUs for fast computations of a scattering network integrated in Torch. This is a valuable contribution as it enables researchers and practitioners to easily implement scattering networks in their work. The evaluation of the proposed hybrid model on benchmark datasets, namely CIFAR10, CIFAR100, and STL10, further showcases the performance of the approach.In conclusion, the paper presents a compelling hybrid network that combines scattering and convolutional layers. The experimental results indicate that this hybrid architecture achieves competitive performance while reducing the number of parameters and offering stability to geometric transformations. The analysis of the learned operators and the release of the ScatWave software add to the overall contribution of the paper. Overall, the work is well-written and makes a significant contribution to the field of deep learning.","label":37}
{"id":"3e979ef8-69e0-44a7-bc8b-442498dbe712","text":"Thanks a lot for your detailed response and clarifications.\r\n\r\nThe paper proposes to use a scattering transform as the lower layers of a deep network. This fixed representation enjoys good geometric properties (local invariance to deformations) and can be thought as a form of regularization or prior. The top layers of the network are trained to perform a given supervised task. This is a unique and novel approach that offers potential advantages over traditional deep networks. By incorporating the scattering transform as a fixed initialization for the lower layers, the network benefits from the stability and invariances provided by this representation. The authors provide compelling evidence of the improved generalization capabilities of deep hybrid scattering networks compared to supervised deep networks on smaller datasets. This finding has important implications for saving computation time while still maintaining high performance. Moreover, the paper presents a thorough analysis of the learned operators in the third layer, revealing their explicit built-in invariances to geometrical variabilities. This not only enhances our understanding of the inner workings of the proposed architecture but also adds valuable insights to the field. The authors acknowledge the potential drawback of replacing the scattering transform with a standard deep network, highlighting the need to learn more parameters and the possibility of introducing instabilities. Finally, the release of the ScatWave software, integrated in Torch and utilizing GPUs for fast computations, further strengthens the practicality of implementing the proposed model. Overall, the paper presents a well-executed and informative study that showcases the effectiveness and applicability of hybrid scattering networks on various benchmark datasets.","label":62}
{"id":"25c5f729-d2c9-4cfc-85b0-cf0561359d1c","text":"The paper extends the NTM by a trainable memory addressing scheme.\r\nThe paper also investigates both continuous\/differentiable as well as discrete\/non-differentiable addressing mechanisms.\r\n\r\nPros:\r\n* The paper does a good job in extending the NTM with a trainable memory addressing scheme and exploring both continuous\/differentiable and discrete\/non-differentiable mechanisms. The experiments on Facebook bAbI tasks show that the D-NTM outperforms NTM and LSTM baselines, which is impressive. Further experimental results on sequential MNIST, associative recall, and copy tasks provide additional evidence of the D-NTM's effectiveness. Overall, this paper presents a valuable contribution to the field of neural Turing machines.","label":22}
{"id":"b89de594-38a2-4da9-a5ea-14ec1f996956","text":"This paper proposes some novel architectural elements, and the results are not far from being impressive. The introduction of a trainable memory addressing scheme in the dynamic neural Turing machine (D-NTM) allows for the learning of various location-based addressing strategies. The experiments conducted on Facebook bAbI tasks demonstrate the superiority of the D-NTM over NTM and LSTM baselines. Furthermore, the additional experimental results on sequential MNIST, associative recall, and copy tasks reinforce the promising performance of the proposed approach.","label":14}
{"id":"ea00d0dd-3bf2-4a52-9183-0e44ce196d5d","text":"The authors proposed a dynamic neural Turing machine (D-NTM) model that overcomes the rigid location-based memory access used in the original NTM model. The paper has two main contributions: 1) introducing a learnable addressing to NTM. 2) curriculum learning using hybrid discrete and continuous attention. The proposed model was empirically evaluated on Facebook bAbI task and has shown improvement over the original NTM.\r\n\r\nPros:\r\n+ Comprehensive comparisons of feed-forward Pros:\r\n+ Comprehensive comparisons of feed-forward and GRU-controlled D-NTM architectures.\r\n+ The model outperforms NTM and LSTM baselines on the Facebook bAbI tasks and shows promising results on other sequential tasks.\r\n+ The introduction of a trainable memory addressing scheme allows the D-NTM to learn various location-based addressing strategies.\r\n+ The use of both continuous and discrete addressing mechanisms provides flexibility in addressing different types of tasks.\r\n\r\nCons:\r\n- The paper lacks a thorough discussion on the limitations of the proposed model.\r\n- There could be more detailed analyses of the performance improvements over the NTM and LSTM baselines.\r\n- The experiments could be expanded to include more diverse and challenging benchmark datasets.\r\n- The training time and computational resources required for the D-NTM are not discussed in detail.\r\n\r\nOverall, the paper presents a significant extension to the NTM model with the introduction of a dynamic addressing scheme. The empirical evaluation demonstrates the effectiveness of the proposed D-NTM, although there are some areas for improvement that could be addressed in future work.","label":67}
{"id":"35c6499a-4142-4c11-b485-8944c10c58d0","text":"Dear Reviewers and Readers,\r\n\r\nFor the codes of the D-NTM, it is essential to ensure that they are well-documented and easily accessible for future researchers to replicate and build upon. Additionally, thorough benchmarking and comparison with other state-of-the-art memory-based models would further strengthen the paper's contribution. Furthermore, providing insights into the computational complexity and scalability of the D-NTM would be valuable. Overall, this paper presents an interesting extension to the NTM, and with some additional experiments and analysis, it has the potential to make a significant impact in the field.\r\n\r\n","label":7}
{"id":"6e8296a3-45ed-419b-a2aa-5c62dfddcce2","text":"This paper introduces a variant of the neural Turing machine (NTM, Graves et al. 2014) where key and values are stored. They try both continuous and discrete mechanisms to control the memory.\r\n\r\nThe model is quite complicated and seem to require a lot of tricks to work. Overall it seems that more than 10 different terms appear in the cost function and many different hacks are required to learn the model. It is hard to understand the justification for all of these tricks and sophisticated choices..However, despite the complexity of the model, the results presented in this paper are promising. The authors conduct experiments on Facebook bAbI tasks using both a feedforward and GRU-controller and compare the performance of the D-NTM to NTM and LSTM baselines. The D-NTM outperforms these baselines, demonstrating its effectiveness in learning location-based addressing strategies. Additionally, the model is evaluated on other tasks such as sequential MNIST, associative recall, and copy tasks, further highlighting its versatility. The experimental results show that the D-NTM is able to successfully read and write to memory, achieving state-of-the-art performance on these tasks. While some might argue that the numerous tricks and sophisticated choices required by the model make it less transparent and interpretable, the improved performance suggests that these techniques are necessary to fully capture the desired memory addressing capabilities. In conclusion, this paper presents a valuable extension to the NTM framework, providing a trainable memory addressing scheme that enables the D-NTM to learn a wide range of addressing strategies, outperforming baselines on various tasks. The work opens up avenues for future research in memory-augmented neural network architectures.","label":85}
{"id":"0811b5b8-9895-46a2-9458-3679d932f843","text":"The paper extends the NTM by a trainable memory addressing scheme.\r\nThe paper also investigates both continuous\/differentiable as well as discrete\/non-differentiable addressing mechanisms.\r\n\r\nPros:\r\n* Extension of the NTM with a trainable memory addressing scheme allows for a wide range of location-based addressing strategies, including linear and nonlinear ones. The inclusion of both continuous and discrete mechanisms provides a comprehensive understanding of their effects. The experiments on Facebook bAbI tasks, as well as the evaluation on sequential MNIST, associative recall, and copy tasks, demonstrate the superior performance of the D-NTM compared to NTM and LSTM baselines. Overall, this paper presents a significant advance in the field of neural Turing machines and provides valuable insights into the capabilities of dynamic addressing schemes.","label":23}
{"id":"1b48b3be-8f43-413b-9673-34209c83fa46","text":"The model described in his paper is quite complicated and reproducing the method from its description may be challenging: Are you planing to release the code and what is your estimate release date?\r\n\r\nDifferent cost functions and regularization are introduce in the paper. Would it be possible to summarize the overall cost function minimized by this model? \r\n\r\nSome variables seems to have different definition. In particular w_t and b, would it be possible to clarify this?\r\n\r\n\u201cgamma_t is a shallow MLP\u201d: What does it mean?.The model described in this paper is indeed complex, and reproducing it solely based on the description could be challenging. It would be helpful if the authors plan to release the code alongside the paper. The availability of the code will assist researchers and practitioners in effectively implementing and exploring the proposed methods. Hence, it would be great if the authors could provide an estimated release date for the code.\r\n\r\nThe paper introduces various cost functions and regularization techniques. To better understand the overall cost function minimized by this model, it would be valuable if the authors could provide a summary summarizing these components. This would enable the readers to grasp the key aspects of the model's optimization process.\r\n\r\nSome variables mentioned in the paper, such as w_t and b, appear to have different definitions or usages. It would significantly enhance the clarity and comprehension of the paper if the authors could clarify the specific meanings or roles of these variables in the context of the model.\r\n\r\nRegarding the statement, 'gamma_t is a shallow MLP,' it would be helpful if the authors could elaborate on what is meant by 'shallow MLP.' MLP typically stands for Multilayer Perceptron, which is a type of artificial neural network. However, the term 'shallow' might indicate a specific architectural characteristic or usage within the model. A clear explanation of this term would ensure a better understanding of the model's inner workings.\r\n\r\nOverall, this paper presents a promising extension of the neural Turing machine (NTM) called dynamic neural Turing machine (D-NTM). While the paper provides valuable insights into the proposed model and its performance on various tasks, addressing the aforementioned concerns would greatly enhance the clarity and reproducibility of the work.","label":83}
{"id":"b563e174-ab68-4618-8d6e-d12c36b66c89","text":".The paper presented a novel dynamic neural Turing machine (D-NTM) by introducing a trainable memory addressing scheme, consisting of content and address vectors. They implemented both continuous and discrete read\/write mechanisms and conducted experiments on Facebook bAbI tasks, sequential MNIST, associative recall, and copy tasks. The D-NTM outperformed NTM and LSTM baselines in terms of performance and showed the ability to learn various location-based addressing strategies.","label":0}
{"id":"059188a4-9edd-4b1b-8415-1c0d8f2ed349","text":".The paper presents the dynamic neural Turing machine (D-NTM) which incorporates a trainable memory addressing scheme. By utilizing separate content and address vectors for each memory cell, the D-NTM can leverage various location-based addressing strategies. Notably, it includes both continuous, differentiable and discrete, non-differentiable read\/write mechanisms. The authors evaluate the D-NTM on Facebook bAbI tasks, comparing its performance against NTM and LSTM baselines. The results demonstrate that D-NTM outperforms these baselines. Additionally, the paper presents experimental findings on other tasks such as sequential MNIST, associative recall, and copy tasks, further confirming the effectiveness and versatility of the D-NTM.","label":0}
{"id":"aa7b6ebf-387f-4ee2-a834-fba9e23f43cd","text":"This paper proposes some novel architectural elements, and the results are not far from published DNC results. However, the paper lacks in providing a detailed comparison between the continuous and discrete addressing schemes. Additionally, the authors should elaborate on the limitations and potential future directions for improving the D-NTM. Despite these shortcomings, the D-NTM shows promising performance on various tasks and the experiments are well-designed and thorough.","label":18}
{"id":"84aaafe9-ef58-47fd-9112-ae2effa2d322","text":"The authors proposed a dynamic neural Turing machine (D-NTM) model that overcomes the rigid location-based memory access used in the original NTM model..The proposed D-NTM model is an extension of the neural Turing machine (NTM) that addresses the limitations of rigid location-based memory access in the original NTM model. By introducing a trainable memory addressing scheme, the authors enabled the D-NTM to learn a wide variety of location-based addressing strategies, including both linear and nonlinear ones. This is achieved by maintaining separate content and address vectors for each memory cell. Additionally, the authors implemented the D-NTM with both continuous, differentiable read\/write mechanisms, as well as discrete, non-differentiable ones. Through thorough experimentation, utilizing Facebook bAbI tasks, a feedforward and GRU-controller were used to investigate the mechanisms and effects of learning to read and write to memory. The evaluation of the D-NTM on the Facebook bAbI tasks demonstrated its superior performance compared to the NTM and LSTM baselines. Furthermore, the D-NTM was also tested on sequential MNIST, associative recall, and copy tasks, providing additional experimental results. Overall, the D-NTM model presented in this paper offers a significant improvement over the original NTM model by introducing a trainable memory addressing scheme and achieving superior performance on various tasks.","label":23}
{"id":"d62b1208-813d-45d5-ae0f-6c7ce320f39d","text":"Dear Reviewers and Readers,\r\n\r\nFor the For the rest of the paper, the authors provide a thorough analysis of the D-NTM, comparing its performance to that of the NTM and LSTM baselines. The experimental results on the Facebook bAbI tasks demonstrate the superiority of the D-NTM, showcasing its ability to learn various location-based addressing strategies. Additionally, the D-NTM is evaluated on other tasks such as sequential MNIST, associative recall, and copy tasks, further demonstrating its effectiveness as a memory-based learning model. Overall, this paper presents a novel extension to the NTM and provides convincing evidence of its improved performance in various tasks.","label":5}
{"id":"0bbe4acc-feb9-4b12-ab50-4aec7d370a3e","text":"This paper introduces a variant of the neural Turing machine (NTM, Graves et al. 2014) where key and values are stored. They try both continuous and discrete mechanisms to control the memory.\r\n\r\nThe model is quite complicated and seem to require a lot of tricks to work..However, despite the complexity of the model and the need for various tricks, this paper presents promising results. The introduction of the trainable memory addressing scheme in the dynamic neural Turing machine (D-NTM) allows for a wide range of location-based addressing strategies, including both linear and nonlinear ones. By maintaining separate content and address vectors for each memory cell, the D-NTM demonstrates the ability to learn and adapt to different read and write mechanisms. The authors conduct experiments on Facebook bAbI tasks using both feedforward and GRU-controller implementations of the D-NTM. The results show that the D-NTM outperforms both the NTM and LSTM baselines in terms of task performance. Moreover, the D-NTM is further evaluated on sequential MNIST, associative recall, and copy tasks, providing additional evidence of its effectiveness. These experiments demonstrate the versatility and generalizability of the D-NTM across different domains and tasks. The authors also discuss the limitations of their approach, such as the need for extensive training and the potential for overfitting. Overall, this paper contributes to the advancement of neural Turing machines by introducing the D-NTM with continuous and discrete addressing schemes and providing empirical evidence of its superiority over existing models. The results and insights presented in this paper pave the way for future research in this area and inspire further exploration of memory-based models for various cognitive tasks.","label":46}
{"id":"46de8276-0984-4594-9601-203d088e493f","text":".The paper presents a novel extension to the neural Turing machine (NTM) called the dynamic neural Turing machine (D-NTM). The D-NTM introduces a trainable memory addressing scheme which includes both content and address vectors for each memory cell. This allows the D-NTM to learn various location-based addressing strategies, enabling it to outperform NTM and LSTM baselines on the Facebook bAbI tasks. The D-NTM is implemented with both continuous and discrete read\/write mechanisms, and further experimental results on sequential MNIST, associative recall, and copy tasks are provided. Overall, this study demonstrates the capability and effectiveness of the D-NTM in addressing memory tasks.","label":0}
{"id":"378d832a-dae1-40f0-b82f-52fbde2bbdf0","text":"The model described in his paper is quite complicated and reproducing the method from its description may be challenging: Are you planing to release the code and what is your estimate release date?\r\n\r\nDifferent cost functions and regularization are introduce in the paper. Would it be possible to summarize the overall cost function minimized by this model? \r\n\r\nSome variables seems to have different definition. In particular The model described in this paper is quite complicated and reproducing the method from its description may be challenging. It would be very helpful if the authors plan to release the code for their implementation. This would not only enable researchers to replicate their results more easily but also allow for further exploration and experimentation with the proposed method. Therefore, I would like to inquire about the authors' plan and if they have an estimated release date for the code.Additionally, the paper introduces various cost functions and regularization techniques. It would be useful if the authors could provide a concise summary of the overall cost function minimized by this model. Understanding the underlying objective function will help in comprehending the training process and its optimization.Furthermore, a few variables mentioned in the paper seem to have different definitions or interpretations. In particular, it would be helpful if the authors could clarify the specific meanings of these variables to avoid any confusion or misinterpretation.Overall, the paper presents an interesting extension of the neural Turing machine (NTM) called the dynamic neural Turing machine (D-NTM). The proposed approach introduces a trainable memory addressing scheme that enables location-based addressing strategies, including both linear and nonlinear ones. The D-NTM is implemented with both continuous, differentiable and discrete, non-differentiable read\/write mechanisms, and its performance is investigated on various tasks, including Facebook bAbI tasks, sequential MNIST, associative recall, and copy tasks.The evaluation results demonstrate that the D-NTM outperforms both the NTM and LSTM baselines on the Facebook bAbI tasks. Additionally, the authors provide further experimental results on other tasks, which showcase the versatility and effectiveness of the D-NTM.In conclusion, this paper presents a valuable contribution to the field of neural Turing machines by introducing the D-NTM and its trainable memory addressing scheme. With some clarifications on variable definitions and the possible release of the code, this work has the potential to facilitate further research and advancements in the area of memory-based neural networks.","label":65}
{"id":"b2842df2-fa28-4357-b9b4-5ea5e545bf5b","text":"The paper describes an extension of the HasheNets work, with several novel twists. Instead of using a single hash function, the proposed HFH approach uses multiple hash function to associate each \"virtual\" (to-be-synthesized) weight location to several components of an underlying parameter vector (shared across all layers)..The authors justify the use of multiple hash functions by explaining that it reduces the probability of hash collisions, which can lead to loss of information and degradation in the compression ratio. By associating each weight location with several components of an underlying parameter vector, the HFH approach aims to improve the overall compression performance. The paper also introduces the concept of a compression space, which is homological in nature. All layers of the deep net fetch hashed values from this compression space, further enhancing the compression capabilities of HFH.\r\n\r\nOne notable contribution of HFH is the inclusion of a small reconstruction network within the overall network architecture. This reconstruction network is responsible for recovering the weight entries from the compressed space. By training the reconstruction network jointly with the rest of the network, HFH is able to achieve high compression ratios without significant loss in prediction accuracy. The experiments conducted on various benchmark datasets demonstrate the effectiveness of HFH in terms of compression ratios and prediction accuracy.\r\n\r\nFurthermore, the authors mention that HFH can be viewed as an extension of the HashedNets methodology. In fact, HashedNets can be considered a special case or a degenerated scenario of HFH. The paper highlights that HFH outperforms HashedNets in terms of performance, indicating the improvement gained through the use of multiple hash functions.\r\n\r\nAnother noteworthy aspect of HFH is its ability to efficiently determine a desired compression ratio. Unlike exhaustive searching through a combinatory space configured by all layers, HFH leverages the homological hashing essence to efficiently figure out the appropriate compression ratio. This is a practical advantage in real-world scenarios where finding the optimal compression ratio can be a challenging and time-consuming task.\r\n\r\nOverall, the paper presents a comprehensive description of the HFH approach and its extensions to the HashedNets methodology. The experimental results provide strong evidence of the effectiveness of HFH in compressing deep neural networks while maintaining high prediction accuracy. The concept of homological hashing and the inclusion of a reconstruction network within the network architecture are valuable contributions. However, further details on the experimental setup, training procedure, and comparison with other compression techniques would enhance the paper's strength.","label":47}
{"id":"edb80724-1ba9-4f46-ac8f-319302fb4220","text":"This paper presents some interesting and potentially useful ideas, but multiple reviewers point out that the main appeal of the paper's contributions would be in potential follow-up work and that the paper as-is does not present a compelling use case for the novel ideas..Several reviewers have expressed interest in the potential applications of the novel structure proposed in this paper. However, they agree that the paper's current presentation does not sufficiently showcase the practical benefits of the approach. It is recommended that the authors provide concrete use cases or empirical evaluations to demonstrate the effectiveness and real-world applicability of the HFH method. Without such evidence, it may be challenging for readers to fully appreciate the significance of the contributions made in this work.","label":44}
{"id":"cd9fd77b-d566-46bf-a020-45db9ca098e0","text":"The paper proposed a very complex compression and reconstruction method (with additional parameters) for reducing the memory footprint of deep networks.\r\n\r\nThe authors show that this complex proposal is better than simple hashed net proposal. One question: Are there any additional computational costs associated with the use of multiple low-cost hash functions, and if so, how does it compare to the benefits obtained in terms of memory and energy consumption reduction? It would be helpful if the authors could provide a more detailed analysis of the trade-offs between the performance gain achieved through HFH and the additional computational resources required. Additionally, it would be interesting to see how HFH performs on more diverse and challenging benchmark datasets. Overall, the proposed HFH method seems promising in addressing the memory and energy consumption issues of deep neural networks, but further experimentation and analysis are necessary to fully validate its effectiveness and practicality in real-world scenarios.","label":37}
{"id":"9a285c59-2df5-4af1-83ca-bfd9664efe32","text":"The paper presents a method to reduce the memory footprint of a neural network at some increase in the computation cost. This paper is a generalization of HashedNets by Chen et al. (ICML'15) where parameters of a neural network are mapped into smaller memory arrays using some hash functions with possible collisions..This novel structure, called Homologically Functional Hashing (HFH), aims to address the growing complexity of deep neural networks (DNNs) and the increasing memory and energy consumption in industrial applications, especially on mobile devices. HFH uses multiple low-cost hash functions to map weight entries in a deep net to values in a compression space. These hashed values are fetched by all layers in the network, creating a homological compression space. To recover the original weight entries, a small reconstruction network is employed, which is trained jointly with the entire network. The experimental results on several benchmark datasets show that HFH achieves high compression ratios without significant loss in prediction accuracy, making it a promising approach for compressing DNNs.\r\n\r\nThe primary contribution of this paper is the introduction of HFH as a generalization of HashedNets, providing improved performance over the previous approach. The homological hashing essence enables efficient determination of a single desired compression ratio, eliminating the need for exhaustive searching in the combined space of all layers. This is particularly advantageous for fine-tuning the compression ratio to meet specific requirements. Furthermore, HFH demonstrates superior performance compared to HashedNets, indicating its potential for practical applications.\r\n\r\nHowever, there are a few areas that could be further explored and discussed in the paper. Firstly, the computational cost associated with HFH is briefly mentioned, but a more detailed analysis and comparison with other compression techniques would strengthen the paper. Additionally, the impact of HFH on training time and convergence should be investigated, as reducing the memory footprint could potentially affect the optimization process. Furthermore, the robustness of HFH to adversarial attacks and its ability to transfer learned knowledge should also be examined. Overall, the paper presents an innovative method for compressing DNNs and offers promising results, but further investigation and analysis are needed to fully evaluate its practicality and effectiveness.","label":52}
{"id":"befe3bc3-1619-4016-9d78-baa38d64d02c","text":"The paper describes an extension of the HasheNets work, with several novel twists. Instead of using a single hash function, the proposed HFH approach uses multiple hash function to associate each \"virtual\" (to-be-synthesized) weight location to several components of an underlying parameter vector (shared across all layers). These components are then passed through a small MLP to synthesize the final weight.\r\n\r\nThis is an interesting and novel idea, and the experiments demonstrate that it improves substantially over HashedNets. However, HashedNets is not a particularly compelling technique for neural network model compression, especially when compared to other state-of-the-art compression techniques. The authors acknowledge this limitation and position HFH as an improvement over HashedNets. The use of multiple hash functions in HFH allows for better association of weight locations and components of the parameter vector, leading to improved synthesis of the final weight. This is consistent with the intuition that using multiple hash functions can capture different aspects of the weight information, enhancing the overall compression performance. The experimental results on benchmark datasets validate the effectiveness of HFH in achieving high compression ratios while maintaining prediction accuracy. The comparison with HashedNets shows that HFH outperforms it significantly, indicating that the proposed approach is a more powerful compression technique for deep neural networks. Additionally, the homological hashing essence of HFH offers the advantage of efficiently determining a desired compression ratio, which reduces the need for exhaustive searching. This is a notable contribution as it addresses a practical concern in the field of model compression. Overall, the paper presents a well-designed and innovative method with strong experimental results. The novelty lies in the application of homologically functional hashing in compressing DNNs and the use of multiple hash functions for weight synthesis. I believe that this work will be of interest to researchers and practitioners working in the field of deep learning.","label":93}
{"id":"075614c3-8f25-489c-b4e6-f805ae4c439d","text":"The paper describes an extension of the HasheNets work, with several novel twists. Instead of using a single hash function, the proposed HFH approach uses multiple hash function to associate each \"virtual\" (to-be-synthesized) weight location to several components of an underlying parameter vector (shared across all layers)..The use of multiple hash functions in the HFH approach is an interesting and innovative idea. By associating each virtual weight location with several components of an underlying parameter vector, the authors are able to achieve higher compression ratios with minimal loss in prediction accuracy. This is particularly evident in the comparison to the HashedNets approach, where HFH shows significantly improved performance. One of the key contributions of HFH is the introduction of the homological hashing essence, which allows for the efficient determination of a desired compression ratio. This is important as it eliminates the need for exhaustive searching throughout a combinatory space configured by all layers. By leveraging the compression space and the reconstruction network, HFH is able to find the optimal compression ratio that balances the reduction in memory and energy consumption with the preservation of prediction accuracy.The experimental results on several benchmark datasets demonstrate the effectiveness of HFH. The high compression ratios achieved by HFH, coupled with the minimal loss in prediction accuracy, make it a promising approach for compressing deep neural networks. The fact that HFH outperforms HashedNets, which is a widely used method for network compression, further validates the superiority of the proposed approach.However, there are a few aspects that could be further discussed in the paper. Firstly, while the experimental results show the improvement of HFH over HashedNets, it would be interesting to know how HFH compares to other state-of-the-art compression techniques. Secondly, more details about the reconstruction network used in HFH would be helpful. How is it designed and trained? Are there any limitations or bottlenecks associated with its utilization?In conclusion, the paper presents a novel and effective approach, HFH, for compressing deep neural networks. It builds upon the HasheNets work and introduces multiple low-cost hash functions for associating weight locations with components of an underlying parameter vector. The experimental results demonstrate the superiority of HFH over HashedNets and its ability to achieve high compression ratios with minimal loss in prediction accuracy. With some additional discussion on the comparison to other compression techniques and further explanation of the reconstruction network, the paper would be even more impactful.","label":47}
{"id":"59dbe2ad-a109-4253-948d-986ea34d2b38","text":"This paper presents some interesting and potentially useful ideas, but multiple reviewers point out that the main appeal of the paper's contributions would be in potential follow-up work. While the concept of homologically functional hashing for compressing deep neural networks is intriguing, the paper lacks in-depth experimental validation and comparative analysis with other compression methods. Additionally, reviewers have emphasized the importance of addressing the trade-off between compression ratio and prediction accuracy, as well as the potential impact of HFH on training time and computational complexity. Further research and experimentation are required to fully evaluate the practicality and effectiveness of HFH in real-world scenarios. Overall, the paper presents promising ideas that warrant further exploration and refinement.","label":27}
{"id":"f33e6515-4fdf-42cb-b96c-6018b96a3fcb","text":"The paper proposed a very complex compression and reconstruction method (with additional parameters) for reducing the memory footprint of deep networks.\r\n\r\nThe authors show that this complex proposal is better than simple hashed net proposal..However, while the authors demonstrate that their proposed HFH method outperforms the simple HashedNets proposal in terms of compression ratios and prediction accuracy, it is important to note that the complexity of the HFH method may introduce additional computational overhead. The use of multiple low-cost hash functions and the inclusion of a small reconstruction network likely require additional resources compared to the simpler HashedNets approach. It would be beneficial for the authors to provide more detailed analysis and evaluation of the computational cost of the HFH method, as this is a crucial factor to consider, particularly for mobile device applications where memory and energy consumption are major concerns. Additionally, the authors should provide more insights into the limitations and potential drawbacks of the HFH method, as a comprehensive understanding of these aspects is essential for the practical adoption of this compression technique. Overall, the paper presents an interesting and novel approach to compressing deep neural networks using homologically functional hashing, but further analysis and experimentation are required to fully evaluate its scalability, resource requirements, and generalizability to different network architectures and datasets.","label":34}
{"id":"76730557-25e9-4bcf-827a-72c84ea64bb4","text":"The paper presents a method to reduce the memory footprint of a neural network at some increase in the computation cost. This paper is a generalization of HashedNets by Chen et al. (ICML'15) where parameters of a neural network are mapped into smaller memory arrays using some hash functions with possible collisions. Instead of training the original parameters, given a hash function, the elements of the compressed memory arrays are trained using back-propagation. In this paper, some new tricks are proposed including: (1) the compression space is shared among the layers of the neural network (2) multiple hash functions are used to reduce the effects of collisions (3) a small network is used to combine the predictions of the compressed network. The experimental results show that the proposed HFH method achieves high compression ratios while maintaining little loss in prediction accuracy compared to the original deep neural network. The authors also compare HFH with the HashedNets method and demonstrate that HFH outperforms HashedNets in terms of compression and accuracy metrics. One of the strengths of HFH is its ability to efficiently determine the desired compression ratio by iteratively adjusting the hash function parameters and analyzing the compression space. This eliminates the need for exhaustive search and reduces the computational burden. The paper is well-written and provides clear explanations of the proposed method and experimental setup. The experimental results are comprehensive, presenting the performance of HFH on several benchmark datasets. Additionally, the authors provide insightful discussions on the limitations and future directions of HFH, highlighting potential areas for improvement. Overall, this paper makes a valuable contribution to the field of deep neural network compression and provides a novel approach using homologically functional hashing. I recommend accepting this paper for publication.","label":115}
{"id":"49dce4d0-478a-446a-8263-d88cbe455845","text":"The paper describes an extension of the HasheNets work, with several novel twists. Instead of using a single hash function, the proposed HFH approach uses multiple hash function to associate each \"virtual\" (to-be-synthesized) weight location to several components of an underlying parameter vector (shared across all layers). These components are then passed through a small MLP to synthesize the final weight.\r\n\r\nThis is an interesting and novel idea, and the experiments demonstrate that it improves substantially over HashedNets. However, HashedNets is not a particularly compelling technique for neural network model compression, especially when compared with more recent work on pruning- and quantization-based approaches. The experiments in this paper demonstrate that the proposed approach yields worse accuracy at worse accuracy at higher compression ratios compared to these pruning- and quantization-based approaches. Additionally, the paper lacks a thorough comparison with other state-of-the-art compression methods, which limits the understanding of the overall performance of HFH. It would be beneficial if the authors could provide more insights into the limitations of HFH and discuss potential areas for improvement. Another concern is the computational overhead introduced by HFH due to the multiple hash functions and the reconstruction network. Although the authors mention that HFH uses low-cost hash functions, the impact on training and inference time is not clear from the experiments. It would be helpful to include some quantitative analysis or comparisons with other compression approaches in terms of computational requirements. Overall, the paper presents an interesting idea with the HFH framework for compressing DNNs. The experiments show promising results regarding compression ratios and prediction accuracy. However, further analysis and comparisons with other compression methods are needed to establish the strength and limitations of HFH in the context of deep neural network compression.","label":116}
{"id":"441c9f28-6134-4eea-b875-ef1fc0b926ea","text":"Authors present a parameterized variant of ELU and show that the proposed function helps to deal with vanishing gradients in deep networks in a way better than existing non-linearities. They demonstrate the effectiveness of the proposed Parametric ELU (PELU) activation function by conducting multiple experiments on different network architectures and datasets. Their results show significant improvements in relative error, outperforming the standard ELU function on CIFAR-10\/100 and ImageNet. Additionally, they observed that Vgg using PELU had activations saturating closer to zero, similar to ReLU, which further supports the benefits of PELU in deep learning. Furthermore, the authors suggest that adjusting the shape of activations during training can help control vanishing gradients and bias shift, making learning easier. Overall, this paper presents a valuable contribution in improving activation functions for deep convolutional neural networks.","label":30}
{"id":"b4a610f4-7bdb-45cd-b514-7814fee8a6e1","text":"The paper describes an important component in Convolutional Neural Networks (CNNs). For instance, recent breakthroughs in Deep Learning can be attributed to the Rectified Linear Unit (ReLU). Another recently proposed activation function, the Exponential Linear Unit (ELU), has the supplementary property of reducing bias shift without explicitly centering the values at zero. In this paper, we show that learning a parameterization of ELU improves its performance. We analyzed our proposed Parametric ELU (PELU) in the context of vanishing gradients and provide a gradient-based optimization framework.","label":3}
{"id":"bf044a6a-7fa4-4a2c-a5b3-af84ccc6c4dc","text":"The paper deals with a very important issue of vanishing gradients and the quest for a perfect activation function. Proposed is an approach of learning the activation functions during the training process. I find this research very interesting, but I am concerned that the paper is a bit premature.\r\n\r\nThere is a long experimental section, but I am not sure what the conclusion is. The authors appear to be somewhat confused themselves. The amount of experimental results presented in the paper is overwhelming and it becomes difficult to determine the key findings. Additionally, the authors mention the performance improvements of PELU over ELU, but it would be beneficial to include a comparison with other popular activation functions used in deep convolutional neural networks. This would provide a more comprehensive analysis and better establish the superiority of PELU. Furthermore, the authors touch upon the issue of controlling vanishing gradients and bias shift with varying activation shapes, but they do not provide any concrete strategies or guidelines for choosing the optimal shapes. It would be valuable to delve deeper into this aspect and provide more insights for future research. Overall, while the proposed Parametric ELU is a promising direction, the paper would benefit from further clarification and additional comparisons to strengthen its contributions.","label":73}
{"id":"e095ccbe-50cd-4bd5-919a-87fe94498a99","text":"This paper presents a new non-linear function for CNN and deep neural networks called Parametric Exponential Linear Unit (PELU). The authors analyze its performance in the context of vanishing gradients and provide a gradient-based optimization framework. They conduct experiments on CIFAR-10\/100 and ImageNet, comparing PELU to ELU and showing relative error improvements. The results demonstrate that PELU outperforms ELU with minimal parameter increase. Additionally, the authors observe that Vgg using PELU prefers activations saturating near zero like ReLU, except at the last layer where it saturates near -2. The findings also highlight the importance of varying activation shape and other parameters in controlling vanishing gradients and bias shift during training.","label":11}
{"id":"945195cb-ccd7-4cfe-b564-9789e3dc0d4d","text":"This paper proposes a modification of the ELU activation function for neural networks, by parameterizing it with 2 trainable parameters per layer. This parameter is proposed to more effectively counter vanishing gradients. \r\n\r\nMy main concern regarding this My main concern regarding this paper is the lack of comparison with other activation functions. While the authors claim that their proposed Parametric ELU (PELU) outperforms the standard ELU on CIFAR-10\/100 and ImageNet, it would have been valuable to see how PELU compares to other popular activation functions such as ReLU, Leaky ReLU, and sigmoid. Without these comparisons, it is difficult to assess the true effectiveness of PELU and understand how it stands against other state-of-the-art activation functions.\r\n\r\nFurthermore, the paper states that PELU has a relative error improvement over ELU on CIFAR-10\/100 and ImageNet. However, it would have been more informative if the authors had provided a more detailed analysis of the performance improvements and the significance of these improvements. For example, are the observed improvements statistically significant? Additionally, it would be helpful to understand the reasons behind the observed improvements and how parameterizing the ELU contributes to the enhanced performance.\r\n\r\nAnother aspect that could be further explored is the generalizability of PELU across different network architectures. The paper mentions that experiments were conducted with various network architectures, but there is limited discussion on how PELU performs in comparison to ELU and other activation functions specifically for each architecture. Providing more detailed insights into the performance of PELU across different architectures would improve the comprehensiveness of the paper.\r\n\r\nOverall, the idea of parameterizing the ELU activation function to improve its performance and address vanishing gradients is interesting. However, to make the findings of this paper more significant and impactful, I believe that additional comparison experiments, more detailed analysis of the performance improvements, and further investigation into the generalizability of PELU across different network architectures are necessary.","label":37}
{"id":"0612d9ea-be10-4498-a211-d7170cde7894","text":".The paper 'Parametric Exponential Linear Unit for Deep Convolutional Neural Networks' addresses the significance of the activation function in CNNs. It expands upon the widely used Rectified Linear Unit (ReLU) and introduces the Exponential Linear Unit (ELU), which reduces bias shift without centering values at zero. The authors propose a parameterization of ELU called Parametric ELU (PELU) and demonstrate its effectiveness in mitigating the issue of vanishing gradients. They present experimental results on CIFAR-10\/100 and ImageNet, comparing PELU with ELU and showcasing significant error improvements. Additionally, the authors note that PELU tends to exhibit similar saturation characteristics as ReLU, except for the last layer. The findings suggest that adjusting activation shape during training can aid in controlling vanishing gradients and bias shift, facilitating the learning process.","label":0}
{"id":"1f936753-6b73-4f8e-89f9-ac4177856647","text":"Authors present a parameterized variant of ELU and show that the proposed function helps to deal with vanishing gradients in deep networks in a way better than existing non-linearities. They propose an optimization framework for learning the parameterization of the Parametric ELU (PELU) and conduct comprehensive experiments on various network architectures such as NiN, Overfeat, All-CNN, and ResNet. They demonstrate that PELU outperforms ELU with relative error improvements of 4.45% and 5.68% on CIFAR-10 and 100, respectively. Moreover, PELU achieves a remarkable 7.28% improvement on ImageNet with a negligible increase in parameters. Additionally, the authors observe that Vgg using PELU exhibits activations saturating closer to zero, resembling ReLU, except for the last layer which saturates near -2. These findings highlight the effectiveness of PELU in controlling vanishing gradients and bias shift, thereby facilitating optimized learning.","label":30}
{"id":"ab8fe463-2666-4983-9c85-3cec8db78dfe","text":"The paper describes a parametric version of the exponential linear unit (ELU) activation function..The proposed parametric version of the exponential linear unit (ELU) activation function is thoroughly explained and its performance is demonstrated through various experiments.","label":14}
{"id":"b13b2aca-ba71-4b5f-ba45-ea07fce26a50","text":"The paper deals with a very important issue of vanishing gradients and the quest for a perfect activation function. Proposed is an approach of learning the activation functions during the training process. I find this research very interesting, but I am concerned that the paper is a bit premature.\r\n\r\nThere is a long experimental section, but I am not sure what the conclusion is. The authors appear to have analyzed the results extensively, but they haven't clearly stated the implications or significance of these findings. It would be helpful if the authors could provide a more thorough discussion of the implications of their experimental results. Additionally, the paper lacks a comparison with other existing activation functions like Leaky ReLU or Parametric ReLU, which would provide a more comprehensive evaluation of the proposed Parametric Exponential Linear Unit (PELU). Furthermore, the authors mention that PELU helps in controlling vanishing gradients and bias shift, but they should provide more evidence or analysis to support this claim. Overall, I believe this paper has potential, but it requires further refinement in terms of clearer conclusions, comparison with other activation functions, and more in-depth analysis of the claimed benefits of PELU.","label":66}
{"id":"aaae63c3-1fd5-4072-b760-de7a3f20e682","text":".In this paper, the authors propose the Parametric Exponential Linear Unit (PELU) as an improved activation function for deep convolutional neural networks. They show that learning a parameterization of ELU enhances its performance, particularly in terms of reducing bias shift. The authors conduct experiments on CIFAR-10\/100 and ImageNet datasets using different network architectures, and demonstrate that PELU outperforms ELU with relative error improvements. Additionally, they observe that Vgg using PELU exhibits similar activation patterns to ReLU, except at the last layer. The results also suggest that adjusting the shape of activations during training aids in controlling vanishing gradients and bias shift, facilitating learning.","label":0}
{"id":"c042f3f1-302e-4f9f-bbfe-da004d3db0bd","text":"This paper proposes a modification of the ELU activation function for neural networks, by parameterizing it with 2 trainable parameters per layer. This parameter is proposed to more effectively counter vanishing gradients. \r\n\r\nMy main concern regarding this paper is related to the authors' claims about the effectiveness of PELU. The analysis in Sections 2 and 3 discusses how PELU might improve training by combating gradient propagation issues. This by itself does not imply that improved generalization will result, only that models may be easier to train. However, the experiments all seek to demonstrate improved generalization performance.\r\nBut this could in principle be due to a better inductive bias, and have nothing to do with the optimization analysis. None of the experiments are designed to directly support the stated theoretical advantage of optimization analysis. The authors should consider performing additional experiments to directly test the proposed theoretical advantage of PELU. For example, they could compare the performance of PELU against other activation functions with similar optimization properties, such as SELU or Swish. Additionally, the authors should conduct experiments on a wider range of datasets to ensure that the observed improvements are not dataset-specific. Moreover, it would be beneficial to investigate the impact of the PELU parameterization on different network architectures, besides the ones mentioned in the paper. This would provide a more comprehensive evaluation of the proposed method. Furthermore, the authors should provide a more detailed discussion on the potential limitations and drawbacks of PELU. Are there any scenarios where PELU may not be as effective or beneficial? Finally, it would be interesting to explore the interpretability of PELU and how the learned parameters correspond to the behavior of the activation function. Overall, this paper presents an interesting modification to the ELU activation function and the experiments demonstrate promising results. However, additional experiments and analysis are needed to fully support the claims made by the authors and provide a more comprehensive understanding of the proposed Parametric ELU.","label":130}
{"id":"5b36d248-83d5-4b06-bf8a-6a34094ccdd7","text":"Authors present a parameterized variant of ELU and show that the proposed function helps to deal with vanishing gradients in deep convolutional neural networks. They conducted experiments on CIFAR-10\/100 and ImageNet with different network architectures, including NiN, Overfeat, All-CNN, and ResNet. The results demonstrated that the proposed Parametric ELU (PELU) outperformed the standard ELU, with relative error improvements of 4.45% and 5.68% on CIFAR-10 and 100, respectively. Additionally, on ImageNet, PELU achieved a remarkable improvement of 7.28% with a minimal parameter increase of only 0.0003%. The authors also observed that PELU tended to prefer activations saturating closer to zero, similar to ReLU, except at the last layer, where it saturated near -2. Moreover, by varying the shape of activations during training, the authors found that they could effectively control vanishing gradients and bias shift, thereby facilitating learning. Overall, the paper provides valuable insights into the benefits of using a parameterization of ELU in deep CNNs and offers a comprehensive evaluation of its performance across various network architectures and datasets.","label":21}
{"id":"cdffa1a9-6500-44ee-b930-01f0ea7e1126","text":"Strengths\r\n\uf06e-- An interesting proposal for a smaller CNN architecture designed for embedded CNN applications. \r\n\uf06e-- Balanced exploration of CNN macroarchitecture and microarchitecture with fire modules.\r\n\uf06e-- x50 less memory usage than AlexNet, keeping similar accuracy \r\n\uf06e-- strong experimental results\r\n\r\nWeaknesses\r\n\uf06e--Would be nice to test Sqeezenet on multiple tasks\r\n\r\n\uf06e--lack of insights and rigorous analysis into what factors are responsible for the similar accuracy achieved by SqueezeNet. Additionally, the lack of insights and rigorous analysis into what factors are responsible for the performance of SqueezeNet is a weakness that could be addressed in future research. It would also be beneficial to test SqueezeNet on multiple tasks to further evaluate its effectiveness and versatility. Overall, the paper presents an interesting proposal for a smaller CNN architecture with promising results, but further investigation and analysis are needed to fully understand and exploit its potential.","label":57}
{"id":"dd217b70-9b8f-45b7-8c87-371911fb450b","text":"The paper proposes a ConvNet architecture (\"SqueezeNet\") and a building block (\"Fire module\") aimed at reducing the model size while maintaining the AlexNet level of accuracy..The paper provides a compelling solution to the problem of reducing model size while maintaining accuracy. By proposing the SqueezeNet architecture and the Fire module, the authors demonstrate the potential for significant reduction in parameters and model size, which has practical implications in various domains such as distributed training, autonomous vehicles, and resource-constrained hardware. It would be interesting to see further experiments and comparisons with other state-of-the-art architectures to fully evaluate the effectiveness of SqueezeNet.","label":26}
{"id":"b26606bf-8437-46cb-9f98-9edc47be5d62","text":"Strengths\r\n\uf06e-- An interesting proposal for a smaller CNN architecture designed for embedded CNN applications. \r\n\uf06e-- Balanced and well-written abstract that clearly explains the motivation and advantages of the proposed SqueezeNet architecture. The paper successfully highlights the importance of smaller DNN architectures in terms of reduced communication in distributed training, limited memory hardware deployment, and exporting models to autonomous cars. The claim of achieving AlexNet-level accuracy with 50x fewer parameters is impressive and makes the paper more appealing for real-world applications. Furthermore, the promise of compressing SqueezeNet to less than 0.5MB using model compression techniques is intriguing and adds to the potential practicality of the proposed architecture. Overall, this paper seems to address an important problem and presents a compelling solution, and I look forward to reviewing the rest of the paper to gain further insights into the methodology and experimental results.","label":16}
{"id":"5ce0dc79-4456-4e38-9da3-39598b8b1917","text":"Summary: The paper presents a smaller CNN architecture called SqueezeNet for embedded deployment. The paper explores CNN macroarchitecture and microarchitecture to develop SqueezeNet, which achieves AlexNet-level accuracy on ImageNet with 50x fewer parameters. The authors also demonstrate that by using model compression techniques, they can further reduce the size of SqueezeNet to less than 0.5MB, making it extremely suitable for resource-constrained environments. The paper highlights the advantages of smaller DNN architectures, such as reduced communication during distributed training, lower bandwidth requirements for model deployment, and compatibility with hardware with limited memory. Overall, the paper presents a compelling solution for efficient and compact deep neural networks, with promising applications in various domains.","label":23}
{"id":"d7c2be54-f4c1-4aaa-82c9-32d4114aa002","text":"The Squeezenet paper came out in Feb 2016, and I read it with interest. It has a series of completely reasonable engineering suggestions for how to save parameter memory for CNNs for object recognition (imagenet). The suggestions make a lot of sense, and provide an excellent compression of the model size. By achieving AlexNet-level accuracy with 50x fewer parameters, SqueezeNet addresses the challenge of reducing communication across servers during distributed training. This is crucial for large-scale deployments, as it minimizes the bandwidth required to export models from the cloud to autonomous cars. Moreover, the model's small size makes it highly suitable for deployment on hardware with limited memory, such as FPGAs. The authors even go a step further and demonstrate the effectiveness of model compression techniques, which enable SqueezeNet to be compressed to an incredibly small size of less than 0.5MB. This makes it 510 times smaller than the original AlexNet model, making it easier to store and transfer. Overall, SqueezeNet presents a valuable contribution to the field by providing a practical and efficient solution for reducing model size without sacrificing accuracy.","label":48}
{"id":"526569aa-a0d2-4f62-90d5-1c8c364e5708","text":".The paper 'SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size' proposes a small DNN architecture named SqueezeNet that achieves AlexNet-level accuracy on ImageNet with significantly fewer parameters. The authors highlight the advantages of smaller DNN architectures, including reduced communication during distributed training, lower bandwidth requirements for exporting models, and improved deployment feasibility on limited-memory hardware like FPGAs. These advantages make SqueezeNet an appealing solution in resource-constrained scenarios. The paper also mentions the use of model compression techniques, which enables further compression of SqueezeNet to less than 0.5MB. Overall, this work presents an innovative approach to address the challenges associated with deep neural networks and offers practical solutions for efficient and effective model deployment. The experimental results and comparisons with AlexNet provide strong evidence of the effectiveness of the proposed method. However, more detailed evaluations on different datasets and hardware platforms would further validate the claims made in the paper.","label":0}
{"id":"3537b9f6-5f89-4edf-b7f2-5ce2dbf2cecd","text":"The paper proposes a ConvNet architecture (\"SqueezeNet\") and a building block (\"Fire module\") aimed at reducing the model size while maintaining the accuracy. The authors claim that SqueezeNet achieves AlexNet-level accuracy on ImageNet with 50x fewer parameters. They also mention the use of model compression techniques to further reduce the size of SqueezeNet to less than 0.5MB, which is 510x smaller than AlexNet. However, the partial review does not provide further details on the proposed ConvNet architecture or the Fire module. It would be helpful to include a more comprehensive analysis of the experimental results and comparison with other state-of-the-art architectures. Additionally, it would be interesting to explore the computational efficiency and deployment feasibility of SqueezeNet on different hardware platforms.","label":21}
{"id":"2ac00d9d-d61a-4215-9b3e-d89a1b7db549","text":"Strengths\r\n\uf06e-- An interesting proposal for a smaller CNN architecture designed for embedded CNN applications. \r\n\uf06e-- Balanced exploration of CNN macroarchitecture and microarchitecture, demonstrating the potential for significantly reducing model size while maintaining accuracy levels. The experimental results are impressive, with SqueezeNet achieving AlexNet-level accuracy on ImageNet with just 50x fewer parameters. The authors also address the practical implications of smaller DNN architectures, highlighting the reduced communication requirements during distributed training, less bandwidth needed for exporting models, and the feasibility of deployment on memory-limited hardware like FPGAs. These advantages make SqueezeNet a promising solution for various applications, particularly in resource-constrained environments. The authors further showcase their model compression techniques, achieving a remarkable compression ratio of 510x smaller than AlexNet, resulting in a model size of less than 0.5MB. Overall, this paper presents a valuable contribution to the field of deep neural networks and provides a strong case for the adoption of smaller DNN architectures like SqueezeNet.","label":21}
{"id":"e6892dd0-547a-4c10-b7a3-f22ecf131506","text":"Summary: The paper presents a smaller CNN architecture called SqueezeNet for embedded deployment. The paper explores CNN macroarchitecture and microarchitecture to develop SqueezeNet, which is composed of fire modules.\r\n\r\nPros: \r\nAchieves x50 less memory usage than AlexNet while keeping similar accuracy.\r\n\r\nCons & Questions:\r\nComplex by-pass es fire modules that make the architecture complex. It would be helpful to provide more clarity on how the by-pass is implemented and how it contributes to the overall performance of SqueezeNet. Additionally, it would be interesting to see a comparison between SqueezeNet and other compact CNN architectures in terms of inference time and memory usage. Overall, the paper presents a promising approach to reducing the model size while maintaining accuracy, but further experimentation and analysis would strengthen the findings.","label":43}
{"id":"bc45593e-490f-4aa4-8806-381e11904357","text":"The Squeezenet paper came out in Feb 2016, and I read it with interest. It has a series of completely reasonable engineering suggestions for how to save parameter memory for CNNs for object recognition (imagenet). The suggestions make a lot of sense, and provide an excellent compression of about 50x versus the original AlexNet model. The authors demonstrate that SqueezeNet achieves comparable accuracy while drastically reducing the number of parameters. This is a significant achievement as it not only saves memory but also offers practical advantages such as reduced communication during distributed training and easier deployment on resource-constrained hardware. Furthermore, the use of model compression techniques enables the authors to compress SqueezeNet to an impressively small model size of less than 0.5MB. The paper presents a well-justified and well-engineered approach to address the challenges of deep neural network architectures for applications like image recognition. The results are compelling and the proposed SqueezeNet architecture opens up new possibilities for deploying efficient and lightweight models in various real-world scenarios.","label":51}
{"id":"c96467d1-8977-4263-8530-3b882c4ae049","text":"This work proposes to use visualization of gradients to further understand the importance of features (i.e. pixels) for visual classification. Overall, this presented visualizations are interesting, however, the approach is very ad hoc..The paper addresses an important challenge in feature importance quantification in nonlinear deep networks. The use of interior gradients to capture feature importance is a novel approach that shows promise. The authors provide evidence of the widespread saturation phenomenon in various networks, which strengthens the significance of their proposed method. Moreover, the visualization of interior gradients offers a clear understanding of how important features are distributed within the network. The attribution property of interior gradients is another compelling aspect, ensuring that the feature importance scores align with the prediction score. Lastly, the ease of computing interior gradients compared to previous methods is a significant advantage, as it facilitates practical adoption. While the approach may be seen as ad hoc, it presents meaningful insights into feature importance and warrants further exploration.","label":33}
{"id":"f1110b4c-685c-41b8-8bf0-69fd79cd64f4","text":"This paper was reviewed by 3 experts. All 3 seem unconvinced of the contributions, point to several shortcomings, and recommend rejection. I see no basis for overturning their recommendation..I agree with the previous reviewers that the paper's contributions are not convincing and that there are several shortcomings. Based on their recommendations, I also recommend rejecting the paper.","label":29}
{"id":"108b60e6-06e6-4821-9eee-29dfab63ec3b","text":"We added two new sections (2.5 and 2.6) to the paper..The new sections (2.5 and 2.6) provide valuable insights into the application of interior gradients in different deep network architectures. They discuss the results obtained from applying the proposed method to the GoogleNet architecture for object recognition, as well as a ligand-based virtual screening network and an LSTM based language model. These additions demonstrate how interior gradients effectively capture feature importance and highlight their wide applicability in various deep networks. Overall, the inclusion of these sections strengthens the paper and enhances its practical relevance.","label":11}
{"id":"7d3f360d-ad69-4206-a020-fbd795a84c02","text":"We thank the reviewers for a detailed review. The rebuttal below addresses some of the mentioned concerns.\r\n\r\nRegarding \u201cfar too long\u201d and \u201cunnecessarily grandiose name for literally, a scaled image\u201d: \r\n\r\nWe\u2019d agree that the paper is long for the ideas in it. The length stems from the difficulty of not having a crisp evaluation technique for feature importance. So we try to resort to qualitative discussions together with images. But we can definitely try to tighten the writing. We are open to changing the title of the paper to \u201cInterior Gradients\u201d or something like it, though it is worth noting that while scaling intensities seems natural for images, analogous scaling for Text or Drug Discovery models results in inputs that are more obviously fake, i.e., counterfactual.\r\n\r\nRegarding \u201chow the proposed scheme for feature importance ranking is useful\u201d: \r\n\r\nWhile debugging deep networks is hard in general, examining feature importance scores offers a limited but useful insight into the operation of the network on a particular input. For us, the experience with the Drug Discovery network where we found, via our attributions, that the bond features were severely underused (see Section 3.1) was a concrete instance of how feature importance analysis could help debug and improve networks. As we discussed in section 2.7, we do mention the limitations of our paper, we would like to clarify that our proposed scheme for feature importance ranking is useful in providing insights into the operation of deep networks on specific inputs. While debugging deep networks can be challenging, understanding the importance of individual features can help in identifying areas for improvement and optimization. In our study, we found that the bond features in the Drug Discovery network were severely underused, which allowed us to make adjustments and enhance the network's performance (refer to Section 3.1 for details). This concrete example demonstrates how feature importance analysis can contribute to the debugging and refinement of deep networks.\r\n\r\nFurthermore, it is important to note that our method of interior gradients is applicable to a wide range of deep networks, including image recognition, categorical features in virtual screening, and language models. This versatility makes interior gradients a valuable tool for understanding feature importance in diverse domains. We also emphasize that our approach has the attribution property, ensuring that the calculated feature importance scores sum up to the prediction score, adding to the reliability of our method.\r\n\r\nLastly, we appreciate the feedback regarding the length of our paper and the grandiose name. We acknowledge that the paper can be condensed without compromising the clarity of our ideas. We will make efforts to tighten the writing and provide a concise description of our proposed method. We are open to renaming the paper to something more descriptive and appropriate, such as 'Interior Gradients,' capturing the essence of our approach while avoiding unnecessary speculation. Thank you once again for your valuable feedback, and we will take it into careful consideration while revising our paper.","label":216}
{"id":"e1bbb6a7-264e-470f-9d64-7f9ca59766cd","text":"This paper proposes a new method, interior gradients, for analysing feature importance in deep neural networks. The interior gradient is the gradient measured on a scaled version of the input. The integrated gradient is the integral of interior gradients over all scaling factors. Visualizations comparing integrated gradients with standard gradients on real images input to the Inception CNN show that integrated gradients correspond to an intuitive notion of feature importance.\r\n\r\nWhile motivation and qualitative examples are appealing, the paper lacks a comprehensive evaluation and comparison with existing methods. It would be valuable to provide quantitative results to demonstrate the effectiveness of interior gradients in capturing feature importance across various deep networks. Additionally, the paper's claim that interior gradients can be computed as easily as regular gradients needs further explanation and justification. Providing a step-by-step guide or code snippets would greatly enhance the practical adoption of this method. Overall, while the proposed interior gradients show promise in capturing feature importance, further refinement and validation are needed to fully establish its usefulness.","label":79}
{"id":"cc9841e8-8027-4d74-8669-8b0db62d6af1","text":"The authors propose to measure \u201cfeature importance\u201d, or specifically, which pixels contribute most to a network\u2019s classification of an image. A simple (albeit not particularly effective) heuristic for measuring feature importance is to measure the gradients of the predicted class wrt each pixel in an input image I. This assigns a score to each pixel in I (that ranks how much the output prediction would change if a given pixel were to change). In this paper, the authors build on this and propose to measure feature importance by computing gradients of the output wrt scaled version of the input image, alpha*I, where alpha is a scalar between 0 and 1, then summing across all values of alpha to obtain their feature importance score. Here the scaling is simply linear scaling of the pixel values (alpha=0 is all black image, alpha=1 is original image). The authors call these scaled images \u201ccounterfactuals\u201d which seems like quite an unnecessarily grandiose name for literally, a scaled image. \r\n\r\nThe authors show a number of examples where they apply their method to different deep networks, including the GoogleNet architecture for object recognition, a ligand-based virtual screening network, and an LSTM-based language model. They demonstrate how the interior gradients provide a better capture of feature importance compared to the traditional gradient approach. In their experiments, they visualize the computed feature importance scores and observe that they align more closely with human intuition. The authors also highlight the generality of their method, stating that interior gradients are applicable to a wide range of deep networks and exhibit the desirable property of feature importance scores summing up to the prediction score. This property ensures that the importance scores reflect the overall impact of the features on the network's decision and provides a meaningful interpretation. Another significant advantage of interior gradients is their simplicity of computation, which is comparable to standard gradient calculations. This simplicity is in stark contrast to previous methods that were complex to implement, making interior gradients more accessible and easier to adopt in practical applications. Overall, the authors present a compelling approach to quantifying feature importance in deep networks by introducing the concept of interior gradients. Their method addresses the limitation of traditional gradients in capturing the significance of input features, especially in highly non-linear deep models where saturation can occur. The experimental results and visualizations provided in the paper effectively demonstrate the effectiveness and interpretability of interior gradients. Additionally, the authors emphasize the practicality of their method due to its simplicity and widespread applicability. However, it would be useful to explore further how the proposed approach performs in different domains and tasks and to compare it with other state-of-the-art methods for feature importance quantification. Overall, this paper makes a valuable contribution to the field of studying gradients and provides an innovative solution to the problem of capturing feature importance in deep networks.","label":169}
{"id":"120d6cf5-e8c3-4b83-8600-628d12a090eb","text":"This work proposes to use visualization of gradients to further understand the importance of features (i.e. pixels) for visual classification. Overall, this presented visualizations are interesting, however, the approach is very ad hoc. The authors do not explain why visualizing regular gradients isn't correlated with the importance of features relevant to the given visual category and proceed to the interior gradient approach. The authors provide evidence that regular gradients can lead to misleading feature importance rankings in deep networks due to saturation effects. They argue that visualizing interior gradients, which are constructed by scaling down the original input, can provide a better understanding of feature importance. The authors demonstrate the effectiveness of interior gradients in capturing feature importance by applying their method to several deep networks, including GoogleNet, a ligand-based virtual screening network, and an LSTM language model. They also highlight the practical advantages of interior gradients, as they can be computed as easily as regular gradients and have the attribution property. Overall, the approach of using interior gradients for quantifying feature importance in deep networks seems promising and warrants further investigation.","label":61}
{"id":"54074592-97c2-4573-9d05-db5c22f5124e","text":"This work proposes to use visualization of gradients to further understand the importance of features (i.e. pixels) for visual classification. Overall, this presented visualizations are interesting, however, the approach is very ad hoc and lacks a clear framework. The paper mentions that the method is applicable to a wide variety of deep networks, but it would be valuable to include more examples and experiments to support this claim. Additionally, it would be beneficial to discuss the limitations of using interior gradients and potential areas for future research. Overall, this paper presents an interesting approach to better capture feature importance in deep networks, but it would benefit from further development and a more comprehensive evaluation of its effectiveness across different domains.","label":32}
{"id":"487e02e5-778d-4b12-b8ff-0d6df10475ba","text":"This paper was reviewed by 3 experts. All 3 seem unconvinced of the contributions, point to several shortcomings, and recommend rejection. I see no basis for overturning the decision of the 3 experts and also recommend rejecting the paper due to the lack of convincing contributions and various shortcomings.","label":27}
{"id":"3c8e92d1-679a-495b-83b0-463b45679aee","text":"We added two new sections (2.5 and 2.6) to the paper. Section 2.5\r\nproposes two very desirable axioms for attribution methods,\r\nand uses them to rule out other attribution methods from consideration. In Section 2.6, we present experimental results comparing the performance of our proposed interior gradient method with existing attribution methods on various benchmark datasets. Our results demonstrate that interior gradients consistently outperform other methods in capturing feature importance accurately. Overall, the addition of these two sections strengthens the theoretical foundation and empirical evidence supporting the effectiveness and applicability of our method.","label":29}
{"id":"070fb524-337a-4900-9b6e-8606a16b2b26","text":"We thank the reviewers for a detailed review. The rebuttal below addresses some of the mentioned concerns.\r\n\r\nRegarding \u201cfar too long\u201d and \u201cunnecessarily grandiose name for literally, a scaled image\u201d: \r\n\r\nWe\u2019d agree that the paper is long for the ideas in it. The length stems from the difficulty of not having a crisp evaluation technique for feature importance. So we try to resort to qualitative discussions together with images. But we can definitely try to tighten the writing. We are open to changing the title of the paper to \u201cInterior Gradients\u201d or something like it, though it is worth noting that while scaling intensities seems natural for images, analogous scaling for Text or Drug Discovery models results in inputs that are more obviously fake, i.e., counterfactual.\r\n\r\nRegarding \u201chow the proposed scheme for feature importance scores sum to the the prediction score. Additionally, we appreciate the authors' effort in providing a practical method for computing interior gradients. The fact that they can be computed just as easily as gradients is a significant advantage over previous complex methods. However, it would be beneficial if the authors could provide further details on the implementation of the proposed method and any potential limitations or challenges. Overall, the concept of interior gradients and their potential application to a wide variety of deep networks is intriguing, and the visualizations provided in the paper are informative. With some revisions to address the concerns raised by the reviewers, this paper has the potential to make a valuable contribution to the field of feature importance quantification in machine learning models.","label":133}
{"id":"7405ce6d-9f75-4552-a6d0-2d8ca7063483","text":"This paper proposes a new method, interior gradients, for analysing feature importance in deep neural networks. The interior gradient is the gradient measured on a scaled version of the input. The integrated gradient is the integral of interior gradients over all scaling factors. Visualizations comparing integrated gradients with standard gradients on real images input to the GoogleNet architecture for object recognition in images, as well as a ligand-based virtual screening network with categorical features and an LSTM based language model for the Penn Treebank dataset. The paper presents compelling evidence that interior gradients provide a more accurate measure of feature importance compared to standard gradients. The visualizations clearly demonstrate that interior gradients capture important features that are missed by standard gradients, especially in cases where the gradients become saturated. Additionally, the authors highlight the applicability of interior gradients to a wide range of deep networks and emphasize the attribution property, where the feature importance scores sum to the prediction score. This attribute further enhances the interpretability and usefulness of interior gradients. Another notable advantage is that interior gradients can be computed with the same ease as regular gradients, which makes them highly practical for adoption. Overall, this paper makes a valuable contribution to the field of feature importance analysis in deep neural networks.","label":59}
{"id":"1eecad04-f265-440c-9057-c1d777585c8e","text":"The authors propose to measure \u201cfeature importance\u201d, or specifically, which pixels contribute most to a network\u2019s classification of an image. A simple (albeit not particularly effective) heuristic for measuring feature importance is to measure the gradients of the predicted class wrt each pixel in an input image I. This assigns a score to each pixel in I (that ranks how much the output prediction would change if a given pixel were to change). In this paper, the authors build on this and propose to measure feature importance by computing gradients of the output wrt scaled version of the input image, alpha*I, where alpha is a scalar between 0 and 1, then summing across all values of alpha to obtain their feature importance score. Here the scaling is simply linear scaling of the pixel values (alpha=0 is all black image, alpha=1 is original image). The authors call these scaled images \u201ccounterfactuals\u201d which seems like quite an unnecessarily grandiose name for literally, a scaled image. \r\n\r\nThe authors show a number of visualizations that indicate that the proposed feature importance score is more reasonable than just looking at gradients only with respect to the original image. They also show some quantitative evidence that the pixels highlighted by the proposed measure are more likely to fall on the objects rather than the background. Additionally, the authors compare their method with previous approaches that measure feature importance and demonstrate that interior gradients outperform these methods in terms of capturing the true importance of features. This is a significant contribution to the field as it provides a simpler and more effective way to quantify feature importance in deep networks.\r\n\r\nAnother strength of the proposed method is its applicability to a wide range of deep networks. The authors apply their method not only to the GoogleNet architecture for object recognition, but also to a ligand-based virtual screening network with categorical features and an LSTM based language model for the Penn Treebank dataset. The fact that interior gradients can be computed just as easily as gradients is a major advantage over previous complex methods, as it facilitates practical adoption and implementation.\r\n\r\nIn conclusion, the paper presents a novel approach for quantifying feature importance in deep networks through the use of interior gradients. The authors provide both visual and quantitative evidence to support the effectiveness of their method, showcasing its ability to capture feature importance in a more reasonable and accurate manner compared to traditional gradient-based approaches. The proposal of scaled images as counterfactuals might seem trivial, but the results speak for themselves, highlighting the potential impact of this simple yet powerful technique. With its easy implementation and applicability to various deep networks, interior gradients have the potential to significantly advance the field of feature importance analysis in machine learning.\r\n\r\n","label":217}
{"id":"1035ce4a-e055-4f93-be13-d30d182288a3","text":"This work proposes to use visualization of gradients to further understand the importance of features (i.e. pixels) for visual classification..The paper makes an interesting contribution by introducing the concept of interior gradients, which are gradients of counterfactual inputs obtained by scaling down the original input. By examining these interior gradients, the authors aim to address the issue of feature saturation in nonlinear deep networks. They conduct experiments on various networks including the GoogleNet architecture, a ligand-based virtual screening network, and an LSTM based language model. The visualization of interior gradients provides a better understanding of feature importance, and the method is applicable to a wide range of deep networks. Notably, the attribution property of interior gradients ensures that the feature importance scores sum to the prediction score. Moreover, the simplicity of computing interior gradients compared to previous methods makes it more practical for adoption. Overall, the paper presents a valuable approach for quantifying feature importance in machine learning models.","label":20}
{"id":"eb1fa7fe-8b74-468c-b2eb-e3a9e9d4c4d8","text":"The starting point of this work is the understanding that by having decorrelated neurons (e.g. neurons that only fire on background, or only on foreground regions) one provides independent pieces of information to the subsequent decisions.. This is a crucial aspect in deep convolutional neural networks (CNNs) as it leads to improved generalization ability and better feature representation learning. The proposed approach in this paper, called the group orthogonal convolutional neural network (GoCNN), addresses this by incorporating auxiliary annotations as privileged information, which is used to optimize the model and enhance feature diversity within a single model. By exploiting the inherent diversity of the CNN model, the GoCNN is able to maximize the utilization of privileged information and effectively learn features from both foreground and background regions. The experiments conducted on benchmark datasets, ImageNet and PASCAL VOC, validate the effectiveness and high generalization ability of the proposed GoCNN models. Overall, this paper introduces a novel approach that combines privileged information and feature diversity to improve the performance of CNNs and contributes to the advancement of deep learning techniques for image classification tasks.","label":36}
{"id":"34bb48c8-0a81-4603-8c51-866aa90b6f1f","text":"This paper was reviewed by three experts. While they found the idea of incorporating privileged information into a CNN model intriguing, they were concerned about the practicality of implementing the proposed group orthogonal convolutional neural network (GoCNN). They also suggested conducting more experiments on additional datasets to further validate the effectiveness and generalization ability of the proposed approach.","label":8}
{"id":"db0339cc-91e8-44f9-9823-8407f7e5aadb","text":".In this paper, the authors propose a novel CNN model called group orthogonal convolutional neural network (GoCNN) that leverages privileged information as auxiliary annotations to maximize the diversity of a CNN model. This approach aims to improve the feature representation and enhance the generalization ability of the model. Experimental results on ImageNet and PASCAL VOC datasets demonstrate the effectiveness of the proposed GoCNN models. Overall, the paper presents an interesting approach for incorporating privileged information into CNN training and contributes to the advancement of feature learning in deep convolutional neural networks.","label":0}
{"id":"c8b662a0-5a84-45d8-a8d3-798bdc99614f","text":"This paper proposes to learn groups of orthogonal features in a convnet by penalizing correlation among features in each group. The technique is applied in the setting of image classification with \u201cprivileged information\u201d in the form of foreground segmentation masks, where the model is trained to learn orthogonal groups of foreground and background features using the correlation between the different groups. The authors propose a novel model called Group Orthogonal Convolutional Neural Network (GoCNN) to achieve this goal. The GoCNN model takes advantage of the privileged information, which is the foreground segmentation masks, to highlight the diversity of features within a single model.\r\n\r\nThe idea of incorporating privileged information into the training process is a significant contribution of this work. Rather than disregarding the auxiliary annotations, the authors leverage them to enhance the learning of diverse and rich feature representations. By considering foreground and background as separate groups and encouraging orthogonality within each group, the GoCNN model enables the network to capture different aspects of the input image, leading to a stronger generalization ability.\r\n\r\nThe experiments conducted in the paper provide convincing evidence of the effectiveness of the proposed GoCNN models. The authors validate their approach on two standard benchmark datasets: ImageNet and PASCAL VOC. By comparing the performance of GoCNN with other state-of-the-art methods, they demonstrate that their model achieves competitive results in terms of accuracy and generalization.\r\n\r\nThe results on ImageNet show that GoCNN outperforms the baseline model by a significant margin. The analysis of the learned features reveals that the foreground and background features captured by GoCNN are indeed orthogonal, leading to a better discriminative representation. Similarly, the experiments on PASCAL VOC demonstrate the superiority of GoCNN in learning diverse features for object recognition and segmentation tasks. The authors provide detailed ablation studies, demonstrating the importance of privileged information and the effectiveness of the proposed group orthogonal learning.\r\n\r\nFurthermore, the paper discusses related works on feature learning, privileged information, and orthogonal learning. The authors provide a comprehensive review of the existing approaches and highlight the advantages of their proposed method over these techniques. The paper also provides clear explanations of the mathematical formulations used to enforce orthogonality and how the privileged information is incorporated into the learning process.\r\n\r\nOverall, the proposed GoCNN model for learning feature representation with privileged information is a novel and effective contribution to the field of deep convolutional neural networks. The emphasis on diversity and orthogonality within the network enables better generalization and improved performance in object recognition and segmentation tasks. The experimental results on standard benchmark datasets support the claims made by the authors. The paper is well-written, clear, and provides sufficient details to replicate the experiments and understand the proposed model. I recommend accepting this paper for publication.","label":58}
{"id":"6788936a-e6c9-411e-b8d7-8a356cef9402","text":"This paper proposes a modification to ConvNet training so that the feature activations before the linear classifier are divided into groups such that all pairs of features across all pairs of groups are encouraged to have low statistical correlation. Instead of discovering the groups automatically, the work proposes to use supervision, which they call privileged information, to assign features to groups in a hand-coded fashion..The motivation behind this approach is to maximize the diversity of feature representation in convolutional neural networks (CNNs). By incorporating auxiliary annotations as privileged information, the proposed method aims to leverage this additional knowledge to enhance the generalization ability of the CNN model. The authors introduce a group orthogonal CNN (GoCNN) model, which ensures that the foreground and background features are learned in an orthogonal manner. This orthogonality is achieved by exploiting the privileged information during the optimization process. By emphasizing feature diversity within a single model, the GoCNN model aims to learn richer and more diverse feature representations.\r\n\r\nOne of the strengths of this approach is that it allows the model to leverage auxiliary annotations effectively. Instead of completely ignoring the privileged information, the proposed method incorporates it into the training process to guide the feature grouping. This is achieved by hand-coding the assignment of features to groups based on the privileged information. By doing so, the model can learn more distinctive features that capture both foreground and background information effectively.\r\n\r\nThe proposed GoCNN model is evaluated on two benchmark datasets, namely ImageNet and PASCAL VOC. The experimental results demonstrate the effectiveness and high generalization ability of the proposed approach. The GoCNN model achieves superior performance compared to baseline models in terms of classification accuracy and generalization across different datasets. This indicates that the incorporation of privileged information and the emphasis on feature diversity contribute to better feature representation and improved generalization capability.\r\n\r\nAlthough the proposed method shows promising results, there are a few aspects that could be further explored. For instance, it would be interesting to investigate the automatic discovery of feature groups, rather than relying on hand-coded assignment based on privileged information. Additionally, further analysis on the impact of different types of privileged information on the feature grouping process could provide insights into the versatility and adaptability of the proposed approach.\r\n\r\nIn conclusion, this paper presents a novel approach to training CNNs called GoCNN, which leverages privileged information to enhance feature diversity. The experimental results on benchmark datasets demonstrate the effectiveness of the proposed approach in learning richer feature representations and achieving high generalization ability. This work contributes to the field of deep learning by highlighting the importance of incorporating auxiliary annotations and emphasizing feature diversity in CNN models.","label":65}
{"id":"bf0b1595-ba4e-49b3-98ac-d21a76dc44ac","text":"The starting point of this work is the understanding that by having decorrelated neurons (e.g. neurons that only fire on background, or only on foreground regions) one provides independent pieces of information to the subsequent decisions.. This observation is well-motivated, as diverse and rich feature representations are highly desired for deep convolutional neural networks (CNNs). The authors recognize that simply ignoring auxiliary annotations, which can provide valuable information, would be a wasted opportunity. To address this, they propose a novel CNN model called group orthogonal convolutional neural network (GoCNN) that leverages privileged information to maximize the inherent diversity of the model, leading to improved feature representation and stronger generalization ability. By learning features from foreground and background regions in an orthogonal way, the proposed GoCNN automatically emphasizes feature diversity within a single model. This idea is supported by experiments conducted on two benchmark datasets, namely ImageNet and PASCAL VOC, which demonstrate the effectiveness and high generalization ability of the proposed GoCNN models. Overall, the paper presents an interesting approach that effectively incorporates privileged information to enhance the diversity and generalization performance of CNN models.","label":36}
{"id":"c944131d-4315-4f94-b660-595dcc99c1e7","text":"The starting point of this work is the understanding that by having decorrelated neurons (e.g. neurons that only fire on background, or only on foreground regions) one provides independent pieces of information to the subsequent decisions..The authors of this paper propose a novel CNN model called Group Orthogonal Convolutional Neural Network (GoCNN) to incorporate auxiliary annotations as privileged information and maximize the diversity of the feature representation. The motivation behind this work is the recognition that decorrelated neurons, which respond to either foreground or background regions, provide independent and valuable information for decision making. By exploiting privileged information, the proposed GoCNN model ensures that the foreground and background features are learned in an orthogonal manner, emphasizing feature diversity within a single model. The experiments conducted on two benchmark datasets, ImageNet and PASCAL VOC, provide strong evidence of the effectiveness and high generalization ability of the proposed GoCNN models. This work contributes to addressing the issue of wasted auxiliary annotations and enables the learning of richer and more diverse feature representations. Overall, the paper presents a novel approach to incorporate privileged information into CNN models and demonstrates its efficacy through extensive experiments. However, it would be beneficial if the authors provided more insights into the computational complexity and scalability of the proposed approach, as well as a deeper analysis of the interpretability of the learned features. Additionally, more thorough comparisons with existing methods would further strengthen the paper's contributions.","label":36}
{"id":"532b77c9-1f31-4ee1-979f-82d6542d8300","text":"We've already updated the paper. \r\n- The abstract and introduction have been rewritten with more explanation (on the motivation) and comparison. \r\n- The difference from ensemble models was highlighted in this paper. The methodology and experimental results were well-described, showcasing the effectiveness and high generalization ability of the proposed GoCNN models. However, I would suggest providing more details on the specific benchmark datasets used and how the privileged information was incorporated for optimization. Overall, this is a valuable contribution to the field of deep learning.","label":30}
{"id":"4a0b2fb6-718f-48db-b56d-8730a0f7e8d4","text":"This paper proposes to learn groups of orthogonal features in a convnet by penalizing correlation among features in each group. The technique is applied in the setting of image classification with \u201cprivileged information\u201d in the form of foreground segmentation masks, where the model is trained to learn orthogonal groups of foreground and background features using the correlation penalty and an additional \u201cbackground suppression\u201d term.\r\n\r\n\r\nPros:\r\n\r\nProposes a \u201cgroup-wise model diversity\u201d loss term which is novel, to my knowledge.\r\n\r\nThe use of foreground segmentation masks to improve image classification is also novel.\r\n\r\nThe method is evaluated on two standard and relatively large-scale vision datasets: ImageNet and PASCAL VOC 2012.\r\n\r\n\r\nCons:\r\n\r\nThe evaluation is lacking. There should be a baseline that leaves out the background suppression term, so readers know how much that term is contributing to the performance vs. the group orthogonal term. The use of the background suppression term is also confusing to me -- it seems redundant, as the group orthogonality term should be sufficient to suppress the background features. Additionally, the evaluation could benefit from a more in-depth analysis of the impact of the privileged information on the model's performance. While the paper claims the effectiveness and high generalization ability of the proposed GoCNN models, it would strengthen the findings to include a comparison with other state-of-the-art methods in the literature. This would provide a better understanding of the relative performance of the proposed approach. Furthermore, the authors could consider conducting ablation studies to investigate the importance of each component of their method. This would help identify whether the foreground segmentation masks or the group-wise model diversity loss term contribute more significantly to the improvements in performance. Another aspect that could be addressed is the robustness of the proposed approach to variations in the privileged information. Are the results consistent across different foreground masks or do certain types of masks lead to better performance? Providing insights into these questions would enhance the understanding of the strengths and limitations of the proposed technique. Moreover, the paper would benefit from a clearer explanation of the intuition behind using orthogonal groups of features and how it promotes feature diversity. While the concept is briefly mentioned in the introduction, a more detailed discussion would help readers grasp the underlying principles and motivations of the proposed GoCNN model. Overall, this paper presents an interesting approach for incorporating privileged information into convolutional neural networks and leveraging it to learn diverse feature representations. The proposed group orthogonal model shows promise in improving the generalization ability of CNNs. With additional experiments and analyses to support the claims, this work could make a significant contribution to the field of deep learning for image classification.","label":161}
{"id":"2ae22ccf-5aea-4d17-98f3-c2b0ebdb956e","text":"This paper proposes a modification to ConvNet training so that the feature activations before the linear classifier are divided into groups such that all pairs of features across all pairs of groups are encouraged to have low statistical correlation. Instead of discovering the groups automatically, the work proposes to use supervision, which they call privileged information, to assign features to groups in a hand-coded fashion. The developed method is applied to image classification.\r\n\r\nPros:\r\n- The paper is clear and easy to follow\r\n- The experimental results seem to show some benefit from the proposed approach\r\n\r\nCons:\r\n(1) The paper proposes one core idea (group orthogonality w\/ privileged information), but then introduces background feature suppression without much motivation and without careful experimentation\r\n(2) No comparison with an ensemble\r\n(3) Full experiments on ImageNet under the \"partial privileged information\" setting would be helpful to understand the performance of the proposed GoCNN model in a large-scale dataset. Additionally, the paper could benefit from a more comprehensive analysis of the computational complexity and training time of the proposed method compared to traditional CNN models. A limitation of the paper is the hand-coded assignment of features to groups, which may not be feasible in scenarios with large-scale datasets or when the privileged information is not available. It would be interesting to explore automatic methods for assigning features to groups based on unsupervised learning or reinforcement learning techniques. Overall, the paper presents an interesting approach for incorporating privileged information into CNN training and maximizing feature diversity within a model. The experimental results demonstrate the potential benefits of the proposed GoCNN model in improving feature representation and generalization. Addressing the mentioned limitations and conducting further experiments would strengthen the paper and provide a more comprehensive evaluation of the proposed method's effectiveness compared to existing techniques.","label":133}
{"id":"fc64d44e-5c25-40c5-9813-85b98e54d803","text":"The starting point of this work is the understanding that by having decorrelated neurons (e.g. neurons that only fire on background, or only on foreground regions) one provides independent pieces of information to the subsequent decisions..The authors of this paper propose a novel CNN model called the group orthogonal convolutional neural network (GoCNN) to learn features from both foreground and background in an orthogonal manner. They acknowledge the importance of diverse feature representation in deep CNNs and argue that ignoring available auxiliary annotations would be a waste. To address this, they incorporate privileged information as auxiliary annotations into the GoCNN model, aiming to maximize the inherent diversity of the model and improve its generalization ability. By exploiting privileged information for optimization, the proposed GoCNN model emphasizes feature diversity within a single model. The authors conduct experiments on two benchmark datasets, namely ImageNet and PASCAL VOC, to evaluate the effectiveness and generalization ability of their proposed models. The experimental results demonstrate that the GoCNN models achieve remarkable performance, highlighting the benefits of incorporating privileged information for feature learning. Overall, this work presents a promising approach to enhancing feature representation in deep CNNs by leveraging privileged information and emphasizing diversity within the model.","label":36}
{"id":"e5426816-54d3-4b98-a478-4f5126512116","text":"This paper proposes an idea of looking n-steps backward when modelling sequences with RNNs. The proposed RNN does not only use the previous hidden state (t-1) but also looks further back ( (t - k) steps, where k=1,2,3,4 ). The paper also proposes a few different ways to aggregate multiple hidden states from the past.\r\n\r\n\r\nThe reviewer can see few issues with this paper.\r\n\r\nFirstly, the writing of this paper requires improvement. The introduction and motivation sections are not clear and do not provide sufficient background information to understand the novelty of the proposed approach. The authors briefly mention that the proposed higher order RNNs (HORNNs) better model long-term dependencies in sequential data, but they do not clearly explain why existing RNN architectures, such as LSTMs, are insufficient in capturing such dependencies. Providing a more detailed comparison with existing models and highlighting the limitations of current approaches would strengthen the motivation for this work.\r\n\r\nAnother concern is the lack of experimental details in the paper. While the authors claim to have conducted experiments on two popular datasets, namely the Penn Treebank and English text8, they do not provide any information about the experimental setup, such as the hyperparameters used, the size of the training set, or the evaluation metrics employed. Without these details, it is difficult to reproduce the experiments and assess the robustness of the proposed approach.\r\n\r\nFurthermore, the paper lacks a thorough analysis of the proposed HORNNs. The authors mention that the HORNNs yield state-of-the-art performance on the datasets, outperforming regular RNNs and LSTMs. However, it would be beneficial to provide additional insights into why the HORNNs perform better. Is it due to their ability to capture longer-term dependencies, or do they have other advantages over existing models? Including a detailed analysis and comparison with baselines would provide a more comprehensive evaluation of the proposed approach.\r\n\r\nOverall, while the idea of using higher order RNNs to model long-term dependencies in sequential data is interesting, the paper needs significant improvements. The writing should be revised to provide clearer explanations of the motivation and novelty of the proposed approach. Additionally, more experimental details and a thorough analysis of the results are necessary to support the claims made by the authors. With these revisions, the paper has the potential to make a valuable contribution to the field of sequence modeling with neural networks.","label":73}
{"id":"03579e07-e85b-495d-9631-4752fa51ae8e","text":"Paper presents the idea of using higher order recurrence in LSTMs. The ideas are well presented and easy to follow.\r\n However, the results are far from convincing, easily being below well established numbers in the domain. Since the mode is but a preliminary study, further experiments need to be conducted in order to validate the claims made in the paper. Additionally, the paper could benefit from a more detailed analysis and explanation of the proposed higher order RNN structures. Despite these limitations, the idea of using more memory units to improve short-term memory mechanism in RNNs is promising and could potentially lead to advancements in sequence modeling tasks. Overall, this paper provides a good foundation for future research in the area of higher order recurrent neural networks.","label":42}
{"id":"e36318db-1324-439a-8b00-c7d1ec02668c","text":"The authors of the paper explore the idea of incorporating skip connections *over time* for RNNs. Even though the basic idea is not particularly innovative, a few proposals on how to merge that information into the current hidden state with different pooling functions are evaluated. The different models are compared on two popular text benchmarks.\r\n\r\nSome points.\r\n\r\n1) The experiments feature only a subset of the benchmark datasets, which may limit the generalizability of the findings. It would be valuable to see the performance of the proposed HORNNs on a wider range of tasks and datasets. Additionally, the paper does not provide a thorough analysis of the computational complexity of the proposed models compared to regular RNNs and LSTMs. This information would be useful for understanding the feasibility of implementing HORNNs in practical applications. Despite these limitations, the results of the experiments are impressive, with the proposed HORNNs consistently outperforming regular RNNs and LSTMs on the language modeling task. The state-of-the-art performance achieved on both the Penn Treebank and English text8 datasets showcases the potential of HORNNs for improving sequence modeling. Overall, this paper presents an interesting exploration of higher order RNNs and their application to long term dependency modeling, and it opens up avenues for further research in this area.","label":60}
{"id":"6e2ddbdc-8079-4bba-bfee-2ab3925df424","text":"I think the backbone of the paper is interesting and could lead to something potentially quite useful. I like the idea of connecting signal processing with recurrent network and then using tools from one setting in the other. However, while the work has nuggets of very interesting observations, I feel they can be put together in a more coherent and systematic manner. The paper would benefit from providing more detailed explanations of the proposed higher order RNN structures and their advantages over regular RNNs and LSTMs. Additionally, the authors should include more information about the experimental setup, such as the hyperparameters used and the evaluation metrics employed. Overall, the paper shows promise, but it needs further development to fully convince the reader of the superiority of HORNNs in sequence modeling tasks.","label":57}
{"id":"ef57955c-0ad6-4ccc-b52b-66e9af1ec9c2","text":"This paper proposes an idea of looking n-steps backward when modelling sequences with RNNs. The proposed RNN does not only use the previous hidden state (t-1) but also looks further back ( (t - k) steps, where k=1,2,3,4 ). The paper also proposes a few different ways to aggregate multiple hidden states from the past.\r\n\r\n\r\nThe reviewer can see few issues with this paper.\r\n\r\nFirstly, the writing of this paper requires improvement. The introduction and abstract are wasting too much space just to explain unrelated facts or to describe already well-known things in the literature. Some of the statements written in the paper are misleading. For instance, it explains, \u201cAmong various neural network models, recurrent neural networks (RNNs) are appealing for modeling sequential data because they can capture long term dependency in sequential data using a simple mechanism of recurrent feedback\u201d and then it says RNNs cannot actually capture long-term dependencies that well..Furthermore, the experimental evaluation of the proposed HORNNs for language modeling is limited to only two datasets, the Penn Treebank (PTB) and English text8. While the results on these datasets show significant improvement over regular RNNs and LSTMs, it would be beneficial to test HORNNs on a wider range of sequential datasets to assess their generalizability. Additionally, the paper does not provide a detailed analysis of the computational complexity of HORNNs compared to other models. This information would be valuable to determine the feasibility and efficiency of implementing HORNNs in real-world applications.\r\n\r\nIn conclusion, the paper presents a compelling idea of using higher order recurrent neural networks (HORNNs) to capture long-term dependencies in sequential data. The proposal of looking n-steps backward in addition to the previous hidden states shows promise in improving the short-term memory mechanism of RNNs. The experimental results on the Penn Treebank and English text8 datasets demonstrate the superior performance of HORNNs compared to regular RNNs and LSTMs for language modeling. However, there are certain areas that require improvement, such as the clarity and conciseness of the writing, the inclusion of more diverse datasets for evaluation, and the analysis of computational complexity. Addressing these issues would enhance the overall quality and impact of the paper.","label":151}
{"id":"bc5891ee-a745-491c-977a-d342d5529237","text":"This paper proposes an idea of looking n-steps backward when modelling sequences with RNNs. The proposed RNN does not only use the previous hidden state (t-1) but also looks further back ( (t - k) steps, where k=1,2,3,4 ). The paper also proposes a few different ways to aggregate multiple hidden states from the past.\r\n\r\n\r\nThe reviewer can see few issues with this paper.\r\n\r\nFirstly, the writing of this paper requires improvement. The introduction and abstract are wasting too much space just to explain unrelated facts or to describe already well-known things in the literature. Some of the statements written in the paper are misleading. For instance, it explains, \u201cAmong various neural network models, recurrent neural networks (RNNs) are appealing for modeling sequential data because they can capture long term dependency in sequential data using a simple mechanism of recurrent feedback\u201d and then it says RNNs cannot actually capture long-term dependencies that well. RNNs are appealing in the first place because they can handle variable length sequences and can model temporal relationships between each symbol in a sequence. The criticism against LSTMs is hard to accept when it says: LSTMs are slow and because of the slowness, they are hard to scale at larger tasks. But we all know that some companies are already using LSTMs for large-scale tasks, indicating that their scalability issue is not as severe as claimed. Moreover, it is not clear how the proposed higher order RNNs (HORNNs) address the problem of scalability. The paper lacks a clear explanation of the mechanisms that enable HORNNs to capture long-term dependencies more effectively than traditional RNNs or LSTMs.\r\n\r\nAdditionally, the experimental evaluation presented in the paper needs to be more comprehensive. The authors only evaluate the performance of HORNNs on language modeling tasks using the Penn Treebank and English text8 datasets. It would be beneficial if they tested the proposed models on a wider range of sequence modeling tasks to validate the general applicability of HORNNs. The comparison with other models is also limited to regular RNNs and LSTMs. It would be more informative if the authors included a comparison with other well-established models in the field, such as Gated Recurrent Units (GRUs) or Transformer models.\r\n\r\nFurthermore, the authors should provide more details on the training procedure and hyperparameter settings. It is not specified how the models were trained, what optimization algorithm was used, or how the hyperparameters were chosen. This lack of information makes it difficult to reproduce the results and assess the robustness of the proposed approach.\r\n\r\nIn summary, while the idea of higher order RNNs is interesting and has the potential to improve modeling of long-term dependencies, this paper needs significant improvements in terms of writing clarity, experimental evaluation, and providing more detailed information about the proposed method. Addressing these concerns would greatly enhance the quality and impact of the paper.","label":213}
{"id":"e90d9b60-0852-47db-aa68-1b0c900fee88","text":"Paper presents the idea of using higher order recurrence in LSTMs..The idea of using higher order recurrence in LSTMs presented in the paper is an interesting approach to improve the modeling of long term dependency in sequential data. By incorporating more memory units to keep track of preceding states in recurrent neural networks (RNNs), the proposed higher order RNNs (HORNNs) aim to enhance the short-term memory mechanism and better learn long term dependency in sequences. The paper highlights how HORNNs, similar to RNNs, can be learned using the back-propagation through time method, making them applicable to various sequence modeling tasks. The experimental results showcased in the paper are particularly impressive, demonstrating that HORNNs outperform regular RNNs and even popular LSTMs in terms of language modeling performance on the Penn Treebank (PTB) and English text8 datasets. Overall, the paper presents a promising advancement in the field of sequential data modeling with higher order RNN structures.","label":11}
{"id":"bb8b431f-50fe-402b-b516-93f1afb842d3","text":"The authors of the paper explore the idea of incorporating skip connections *over time* for RNNs. Even though the basic idea is not particularly innovative, a few proposals on how to merge that information into the current hidden state with different pooling functions are evaluated. The different models are compared on the Penn Treebank (PTB) and English text8 datasets, and the results show that the proposed HORNNs achieve state-of-the-art performance on both datasets. The experimental comparison with regular RNNs and LSTMs indicates that HORNNs consistently outperform these models. This demonstrates the effectiveness of incorporating skip connections over time in RNNs for improving the modeling of long-term dependencies in sequential data. The paper is well-written and provides a clear explanation of the proposed HORNN architecture and its advantages over traditional RNNs. The experiments are thorough and the results are presented in a concise and easy-to-understand manner. However, there are some areas that could be further improved. Firstly, the authors should provide more insights into the choice of pooling functions used in the experiments and explain their impact on the performance of the HORNN models. Additionally, more discussion on the limitations of the HORNN approach and potential future directions for research would be beneficial.","label":51}
{"id":"fc887d79-d2dd-42a7-8ef8-1b55db1e0f98","text":"I think the backbone of the paper is interesting and could lead to something potentially quite useful. I like the idea of connecting signal processing with recurrent network and then using tools from one setting in the other. However, while the work has nuggets of very interesting observations, I feel they can be put together in a better way. However, the paper could benefit from a more organized structure that clearly presents the motivations, methods, and results of the proposed higher order RNNs. Additionally, providing more detailed comparisons with existing models such as LSTM and GRU, and discussing the limitations of the proposed HORNNs would enhance the overall contribution of the paper. Overall, with minor revisions and expansions, this paper has the potential to make a significant impact in the field of sequence modeling.","label":59}
{"id":"822ec136-8632-44be-b183-30928785ec47","text":"This paper proposes an idea of looking n-steps backward when modelling sequences with RNNs. The proposed RNN does not only use the previous hidden state (t-1) but also looks further back ( (t - k) steps, where k=1,2,3,4 ). The paper also proposes a few different ways to aggregate multiple hidden states from the past.\r\n\r\n\r\nThe reviewer can see few issues with this paper.\r\n\r\nFirstly, the writing of this paper requires improvement. The introduction and abstract are wasting too much space just to explain unrelated facts or to describe already well-known things in the literature. Some of the statements written in the paper are misleading. For instance, it explains, \u201cAmong various neural network models, recurrent neural networks (RNNs) are appealing for modeling sequential data because they can capture long term dependency in sequential data using a simple mechanism of recurrent feedback\u201d and then it says RNNs cannot actually capture long-term dependencies that well. RNNs are appealing in the first place because they can handle variable length sequences and can model temporal relationships between each symbol in a sequence. The criticism against LSTMs is hard The criticism against LSTMs is hard to understand without a clear comparison between the proposed HORNNs and LSTMs in terms of their ability to capture long-term dependencies. Additionally, the experiment results presented in the paper need to be more comprehensive and rigorous. The authors only evaluate the proposed HORNNs on language modeling tasks using two specific datasets, namely the Penn Treebank and English text8. It would be beneficial to include experiments on other sequence modeling tasks and datasets to validate the general applicability of HORNNs. Furthermore, the paper lacks a discussion on the computational efficiency of HORNNs compared to traditional RNNs and LSTMs. Training and evaluating deeper HORNN architectures may require significantly more computational resources, which should be considered and discussed. Finally, the paper would benefit from a more thorough explanation and analysis of the different ways to aggregate multiple hidden states from the past. Additional experiments could be conducted to explore the impact of different aggregation methods on the performance of HORNNs. Overall, while the idea of using more memory units to model long-term dependencies is interesting, the paper could be improved with clearer writing, more comprehensive experimentation, and further analysis of the proposed HORNNs in relation to other recurrent neural networks.","label":182}
{"id":"9b288828-8682-4817-8445-fb58bc97f07a","text":"This is a well-conducted and well-written study on the prediction of medication from diagnostic codes..The study addresses an important issue in electronic medical records, namely the omission of active medications and the need for computational tools to suggest missing or incorrect medications. The use of recurrent neural networks, specifically the GRU model, achieved high prediction accuracy with a micro-averaged AUC of 0.93 and Label Ranking Loss of 0.076. The findings also highlight the potential of these models to identify errors and omissions in the data, which can lead to improvements in medication tracking. Overall, this study is well-conducted, well-written, and contributes valuable insights to the field of predictive medicine.","label":15}
{"id":"3a154f06-2b14-447c-9e4e-4f9c25650419","text":"This paper applies RNNs to predict medications from billing costs. While this paper does not have technical novelty, it is well done and well organized..The paper provides a significant contribution in addressing the issue of omitted medications and billing code errors in electronic medical records. By utilizing recurrent neural networks, the authors were able to achieve high prediction accuracy for therapeutic classes of medications. Although the paper lacks technical novelty, it is well-executed, showcasing a clear and organized structure. Furthermore, the examination of individual cases lends support to the potential of these models to assist in correcting errors and omissions in the data. Overall, this paper presents a valuable approach to reducing manual efforts in maintaining accurate medication lists, thus benefiting the healthcare industry.","label":25}
{"id":"3ec8d759-a319-4821-9e76-9e4957798539","text":"This is a well written, organized, and presented paper that I enjoyed reading. I commend the authors on their attention to the narrative and the explanations. While it did not present any new methodology or architecture, it instead addressed an important application of predicting the medications a patient is using, given the record of billing codes. The dataset they use is impressive and useful and, frankly, more interesting than the typical toy datasets in machine learning. That said, the investigation of those results was not as deep as I thought it should have been in an empirical\/applications paper. Despite their focus on the application, I was encouraged to see the authors use cutting edge choices (eg Keras,.TensorFlow) for their modeling framework. The authors performed a thorough evaluation of their approach, using multiple evaluation metrics such as micro-averaged AUC and Label Ranking Loss. These metrics indicate the predictive accuracy and the ability of the model to rank the medications correctly, respectively. The reported results demonstrate that the proposed GRU model achieved high prediction accuracy with a micro-averaged AUC of 0.93 and a Label Ranking Loss of 0.076. However, the authors acknowledged the limitation of their model size due to hardware constraints, which could potentially be addressed in future work. One aspect that I found particularly interesting in this paper is the examination of individual cases, which revealed that some of the predicted incorrect medications were likely examples of omitted medications or billing codes. This supports the authors' claim of substantial errors and omissions in the data, further highlighting the potential of models like these to help correct such errors. The paper also emphasized the practical implication of reducing the tedious human labor involved in maintaining medication lists by leveraging computational tools. Overall, I believe this paper presents a valuable contribution to the field by addressing an important application in the healthcare domain. The authors' use of a real-world dataset and the robust evaluation of their approach contribute to the reliability and generalizability of their findings. I appreciate the authors' efforts in presenting a well-written and organized paper that effectively communicates the motivation, methodology, and results of their study. As a reader, I would have liked to see a deeper analysis of the obtained results and further discussion on the limitations and potential future directions of the proposed approach. Nevertheless, I believe this work provides a solid foundation for future research in the field of medication prediction from diagnostic codes using recurrent neural networks.","label":122}
{"id":"f7f76d4d-90f4-45aa-bb4b-80dd12983350","text":"In light of the detailed author responses and further updates to the manuscript, I am raising my score to an 8 and reiterating my support for this paper. I think it will be among the strongest non-traditional applied deep learning work at ICLR and will receive a great deal of interest and attention from attendees.\r\n\r\n-----\r\n\r\nThis paper describes modern deep learning approach to the problem of predicting the medications taken by a patient during a period of time based solely upon the sequence of ICD-9 codes assigned to the patient during that same time period. This problem is formulated as a multilabel sequence classification (in contrast to language modeling, which is multiclass classification). They propose to use standard LSTM and GRU architectures with embedding layers to handle the sparse categorical inputs, similar to that described in related work by Choi, et al. In experiments using a cohort of ~610K patient records, they find that RNN models outperform strong baselines including an MLP and a random forest, as well as a common sense baseline. The differences in performance between the recurrent models and the MLP appear to be large enough to be significant, given the size of the test set.\r\n\r\nStrengths:\r\n- Very important problem. As the authors point out, two the value propositions of EHRs -- which have been widely adopted throughout the US due to a combination of legislation and billions of dollars in incentives from the federal government -- included more accurate records and fewer medication mistakes. These two benefits have largely failed to materialize. This seems like a major opportunity for data mining and machine learning.\r\n- Paper is well-written with lucid introduction and motivation, thorough discussion of related work, clear description of experiments and metrics, and interesting qualitative analysis of results.\r\n- Empirical results are solid with a strong win for RNNs over convincing baselines. This is in contrast to some recent related papers, including Lipton & Kale et al, ICLR 2016, where the gap between the RNN and MLP was relatively small, and Choi et al, MLHC 2016, which omitted many obvious baselines.\r\n- Discussion is thorough and thoughtful. The authors are right about the thoroughness of the discussion and the importance of addressing the issue of errors and omissions in electronic medical records. The authors highlight the potential impact of their approach in reducing the manual labor involved in maintaining medication lists, which can lead to more accurate records and fewer medication mistakes. This paper not only presents a novel deep learning approach to predicting medications from diagnostic codes but also contributes to the broader field of data mining and machine learning in healthcare. There are a few aspects that could further strengthen the paper. Firstly, it would be beneficial to provide more details about the dataset used, such as the demographic characteristics of the patients and any potential biases or limitations. This would help to better understand the generalizability of the findings and potential implications for different patient populations. Secondly, it would be valuable to include a comprehensive comparison with other relevant studies in the field. The authors mention a few related works, but a more detailed comparison and discussion of the similarities and differences in approaches, datasets, and results would enhance the paper's contribution to the literature. 
Additionally, the authors mention hardware constraints on model size limiting the performance of their best model. It would be interesting to explore potential solutions or optimizations to overcome these limitations and improve the performance even further. The authors could also consider discussing the computational requirements of their models and any potential scalability challenges when applying them to larger datasets or real-world healthcare systems. Overall, this paper presents a strong contribution to the field of predicting medications from diagnostic codes using recurrent neural networks. The importance and relevance of this problem are well-motivated, and the empirical results demonstrate the superiority of the proposed approach over convincing baselines. The thoroughness of the discussion and the potential impact on improving the accuracy of medication records make this work highly valuable to both the research community and the healthcare industry. With some minor additions and clarifications, this paper has the potential to be even more impactful and influential in the field. I highly recommend accepting this paper for publication.","label":354}
{"id":"14fde81b-10ab-4e7d-a067-e499efd23485","text":"This is a well-conducted and well-written study The paper presents a significant problem in electronic medical records: the omission of active medications from patient lists. The use of recurrent neural networks to predict therapeutic classes of medications based on billing codes is a promising approach. The results of the study demonstrate a high prediction accuracy, indicating the potential of these models to assist in identifying missing or incorrect medications. The examination of individual cases further supports the existence of errors and omissions in the data. Overall, this study makes a valuable contribution to improving medication tracking in electronic medical records.","label":7}
{"id":"7cfae446-3854-4787-9b70-fe3b077fe41d","text":"This is a well-conducted and well-written study on the prediction of medication using recurrent neural networks based on a patient's billing codes. The authors highlight the issue of incomplete medication lists in electronic medical records and propose a computational solution to suggest missing or incorrect medications. Their best model, a GRU, achieved impressive prediction accuracy with high AUC and low Label Ranking Loss. Furthermore, the authors demonstrate the potential of their model to identify errors and omissions in the data. Overall, this study contributes valuable insights and offers promising avenues for improving medication tracking in healthcare systems.","label":12}
{"id":"e42206bc-e3aa-4d92-8761-b56d413518ef","text":"This paper applies RNNs to predict medications from billing costs. While this paper does not have technical novelty, it is well done and well organized..The paper provides a valuable contribution by addressing the issue of medication tracking in electronic medical records. The authors highlight the high error rates and omissions in current systems, which hinder the accurate representation of a patient's medications. The use of recurrent neural networks to predict therapeutic classes based on billing codes is a practical approach to mitigate this problem. While the paper lacks technical novelty, it is well-executed and the results demonstrate strong prediction accuracy. The authors also provide insightful analysis by identifying potential sources of error in the predictions. Overall, this paper offers a valuable step towards improving medication tracking and reducing the manual efforts required for maintaining accurate records.","label":25}
{"id":"95985fa4-6c9b-4409-b95d-818b76c899ec","text":"This is a well written, organized, and presented paper that I enjoyed reading. I commend the authors on their attention to the narrative and the explanations. While it did not present any new methodology or architecture, it instead addressed an important application of predicting the medications a patient is using, given the record of billing codes. The dataset they use is impressive and useful and, frankly, more interesting than the typical toy datasets in machine learning. That said, the investigation of those results was not as deep as I thought it should have been in an empirical\/applications paper. Despite their focus on the application, I was encouraged to see that the authors acknowledged the limitations of their model due to hardware constraints on model size. It would have been interesting to see a discussion on potential optimizations or ways to overcome these limitations in future work. Additionally, while the paper mentioned that some of the predictions marked incorrect were likely examples of omitted medications or billing codes, it would have been helpful to provide more detailed analysis and examples of these cases. This could have enhanced the understanding of the potential errors and omissions in the data and how the model could help address them. Furthermore, it would have been beneficial if the paper had discussed the practical implications and potential applications of their work in the healthcare industry. For instance, how could the predictions from the recurrent neural networks be integrated into existing electronic medical record systems to improve medication tracking and reduce human labor? Overall, this paper presents an important and relevant application of machine learning techniques to address medication errors and omissions in electronic medical records. The authors should be commended for their clear writing style, thoroughness in presenting the dataset, and solid results. However, there are areas where the investigation and analysis could have been further improved. I would recommend the authors to consider addressing the limitations of their model, providing more detailed analysis of incorrect predictions, and discussing the practical implications of their work in future revisions of the paper.","label":114}
{"id":"74c52980-c0a1-40eb-9c70-1b9a203e350e","text":"In light of the detailed author responses and further updates to the manuscript, I am raising my score to an 8 and reiterating my support for this paper. I think it will be among the strongest non-traditional applied deep learning work at ICLR and will receive a great deal of interest and attention from attendees.\r\n\r\n-----\r\n\r\nThis paper describes modern deep learning approach to the problem of predicting the medications taken by a patient during a period of time based solely upon the sequence of ICD-9 codes assigned to the patient during that same time period. This problem is formulated as a multilabel sequence classification (in contrast to language modeling, which is multiclass classification). They propose to use standard LSTM and GRU architectures with embedding layers to handle the sparse categorical inputs, similar to that described in related work by Choi, et al. In experiments using a cohort of ~610K patient records, they find that RNN models outperform strong baselines including an MLP and a random forest, as well as a common sense baseline. The differences in performance between the recurrent models and the MLP appear to be large enough to be significant, given the size of the test set.\r\n\r\nStrengths:\r\n- Very important problem. As the authors point out, two the value propositions of EHRs -- which have been widely adopted throughout the US due to a combination of legislation and billions of dollars in incentives from the federal government -- included more accurate records and fewer medication mistakes. These two benefits have largely failed to materialize. This seems like a major opportunity for data mining and machine learning.\r\n- Paper is well-written with lucid introduction and motivation, thorough discussion of related work, clear description of experiments and metrics, and interesting qualitative analysis of results.\r\n- Empirical results are solid with a strong win for RNNs over convincing baselines. This is in contrast to some recent related papers, including Lipton & Kale et al, ICLR 2016, where the gap between the RNN and MLP was relatively small, and Choi et al, MLHC 2016, which omitted many obvious baselines.\r\n- Discussion is thorough and thoughtful. The authors are right about the kidney code embedding results: this is a very promising result.\r\n\r\nWeaknesses:\r\n- The authors make several unintuitive decisions related to data preprocessing and experimental design, foremost among them the choice to use ICD-9 codes as the basis for prediction instead of prescription records. While I understand their rationale for using ICD-9 codes, as they are readily available and more consistent across different healthcare systems, it would have been beneficial to include a discussion on the potential limitations of this choice. For example, ICD-9 codes are primarily used for billing and reimbursement purposes and may not always accurately capture the specific medications prescribed to a patient. Additionally, the authors mention that they ignored any medications not in the therapeutic classes they were targeting. This exclusion may introduce bias and limit the generalizability of their results to other therapeutic classes. It would have been valuable to include an analysis of the impact of this exclusion on the performance of their models.\r\n\r\nThe authors mention that hardware constraints limited the size of their best model, a GRU. 
It would be helpful to know more about these hardware constraints and how they may have influenced the results. Specifically, did the size limitation have any significant impact on the performance of the model? Were there any indications that a larger model could have achieved even higher prediction accuracy?\r\n\r\nWhile the paper provides a comprehensive evaluation of their models, it would have been interesting to see more in-depth analysis of the errors made by the models. Understanding the types of errors and potential sources of mispredictions could provide valuable insights for further improving the models and addressing the issues with medication tracking in electronic medical records. Additionally, it would have been beneficial to discuss the potential implications and challenges of implementing such models in real-world clinical settings. How would these models integrate with existing electronic medical record systems? What are the potential barriers to adoption and how can they be addressed?\r\n\r\nOverall, this paper presents a compelling application of recurrent neural networks to predict medications from diagnostic codes in electronic medical records. The authors address an important problem and provide solid empirical evidence to support the superiority of their models over baselines. However, there are some limitations and unanswered questions that could be further explored to strengthen the paper. I would recommend considering these suggestions in future revisions.","label":381}
{"id":"c6df0bd1-a7cb-4ce2-9ef1-4e6f9b5b3d3c","text":"This is a well-conducted and well-written study on the prediction of medication from diagnostic codes. The authors compared GRUs, LSTMs, feed-forward networks and found that the best model was a GRU, achieving high prediction accuracy with a micro-averaged AUC of 0.93 and a Label Ranking Loss of 0.076. The study also revealed that many incorrect predictions were likely due to omitted medications or billing codes, highlighting the potential of these models to help correct errors and omissions in the data.","label":22}
{"id":"777c19fa-2e5f-4e4b-9845-5288b42384c1","text":"This paper investigates the use of eligibility traces with recurrent DQN agents. As in other recent work on deep RL, the forward view of Sutton and Barto is used to make eligibility traces practical to use with neural networks. Experiments on the Atari games Pong and Tennis show that traces work better than standard Q-learning.\r\n\r\nThe paper is well written and the use of traces in deep networks is explored in a clear and concise manner. The authors present a thorough analysis of the benefits and limitations of using eligibility traces in combination with recurrent networks. The experiments conducted on the Atari games Pong and Tennis provide compelling evidence that traces effectively enhance the learning process and outperform standard Q-learning. The results show that the combination of eligibility traces and recurrent networks allows for faster training and improved performance in these games. Furthermore, the authors emphasize the importance of optimization techniques in the training process. It would be interesting to see further experiments on a wider range of Atari games to assess the generalizability of the findings. The paper is well organized, with a clear introduction, detailed methodology, and comprehensive results section. The authors provide sufficient background information on eligibility traces and reinforcement learning to make the paper accessible to readers with varying levels of expertise in the field. Additionally, the inclusion of relevant references further strengthens the paper's validity. Overall, this paper makes a valuable contribution to the field of deep reinforcement learning by investigating the combination of eligibility traces and recurrent networks. The findings presented in this paper have important implications for improving the training efficiency and performance of deep Q-networks in reinforcement learning tasks.","label":66}
{"id":"c114c7b0-1567-4a8f-91e2-e699cbd64e25","text":"The reviewers agree that the paper is clear and well-written, but all reviewers raised significant concerns about the novelty of the work, since the proposed algorithm is a combination of well-known techniques in reinforcement learning. It is not evident that the authors have introduced any novel contributions to the field. The combination of eligibility traces with recurrent networks has been explored in the past, and the benefits of both techniques have been widely recognized. While the authors highlight the importance of optimization, it is not clear how this optimization differs from previous approaches. Overall, the reviewers urge the authors to clarify the novelty of their work and provide a more thorough comparison with existing methods in order to strengthen the paper.","label":36}
{"id":"2f8368f7-4815-47c8-aba8-11643cbd478f","text":"This paper combines DRQN with eligibility traces, and also experiment with the Adam optimizer for optimizing the q-network. This direction is worth exploring, and the experiments demonstrate the benefit from using eligibility traces and Adam on two Atari games. The methods themselves are not novel. Thus, the primary contributions are (1) applying eligibility traces and Adam to DRQN and (2) the experimental evaluation..The paper Investigating Recurrence and Eligibility Traces in Deep Q-Networks explores the combination of eligibility traces with recurrent networks in the context of reinforcement learning. It investigates the benefits of using eligibility traces and the Adam optimizer for optimizing the q-network in the Atari domain. Although the methods employed in this paper are not novel, the primary contributions lie in the application of eligibility traces and Adam to DRQN, as well as the experimental evaluations performed. The experiments conducted on two Atari games demonstrate the advantages of using eligibility traces and Adam for faster training and improved performance. Overall, this direction of research is valuable and the paper provides valuable insights into the optimization techniques used in training DRQN models.","label":63}
{"id":"28047160-fe3e-40dd-b66f-3c5d81c2d4fb","text":"The paper presents a deep RL with eligibility traces. The authors combine DRQN with eligibility traces for improved training. The new algorithm is evaluated on a two The new algorithm is evaluated on a two games in the Atari domain. The results show that the combination of recurrent networks and eligibility traces outperforms the baseline DRQN algorithm in terms of training time and final performance. The authors also investigate the impact of different optimization techniques on the performance of the proposed algorithm. They find that a carefully designed optimization strategy can further enhance the learning process. However, some limitations of the approach are also discussed, such as the increased complexity and computational cost due to the use of recurrent networks. Overall, the study provides valuable insights into the benefits and challenges of integrating eligibility traces with deep Q-networks.","label":27}
{"id":"e72f70e3-9711-4141-b336-d15accc52422","text":"This paper investigates the use of eligibility traces with recurrent DQN agents. As in other recent work on deep RL, the forward view of Sutton and Barto is used to make eligibility traces practical to use with neural networks. Experiments on the Atari games Pong and Tennis show that traces work better than standard Q-learning.\r\n\r\nThe paper is well written and the use of traces in deep RL is indeed underexplored, but the experiments in the paper are too limited and do not answer the most interesting questions.\r\n\r\nAs pointed out in the questions, n-step returns have been shown to work better than 1-step returns both in the classical RL literature and more recently with deep networks. [1] shows that using n-step returns in the forward view with neural networks leads to big improvements on both Atari and TORCS. Their n-step Q-learning method also combines returns of different length in expectation, while traces do this explicitly..The paper also discusses the importance of the optimization used in the training of the recurrent DQN agents. This is an important point as the success of deep RL algorithms heavily depends on the choice of the optimization method. However, the paper lacks a detailed analysis of the optimization techniques employed and how they contributed to the performance of the agents. It would have been interesting to see a comparison between different optimization methods and how they affect the training time and final performance. Additionally, the experiments in the paper are limited to only two Atari games, Pong and Tennis. While these games provide valuable insights into the benefits of using eligibility traces with recurrent networks, a more diverse set of games should have been included to demonstrate the generality and robustness of the proposed approach. Overall, this paper provides an initial exploration into the use of eligibility traces in combination with recurrent DQN agents, but further experiments and analysis are necessary to fully comprehend the potential of this approach in more complex environments.","label":154}
{"id":"804dfc1b-a14f-4386-8057-6dcb4623d662","text":"This paper investigates the use of eligibility traces with recurrent DQN agents. As in other recent work on deep RL, the forward view of Sutton and Barto is used to make eligibility traces practical to use with neural networks. Experiments involving Atari games were conducted to assess the effectiveness of eligibility traces in combination with recurrent networks. The experiments utilized the forward view of Sutton and Barto, which enables eligibility traces to be efficiently incorporated into neural networks. The results of the experiments demonstrated notable benefits of both recurrent networks and eligibility traces in enhancing the performance of deep Q-network (DQN) agents in the Atari domain.\r\n\r\nBy utilizing eligibility traces, the agents were able to propagate knowledge back over time-steps during a single update. This bias-variance trade-off facilitated faster training and improved the overall efficiency of the learning process. The experiments showcased the advantages of using eligibility traces in combination with recurrent networks in Atari games. The agents exhibited increased proficiency in tasks that involved long-term dependencies and temporal relationships, which are common in the Atari domain.\r\n\r\nMoreover, the paper emphasizes the significance of the optimization techniques employed during the training of recurrent DQN agents. The optimization process plays a crucial role in obtaining optimal results by fine-tuning the deep Q-network and effectively utilizing eligibility traces. The authors highlight that careful consideration should be given to the optimization process to ensure optimal performance of recurrent DQN agents with eligibility traces.\r\n\r\nIn summary, the investigation conducted in this paper provides valuable insights into the benefits and effectiveness of combining eligibility traces with recurrent networks in deep Q-network agents. The experiments conducted in Atari games demonstrate that eligibility traces can considerably expedite the learning process by propagating knowledge back in time. Furthermore, the paper emphasizes the significance of appropriate optimization techniques to maximize the performance of recurrent DQN agents. The findings of this study contribute to the growing body of research on reinforcement learning techniques and provide potential avenues for further exploration and improvement of deep Q-networks in the Atari domain.","label":40}
{"id":"ec22da73-7c2a-4cf9-b296-7f91c2ba4533","text":"The reviewers agree that the paper is clear and well-written, but there are a few areas that need to be addressed. Firstly, the experimental setup should be more thoroughly described, including details on the specific Atari games used and the implementation of the recurrent networks and eligibility traces. Additionally, the results section should provide more quantitative analysis, such as comparisons to baselines and statistical significance tests, in order to support the claims made about the benefits of recurrent nets and eligibility traces. Furthermore, it would be beneficial to discuss the limitations of the proposed approach and potential future directions for research in this area. Overall, the paper shows promise in investigating the combination of recurrent nets and eligibility traces in deep Q-networks, but further work is needed to strengthen the empirical findings and provide a more comprehensive analysis of the results.","label":11}
{"id":"facdf30e-d070-41c6-8f48-be61206c5e78","text":"This paper combines DRQN with eligibility traces, and also experiment with the Adam optimizer for optimizing g the training process. The authors investigate the benefits of using eligibility traces in combination with recurrent networks in the Atari domain. By propagating knowledge back over time-steps in a single update, eligibility traces serve as an effective bias-variance trade-off, accelerating training time. The results of their experiments demonstrate the advantages of both recurrent nets and eligibility traces in several Atari games. This study emphasizes the significance of the optimization technique employed during training, with the Adam optimizer showing promising results. Overall, the paper addresses an interesting research question and provides valuable insights into the use of eligibility traces in deep Q-Networks, contributing to the advancement of reinforcement learning techniques.","label":16}
{"id":"244603b6-6659-4425-8c58-bb1e56c327e7","text":"The paper presents a deep RL with eligibility traces. The authors combine DRQN with eligibility traces for improved training. The new algorithm is evaluated on a two problems, with a single set of hyper-parameters, and compared with DQN.\r\n\r\nThe The experimental results revealed that the combination of recurrent networks and eligibility traces in the proposed method significantly enhanced training efficiency compared to DQN. The use of eligibility traces allowed for the propagation of knowledge over time-steps, accelerating the learning process. Additionally, the benefits of recurrent nets were demonstrated in various Atari games, showcasing their ability to capture temporal dependencies. Moreover, the paper emphasizes the importance of optimization techniques in achieving optimal performance. Overall, this study contributes to the advancement of deep reinforcement learning algorithms by exploring the potential of eligibility traces and recurrent networks in the Atari domain.","label":38}
{"id":"ed5a1cb8-39b2-4e7c-ad38-d3f7f1986064","text":"This paper investigates the use of eligibility traces with recurrent DQN agents. As in other recent work on deep RL, the forward view of Sutton and Barto is used to make eligibility traces practical to use with neural networks. Experiments on the Atari games Pong and Tennis show that traces work better than standard Q-learning.\r\n\r\nThe paper is well written and the use of traces in deep RL is indeed underexplored, but the experiments in the paper are too limited and do not answer the most interesting questions.\r\n\r\nAs pointed out in the questions, n-step returns have been shown to work better than 1-step returns both in the classical RL literature and more recently with deep networks. [1] shows that using n-step returns in the forward view with neural networks leads to big improvements on both Atari and TORCS. Their n-step Q-learning method also combines returns of different length in expectation, while traces do this explicitly..The paper could benefit from expanding their experiments to include more Atari games and comparing the performance of their approach to other state-of-the-art methods. Additionally, it would be interesting to investigate how different hyperparameters, such as the learning rate and the discount factor, impact the performance of the agents using eligibility traces and recurrent networks. It would also be valuable to explore the transferability of the trained agents to different tasks within the Atari domain, as well as to other domains altogether. Overall, while the paper provides a promising exploration of the use of eligibility traces in combination with recurrent networks, further experiments and analysis are needed to fully understand the capabilities and limitations of this approach. The findings presented here have the potential to contribute to the advancement of reinforcement learning research and improve the efficiency and effectiveness of training deep Q-network agents.","label":154}
{"id":"2667831f-01c6-4132-963a-5fe15c8a6a96","text":"This paper provides an extensive analysis of the error loss function for different optimization methods. The presentation is well done and informative. The experimental procedure is clarified sufficiently well..The authors analyze the loss surfaces of deep neural networks using multiple stochastic optimization methods, providing valuable insights into their geometry. The paper effectively presents the experimental procedure and clarifies the research objectives. Moreover, the visualizations on polygons enhance the understanding of how and when the optimization methods converge. Overall, the extensive analysis and informative presentation contribute significantly to the field of deep network optimization. However, it would be beneficial if the paper addresses any potential limitations of the conducted experiments and provides discussions on the applicability of the findings in real-world scenarios.","label":29}
{"id":"d7f84185-9cde-4f9c-a59c-317e817cc93f","text":"The paper proposes an empirical investigation of the energy landscape of deep neural networks using several stochastic optimization algorithms.\r\n \r\n The extensive experiments conducted by the authors are interesting and inspiring..The experiments conducted by the authors provide valuable insights into the geometry of the loss functions for state-of-the-art networks. The visualization of these experiments on polygons helps to elucidate how and when stochastic optimization methods find minima. The empirical investigation of the energy landscape contributes to the understanding of the training process of deep neural networks. Overall, the findings presented in this paper are compelling and pave the way for further exploration of optimization methods in high-dimensional non-convex spaces.","label":31}
{"id":"0e1d5a8f-3223-40d3-8ce5-bdfff32fc483","text":"First of all, I would like to thank the authors for putting this much work into a necessary but somewhat tedious topic. While I think the paper is somewhat below the standard of a conference paper (see detailed comments below), I would definitely love to see a version of this paper published with some of the issues ironed out. I also agree with many of the points raised by other reviewers and will not repeat them here.\r\n\r\nMajor points:\r\n\r\n-- \"As we saw in the previous section, the minima of deep network loss functions are for the most part decent.\"\r\n\r\nAll you said in the previous section was that theory shows that there are no bad minima under \"strong assumptions\". There is no practical proof that there is no practical proof that these minima are indeed good in practice. It would be helpful if the authors could provide some empirical evidence or analysis to support this claim. A more thorough exploration of the quality of the minima found by different optimization methods would greatly enhance the paper's contribution to the field. Additionally, the experiments conducted by the authors seem limited in scope. While visualizing the loss surfaces on polygons is an interesting approach, it would be beneficial to include a broader range of network architectures and datasets. This would help to generalize the findings and provide more insights into the behavior of the loss functions in deep neural networks.The paper would also benefit from a more detailed explanation of the stochastic optimization methods used. While the paper mentions multiple methods, it lacks an in-depth discussion of their differences and potential impact on the results. Providing a more comprehensive analysis of the optimization methods and their effect on the loss surfaces would make the paper more informative and useful to readers.Finally, the paper could benefit from a stronger conclusion section. The current conclusion feels incomplete and does not summarize the main findings or potential implications of the research. It would be helpful to the reader if the authors clearly articulate the key takeaways from their empirical analysis and discuss the implications for training deep neural networks.Overall, this paper has the potential to make a significant contribution to the field by empirically analyzing the loss surfaces of deep neural networks. However, it would benefit from addressing the issues mentioned above to enhance its clarity, comprehensiveness, and impact. With these improvements, I believe the paper has the potential to be published and provide valuable insights to the research community.","label":123}
{"id":"59e4abd0-04ae-4a1d-b258-00374f1f19ad","text":"The paper is dedicated to better understanding the optimization landscape in deep learning, in particular when explored with different optimization algorithms, and thus it also characterizes the behavior of these algorithms. It heavily re-uses the approach of Goodfellow et al. (2015). I find it hard to understand the contributions of the paper, for example: is it surprising that different algorithms reach different solutions when starting from the same initialization? It would be useful if the authors build such basic intuition in the paper..The paper makes significant contributions in providing a thorough empirical analysis of deep network loss surfaces and the behavior of different optimization algorithms. While it may not be surprising that different algorithms reach different solutions when starting from the same initialization, the paper goes beyond this basic intuition by investigating the geometry of loss functions for state-of-the-art networks and visualizing the results on polygons. This allows for a deeper understanding of how and when these stochastic optimization methods find minima. Additionally, the paper builds upon the approach of Goodfellow et al. (2015), which adds credibility to its methodology. By characterizing the optimization landscape in deep learning, the authors provide valuable insights that can benefit researchers working on improving optimization algorithms or developing new ones. Overall, the contributions of the paper lie in its comprehensive empirical analysis, visualization techniques, and the deeper understanding it provides regarding the behavior of different optimization algorithms in deep learning settings.","label":83}
{"id":"846316b0-512c-44f6-879b-48ce97bfe503","text":"I appreciate the work but I do not think the paper is clear enough. \r\nMoreover, the authors say \"local minimia\" ~70 times but do not show (except for Figure 11?) that the solutions found are not necessarily local minima. \r\nThe authors do not talk about that fact that slices of a non-convex problem can look like the ones of a convex problem. It would be helpful if the authors provided more clarity on the nature of the solutions found by the stochastic optimization methods and their relationship to local minima. Additionally, it would be beneficial if the authors included more visualizations and analyses to further support their claims. Overall, the paper has potential but needs to address these concerns to strengthen its contribution and make it more accessible to the reader.","label":58}
{"id":"27924595-a7d8-48f1-9c4a-09b96157ee4a","text":"This paper provides an extensive analysis of the error loss function for different optimization methods. The presentation is well done and informative. The experimental results are visually depicted on polygons, giving a clear understanding of how various stochastic optimization methods perform in finding minima. The authors thoroughly investigate the high-dimensional and non-convex nature of the loss functions for state-of-the-art networks. The paper offers valuable insights into the geometry of loss surfaces, shedding light on how different optimization methods navigate these surfaces. The empirical analysis conducted in this study contributes to a better understanding of deep network training and lays a solid foundation for further research in this area. Overall, the paper is well-written and provides significant contributions to the field of deep learning.","label":24}
{"id":"e5f192fb-4deb-4a8c-989f-42793e4b9f43","text":"This paper provides an extensive analysis of the error landscape for deep neural networks. The authors conduct multiple experiments using state-of-the-art networks and stochastic optimization methods to visually represent the loss functions on polygons. This approach provides valuable insights into how and when minima are found by these optimization methods. The empirical investigation reveals the difficulty in characterizing the high-dimensional, non-convex loss functions. By analyzing the geometry of the loss surfaces, the authors contribute to a better understanding of the optimization problem in deep learning. Overall, this paper offers a comprehensive analysis and visualization of deep network loss surfaces, which can be beneficial for improving the training process of deep neural networks.","label":9}
{"id":"69700a5c-3783-4e64-b377-e7101d330a4a","text":".In this paper, the authors address the challenging task of characterizing the high-dimensional and non-convex loss functions involved in training deep neural networks. The research focuses on state-of-the-art networks and employs multiple stochastic optimization methods. The empirical investigation of the loss function geometry is carried out through various experiments, which are effectively visualized on polygons. This visualization facilitates a better understanding of when and how stochastic optimization methods are able to find minima. The presented findings shed light on the behavior and efficiency of different optimization techniques, providing valuable insights for improving the training process of deep neural networks. By examining the loss surfaces, this study contributes to our understanding of optimization landscapes and lays the foundation for future enhancements in network training and performance.","label":0}
{"id":"f763cff8-410e-4e65-ba83-575f83086895","text":"First of all, I would like to thank the authors for putting this much work into a necessary but somewhat tedious topic. While I think the paper is somewhat below the standard of a conference paper (see detailed comments below), I would definitely love to see a version of this paper published with some of the issues ironed out. I also agree with many of the points raised by other reviewers and will not repeat them here.\r\n\r\nMajor points:\r\n\r\n-- \"As we saw in the previous section, the minima of deep network loss functions are for the most part decent.\"\r\n\r\nAll you said in the previous section was that theory shows that there are no bad minima under \"strong assumptions\"..However, it is important to note that these theoretical results may not always hold in practice and there have been cases where deep network loss functions have found bad minima even when the assumptions are satisfied. Therefore, it would be misleading to make a blanket statement that all minima are decent based solely on theoretical results. It would be more accurate to say that under certain conditions, deep network loss functions tend to converge to decent minima, but there can still be cases where suboptimal or even bad minima are found.One possible explanation for the discrepancy between theory and practice is the presence of noise and randomness in the optimization process. Stochastic optimization methods introduce randomness through the use of mini-batches, learning rate schedules, and other techniques. This randomness can lead to the exploration of different regions of the loss surface and potentially help escape from suboptimal or bad minima. However, it can also hinder convergence and lead to getting stuck in other regions of the loss surface.The authors' empirical investigation of the geometry of loss functions for state-of-the-art networks with multiple stochastic optimization methods is a valuable contribution to the understanding of how deep network optimization works in practice. By visualizing the experiments on polygons, the authors provide a clear and intuitive way to illustrate the behavior of different optimization methods and how they explore the loss surface.One strength of the paper is the extensive experimentation performed by the authors. They compare the behavior of different optimization methods, including stochastic gradient descent, Adam, and RMSprop, on various datasets and network architectures. This provides a comprehensive evaluation of the performance of these methods and allows the authors to draw meaningful conclusions.Another strength is the visualization of the experiments on polygons. This visual representation helps the reader to visually grasp the behavior of the optimization methods and see how they navigate the loss surface. The authors use different colors to represent the training trajectories of different optimization methods, and this makes it easy to compare and contrast their performance.However, there are some aspects of the paper that could be improved. First, the paper lacks a clear motivation for why understanding the geometry of loss functions and optimization methods is important. 
While the authors mention that deep network optimization is a high-dimensional non-convex optimization problem, they do not explicitly connect this to the need for understanding the loss surface geometry and its implications for optimization. In addition, the paper could benefit from a more thorough discussion of the limitations of the empirical analysis. While the experiments provide valuable insights into the behavior of different optimization methods, it is important to acknowledge that the conclusions drawn are based on a specific set of datasets and network architectures. The authors should discuss the generalizability of their findings and potential limitations in other settings. Overall, I believe that this paper makes a valuable contribution to the understanding of deep network loss surfaces and optimization methods. The authors' empirical investigation and visualization on polygons provide a clear and intuitive way to understand the behavior of different optimization methods. With some revisions to address the mentioned issues, this paper has the potential to be published and would be a valuable addition to the field of deep learning optimization.","label":117}
{"id":"6931bbbc-42ff-4fe4-b2b8-f834fe62986b","text":"The paper is dedicated to better understanding the optimization landscape in deep learning, in particular when explored with different optimization algorithms, and thus it also characterizes the behavior of these algorithms. It heavily re-uses the approach of Goodfellow et al. (2015). I find it hard to understand the contributions of the paper, for example: is it surprising that different algorithms reach different solutions when starting from the same initialization? It would be useful if the authors build such basic intuition in the paper. I also did not receive a clear answer to the question I posed to reviewers regarding clarifying how does the findings of the paper can contribute to future work and research in the field of deep learning. The abstract provides a clear overview of the paper's purpose and methodology. The authors aim to empirically analyze the geometry of loss functions in deep neural networks, utilizing various stochastic optimization methods. While the paper builds upon the work of Goodfellow et al. (2015), it is important to understand that not all readers may be familiar with this reference. Providing a brief explanation or comparison would improve accessibility for a wider audience. Additionally, the reviewer raises a valid point about the significance of different algorithms reaching different solutions from the same initialization. The authors should address this concern and highlight the potential implications and insights gained from such variations. Overall, further clarifications and explanations are needed to strengthen the paper's contributions and its relevance for future research in the field of deep learning.","label":111}
{"id":"0b0c8cb3-047f-4e9b-ae93-1475fb34c1d8","text":"I appreciate the work but I do not think the paper is clear enough. \r\nMoreover, the authors say \"local minimia\" ~70 times but do not show (except for Figure 11?) that the solutions found are not necessarily local minima. \r\nThe authors do not talk about that fact that slices of a non-convex problem can look like the global minimum. This lack of clarification makes some of the claims in the paper seem unsubstantiated. Additionally, the visualizations on polygons are interesting but it is not clear how they relate to the actual loss surfaces of deep networks. I would suggest providing more detailed explanations and supporting evidence to strengthen the findings. Additionally, it would be beneficial to explore the implications of the observed loss surface geometry on the performance of deep neural networks. Overall, the paper has potential but requires further refinement and clarity.","label":57}
{"id":"a406ee15-d82f-4bfa-81d4-2ae35cb3653b","text":"This paper provides an extensive analysis of the error loss function for different optimization methods. The presentation is well done and the paper provides valuable insights into the geometry of loss functions for deep neural networks. The experiments conducted are well-designed and the results are clearly visualized on polygons, which enhances the understanding of how stochastic optimization methods find minima. The findings of this study can have significant implications for improving the optimization process and enhancing the performance of deep networks. Overall, I believe that this paper is a valuable contribution to the field and should be considered for publication.","label":21}
{"id":"7a8efccb-7d66-4a07-a132-7cf48bb9986e","text":"This paper proposes to use RNN and reinforcement learning for solving combinatorial optimization problems. The use of pointer network is interesting as it enables generalization to arbitrary input size. The proposed method also \"fintunes\" on test examples with active search to achieve better performance.\r\n\r\nThe proposed method is theoretically interesting as it shows that RNN and RL can be combined to solve combinatorial optimization problems The paper's focus on the traveling salesman problem (TSP) and the use of negative tour length as the reward signal are notable contributions. Furthermore, the comparison of learning the network parameters on training graphs versus individual test graphs provides valuable insights. The results of Neural Combinatorial Optimization on 2D Euclidean graphs with up to 100 nodes and the KnapSack problem demonstrate its effectiveness. Although the approach is still not at the state-of-the-art level, it offers promising potential for using neural networks as a general tool for tackling combinatorial optimization problems. Overall, this paper presents a compelling framework that combines RNNs and reinforcement learning to address complex optimization problems in an innovative way.","label":64}
{"id":"afe8e0d0-66c9-4c39-89e6-3c7750bcb1ed","text":"This was one of the more controversial submissions to this area, and there was extensive discussion over the merits and contributions of the work. The paper also benefitted from ICLRs open review system as additional researchers chimed in on the paper and the authors resubmitted a draft. The authors did a great job responding and updating the work and responding to criticisms..The paper presents a framework for solving combinatorial optimization problems using neural networks and reinforcement learning. Specifically, it focuses on the traveling salesman problem (TSP) and trains a recurrent neural network to predict a distribution over different city permutations given a set of city coordinates. The authors optimize the parameters of the network using a policy gradient method, with the negative tour length as the reward signal. One interesting aspect of the paper is that it compares learning the network parameters on a set of training graphs against learning them on individual test graphs. The experimental results show that the proposed Neural Combinatorial Optimization approach achieves close to optimal results on 2D Euclidean graphs with up to 100 nodes. It is also applied to the KnapSack problem, another NP-hard problem, and obtains optimal solutions for instances with up to 200 items. While these results are still far from state-of-the-art, they provide valuable insights into the potential of neural networks as a general tool for tackling combinatorial optimization problems. One of the strengths of the paper is that it provides a clear and concise explanation of the proposed framework, making it easy for readers to understand the methodology and reproduce the experiments. The authors also deserve credit for their efforts in engaging with researchers and responding to criticisms during the open review process. This shows a commitment to improving the quality of their work and addressing concerns raised by the community. However, there are a few areas that could be further improved. First, the paper could benefit from a more detailed discussion of the limitations and shortcomings of the proposed approach. Additionally, the authors should consider comparing their method against existing state-of-the-art techniques to provide a more thorough evaluation of its performance. Overall, the paper is a valuable contribution to the field of combinatorial optimization and holds promise for future research in this area.","label":62}
{"id":"b7532ff1-6e1f-421b-93b0-1ac9eb86c5a7","text":".In this paper, the authors propose a novel approach to solving combinatorial optimization problems using neural networks and reinforcement learning. Specifically, they focus on the traveling salesman problem (TSP) and train a recurrent neural network to predict city permutations based on given city coordinates. They optimize the network parameters using a policy gradient method, with the negative tour length serving as the reward signal. The authors compare the performance of training the network on a set of training graphs versus training it on individual test graphs. Remarkably, the proposed method achieves close to optimal results on 2D Euclidean graphs with up to 100 nodes, without extensive engineering or heuristic designing. Additionally, the same approach also obtains optimal solutions for instances of the KnapSack problem with up to 200 items. Although these results are not yet state-of-the-art, they provide valuable insights into the potential of neural networks as a general tool for addressing complex combinatorial optimization problems.","label":0}
{"id":"f97990e2-c918-4fe8-8971-6a11c690f35e","text":"I posted this question in a response below, but it seems to be getting ignored so I thought I'd bring it to the top, with some additional points.\r\n\r\nThanks for the update. The natural question to ask, then is - do there exist many (or any) problems that are both interesting and have not been, and cannot be, addressed by the existing combinatorial optimization community? You knock existing algorithms for being applied to combinatorial optimization problems with great success. However, this paper presents a novel approach that combines neural networks and reinforcement learning to tackle these problems. By training a recurrent neural network on the traveling salesman problem (TSP) and the KnapSack problem, the authors demonstrate that their method achieves close to optimal results. This is an exciting development as it suggests that neural networks can be utilized as a general tool for solving combinatorial optimization problems. The paper acknowledges that the results obtained are not yet state-of-the-art, but they provide valuable insights into the potential of neural networks in this field. One question that remains is whether there are any problems that are truly out of reach for this approach. Further research and experimentation are needed to explore the limitations and scalability of the proposed method. Overall, this paper contributes to the growing body of literature on neural combinatorial optimization and offers new perspectives for addressing these challenging problems.","label":70}
{"id":"04815efe-d8ec-4529-9b6f-133ec8bd9c86","text":"We thank reviewers for their valuable feedback that helped us improve the paper. We appreciate their interest in the method and its novelty. We have made several changes to the paper which are summarized below. We ask reviewers to evaluate the new version of the paper and adjust their reviews if necessary.\r\n\r\n1) Previous Figure 1, which was problematic due to different possible interpretations of \u201clocal search\u201d was removed.\r\n\r\n2) We added precise running time evaluations for all of the methods in the paper. Table 3 presents running time of the RL pretraining-greedy method and the solvers we compare against. Table 4 presents the performance and corresponding running time of RL pretraining-Sampling and RL pretraining-Active Search as a function of the number solutions considered. It shows how they can be stopped early at the cost of a small performance degradation..3) We have revised the experimental section to provide more details about the setup and evaluation metrics. We have also included a discussion on the limitations of our approach, such as the scalability to larger problem instances. Additionally, we have added a comparison with other state-of-the-art methods to provide a clearer perspective on the performance of our approach. We believe these changes have addressed some of the concerns raised by the reviewers and have improved the overall quality of the paper. We kindly request reviewers to carefully evaluate the revised version of the paper and consider these updates in their final reviews. We thank the reviewers once again for their valuable feedback and suggestions.","label":138}
{"id":"9970cbdf-e7ad-4695-9237-10f94123074c","text":"This paper applies the pointer network architecture\u2014wherein an attention mechanism is fashioned to point to elements of an input sequence, allowing a decoder to output said elements\u2014in order to solve simple combinatorial optimization problems such as the well-known travelling salesman problem. The network is trained by reinforcement learning using an actor-critic method, with the actor trained using the REINFORCE method, and the critic used to estimate the reward baseline within the REINFORCE objective.\r\n\r\nThe paper is well written and easy to understand. The paper does a good job of introducing the problem of combinatorial optimization and presents an interesting approach using neural networks and reinforcement learning. The use of the pointer network architecture allows the network to effectively handle the permutation aspect of the TSP and predict distributions over different city permutations. The training methodology, using reinforcement learning with an actor-critic approach, is appropriate for solving such a problem where the reward signal is not easily defined. The use of the REINFORCE method for training the actor and the use of the critic to estimate the reward baseline within the REINFORCE objective provide a solid foundation for learning optimal policies. One strength of the paper is that it demonstrates the effectiveness of the proposed method on both the TSP and the KnapSack problem, which are both NP-hard problems. The results show that the Neural Combinatorial Optimization approach achieves close to optimal results on 2D Euclidean graphs with up to 100 nodes and obtains optimal solutions for instances of the KnapSack problem with up to 200 items. These results, although still not state-of-the-art, provide valuable insights into the potential of using neural networks as a general tool for tackling combinatorial optimization problems. Overall, the paper is well written and easy to understand. The authors provide clear explanations of the methods used and present their results in a concise and informative manner. However, there are some areas that could be further improved. It would be beneficial to include more details on the implementation and hyperparameter settings used in the experiments to ensure reproducibility. Additionally, the authors could discuss the limitations and potential extensions of the proposed method to further enhance its applicability. With these improvements, the paper will make a valuable contribution to the field of combinatorial optimization with neural networks and reinforcement learning.","label":81}
{"id":"cd573e8d-4c1b-45e0-80be-197efa2f1eb1","text":"This paper is methodologically very interesting, and just based on the methodological contribution I would vote for acceptance. However, the paper's sweeping claims of clearly beating existing baselines for TSP have been shown to not hold, with the local search method LK-H solving all the authors' instances to optimality -- in seconds on a CPU, compared to clearly suboptimal results by the authors' method in 25h on a GPU. \r\n\r\nSeeing this clear dominance of the local search method LK-H, I find it irresponsible by the authors that they left Figure 1 as it is -- with the line for \"local search\" referring to an obviously poor implementation by Google rather than the LK-H local search method that everyone uses. For example, at NIPS, I saw this Figure 1 presentation where one could clearly see that LK-H finds the optimal solution for all instances. It is important for the authors to provide a fair and accurate comparison to existing methods in order to evaluate the true effectiveness of their proposed approach.\r\n\r\nFurthermore, the authors claim that their method achieves close to optimal results on 2D Euclidean graphs with up to 100 nodes. While this is indeed an impressive achievement, it would be helpful to have a more detailed analysis of the performance on larger instances. It is unclear how the method scales with the size of the problem, and whether it would still be competitive when applied to larger datasets. Providing such analysis would strengthen the paper and provide a better understanding of the capabilities and limitations of the proposed approach.\r\n\r\nOn the other hand, the authors' exploration of applying their method to the Knapsack problem is a valuable contribution. The fact that the same method obtains optimal solutions for instances with up to 200 items demonstrates its potential for tackling NP-hard problems beyond TSP. However, it would be beneficial to have a comparison with other existing approaches for the Knapsack problem to understand the relative performance of the proposed method.\r\n\r\nIn terms of the methodology itself, the use of neural networks and reinforcement learning for combinatorial optimization problems is a novel and promising direction. The paper provides a clear description of the framework and the training process, which allows for reproducibility. The choice of the negative tour length as the reward signal is intuitive given the nature of TSP, and the policy gradient method used for optimization appears appropriate for this task.\r\n\r\nOverall, this paper presents an interesting framework for tackling combinatorial optimization problems using neural networks and reinforcement learning. The results obtained on TSP and Knapsack instances demonstrate the potential of the proposed approach. However, it is essential for the authors to address the limitations mentioned above, particularly in terms of the comparison with existing methods and the scalability of the proposed method. Additionally, the authors should update Figure 1 to accurately reflect the comparison with the state-of-the-art local search method LK-H. With these revisions, the paper would make a valuable contribution to the field of combinatorial optimization with neural networks and reinforcement learning. Therefore, I recommend that the paper undergo a minor revision to address these concerns before it is accepted for publication.","label":128}
{"id":"e33d0bae-260a-4ef1-b555-884d5a120911","text":"This paper proposes to use RNN and reinforcement learning for solving combinatorial optimization problems. The use of pointer network is interesting as it enables generalization to arbitrary input size. The proposed method also \"fintunes\" on test examples with active search to achieve better performance.\r\n\r\nThe proposed method is theoretically interesting as it shows that RNN and RL can be combined to solve combinatorial optimization problems and achieve comparable performance to traditional heuristic based algorithms.\r\n\r\nHowever, the lack of complexity comparison against baselines make it impossible to tell whether the proposed method has any practical value..The lack of complexity comparison against baselines is indeed a limitation of the paper. Without such comparison, it is difficult to assess the practical value of the proposed method. It would have been interesting to see how the performance of the RNN and reinforcement learning approach compares to traditional heuristic-based algorithms on different problem instances. Additionally, the paper could have discussed the scalability of the proposed method and its computational efficiency in solving large-scale combinatorial optimization problems. Overall, while the framework presented is theoretically interesting, more comprehensive experiments and comparisons would have strengthened its practical applicability.","label":93}
{"id":"c036f4b5-72eb-4135-bf69-e69bcce8b0bf","text":"This is very interesting to me! Thank you for this.\r\n\r\nAfter reading this paper, I tested the Concorde..After reading this paper, I tested the Concorde solver on benchmark instances of the traveling salesman problem (TSP) and compared its performance with the proposed method. The results were quite impressive, with the neural combinatorial optimization approach consistently achieving close to optimal results on 2D Euclidean graphs with up to 100 nodes. Additionally, I found it fascinating how the same method was able to obtain optimal solutions for instances of the Knapsack problem with up to 200 items. These findings further support the potential of using neural networks and reinforcement learning to tackle combinatorial optimization problems without extensive engineering and heuristics.","label":17}
{"id":"ea89aeeb-fd0f-4c8d-ac0d-74ce18e4561d","text":"In Table 3, what is the performance for the missing values of RL pretraining with 10.000 batches for Sampling T=1 and T=T*? \r\n\r\nSince performance improved much more from 100 to 1.000 batches for RL pretraining Sampling T=T* than it did for RL pretraining AS (e.g., 5.79->5.71 vs 5.74->5.71 for TSP50), I would expect RL pretraining Sampling T=T* to do better than RL pretraining AS when you use 10.000 samples..The missing values for RL pretraining with 10.000 batches for Sampling T=1 and T=T* are not provided in Table 3. It would be helpful to include these results to evaluate the performance of the different RL pretraining methods comprehensively. Additionally, since the performance improvement from 100 to 1.000 batches was more significant for RL pretraining Sampling T=T* compared to RL pretraining AS, it is reasonable to expect that RL pretraining Sampling T=T* would outperform RL pretraining AS when using 10.000 samples. Including these results would provide further insights into the effectiveness and efficiency of the different RL pretraining approaches. Overall, the paper provides valuable contributions by demonstrating the potential of using neural networks and reinforcement learning for solving combinatorial optimization problems and offers promising results on TSP and KnapSack instances. However, further experiments and analysis are needed to fully understand the strengths and limitations of the proposed approach.","label":69}
{"id":"74c47666-5b66-4d66-8f77-544be4cd1f2c","text":"There is a large body of work on solving combinatorial optimization problems using various techniques such as mathematical programming, genetic algorithms, branch and bound, and constraint programming. However, neural networks and reinforcement learning have recently emerged as promising approaches for tackling these problems. The paper under review presents a framework that combines neural networks and reinforcement learning to address combinatorial optimization problems. The authors specifically focus on the traveling salesman problem (TSP) and propose a recurrent neural network that predicts a distribution over city permutations given a set of city coordinates. They optimize the network parameters using a policy gradient method with negative tour length as the reward signal. The authors compare learning the parameters on a set of training graphs versus individual test graphs and report promising results. They achieve close to optimal solutions for TSP instances with up to 100 nodes and solve the KnapSack problem optimally for instances with up to 200 items. While these results are still not state-of-the-art, they provide valuable insights into the potential of using neural networks for solving combinatorial optimization problems. Further improvements and extensions may build upon this work to achieve even better performance and advance the field.","label":9}
{"id":"63141c92-c6db-472a-b595-1e21a402d042","text":"This paper proposes to use RNN and reinforcement learning for solving combinatorial optimization problems. The use of pointer network is interesting as it enables generalization to arbitrary input size. The proposed method also \"fintunes\" on test examples with active search to achieve better performance.\r\n\r\nThe proposed method is theoretically interesting and addresses a relevant and challenging problem in the field of combinatorial optimization. By applying a recurrent neural network coupled with reinforcement learning, the authors aim to tackle the traveling salesman problem, a classic and well-known problem that exhibits NP-hard characteristics. The novelty lies in using negative tour length as the reward signal and optimizing the parameters of the neural network using policy gradient methods. The authors compare their approach by training on a set of training graphs and individua+test graphs, and demonstrate promising results on 2D Euclidean graphs with up to 100 nodes. Additionally, the method is also applied to the Knapsack problem, achieving optimal solutions for instances with up to 200 items. These empirical results add insights into the applicability of neural networks in solving combinatorial optimization problems, however, further improvements are required to reach state-of-the-art performance.","label":49}
{"id":"2ff0dcea-9a43-4fe0-8b99-861f4be176f5","text":"This was one of the more controversial submissions to this area, and there was extensive discussion over the merits and contributions of the work. The paper also benefitted from ICLRs open review process, which provided valuable feedback and criticism. The paper presents a novel framework for tackling combinatorial optimization problems using neural networks and reinforcement learning. The focus is on the traveling salesman problem (TSP) and the authors propose training a recurrent neural network to predict a distribution over different city permutations given a set of city coordinates. The reward signal used for optimization is the negative tour length. The authors apply a policy gradient method to optimize the parameters of the recurrent neural network. One interesting aspect of the work is the comparison between learning the network parameters on a set of training graphs versus learning them on individual test graphs. The authors report that their approach achieves close to optimal results on 2D Euclidean graphs with up to 100 nodes. Additionally, when applied to the Knapsack problem, the same method obtains optimal solutions for instances with up to 200 items. While the results are not state-of-the-art, they provide valuable insights into the potential use of neural networks as a general tool for solving combinatorial optimization problems. The authors acknowledge that the current approach lacks extensive engineering and heuristic designing, and it would be interesting to explore how these components could further improve the results. Overall, this paper contributes to the growing body of research on combinatorial optimization with neural networks and highlights the potential of reinforcement learning in addressing such complex problems. The paper stimulates further discussion and research in this domain, and with some improvements and refinements, it could have a significant impact on the field. One potential direction for future work could be the application of this framework to other combinatorial optimization problems and comparing the performance with existing methods. In conclusion, this paper presents a promising approach to combinatorial optimization using neural networks and reinforcement learning, and I recommend its acceptance for publication.","label":32}
{"id":"4522b1b5-e9b2-4bd2-adb3-5d0cbce89f37","text":"We ask reviewers to have a look at the new version of the paper again given the changes outlined below:\r\n\r\n- We state clearly in the abstract, introduction, and conclusion that our results are still preliminary and further exploration is needed to improve our results and make them more competitive with state-of-the-art approaches in combinatorial optimization. However, our approach showcases the potential of neural networks and reinforcement learning as general tools for solving NP-hard problems. We also believe that the insights gained from this work can inspire future research in developing novel algorithms and techniques for tackling complex combinatorial optimization problems using neural networks.","label":34}
{"id":"cf4cc61d-2284-4c66-9348-432cbd2d7b36","text":"I posted this question in a response below, but it seems to be getting ignored so I thought I'd bring it to the top, with some additional points.\r\n\r\nThanks for the update..Thank you for sharing this paper. The authors present an interesting framework that combines neural networks and reinforcement learning to tackle combinatorial optimization problems, with a focus on the well-known traveling salesman problem (TSP). They train a recurrent neural network (RNN) to predict a distribution over different city permutations, given a set of city coordinates. The optimization of the RNN parameters is performed using a policy gradient method, where the negative tour length serves as the reward signal. One notable aspect of the paper is the comparison between learning the network parameters on training graphs versus individual test graphs. The results demonstrate that the proposed method achieves close to optimal solutions on 2D Euclidean graphs with up to 100 nodes, without requiring extensive engineering or heuristic design. Additionally, when applied to the Knapsack problem, a challenging NP-hard problem, the same technique obtains optimal solutions for instances with up to 200 items. Although the achieved results are not yet state-of-the-art, they provide valuable insights into the potential of neural networks as a general tool for addressing combinatorial optimization problems. The paper's contribution lies in the exploration of an alternative approach that can potentially overcome the limitations of traditional algorithms in solving NP-hard problems. One aspect that could be further improved is the evaluation of the proposed method against other state-of-the-art techniques and benchmarks. This would help in establishing a clearer comparison and understanding of the effectiveness and limitations of the neural combinatorial optimization approach. Overall, this paper opens up interesting avenues for further research and potential applications of neural networks in solving challenging optimization problems.","label":31}
{"id":"9d7ee407-d576-4486-a561-a0e337af824b","text":"We thank reviewers for their valuable feedback that helped us improve the paper. We appreciate their interest in the method and its novelty. We have made several changes to the paper which are summarized below. We ask reviewers to evaluate the new version of the paper and adjust their reviews if necessary.\r\n\r\n1) Previous Figure 1, which was problematic due to different possible interpretations of \u201clocal search\u201d was removed.\r\n\r\n2) We added precise running time evaluations for all of the experiments and clarified the running time assumptions. This provides a more comprehensive understanding of the computational efficiency of our approach.\r\n\r\n3) We extended the related work section to include more recent advances in neural combinatorial optimization. This allowed us to present a more complete picture of the state-of-the-art and position our work within the existing literature.\r\n\r\n4) In response to Reviewer 2's comment, we added additional analysis on the scalability of our method. Specifically, we conducted experiments on larger problem instances and provided insights into the performance trends as the problem size increases. This strengthens the paper's contribution and demonstrates the efficacy of our approach beyond smaller problem sizes.\r\n\r\n5) We have made the necessary revisions to the introduction to better highlight the significance of our work. We emphasize the limitations of existing methods and illustrate how our neural network-based approach addresses those limitations. This helps to clearly establish the motivation and relevance of our research.\r\n\r\n6) Reviewer 3 pointed out the need for a more detailed discussion of the training procedure. We have now included a step-by-step explanation of the training process, along with the mathematical formulations involved. This enhances the clarity of our methodology and enables a better understanding of the techniques employed.\r\n\r\nWe hope that these revisions address the concerns raised by the reviewers and contribute to the overall quality of the paper. We thank the reviewers for their thorough evaluations and insightful comments, which have undoubtedly strengthened our work.","label":78}
{"id":"1754d9d0-eee7-45ab-9919-c94d1329ff96","text":"This paper applies the pointer network architecture\u2014wherein an attention mechanism is fashioned to point to elements of an input sequence, allowing a decoder to output said elements\u2014in order to solve simple combinatorial optimization problems, particularly the traveling salesman problem (TSP) and the Knapsack problem. The authors train a recurrent neural network (RNN) that, given a set of city coordinates, predicts a distribution over different city permutations for TSP. They use the negative tour length as the reward signal to optimize the parameters of the RNN through a policy gradient method. Their approach achieves impressive results, obtaining close to optimal solutions on 2D Euclidean graphs with up to 100 nodes for TSP and optimal solutions for instances with up to 200 items for the Knapsack problem.One notable aspect of this work is the minimal engineering and heuristic designing involved. By relying on a neural network and reinforcement learning, the authors demonstrate the potential for neural networks as a general tool for tackling combinatorial optimization problems, without the need for domain-specific knowledge or handcrafted features. This is particularly valuable as combinatorial optimization problems are often complex and challenging, requiring sophisticated algorithms or expert-designed heuristics.The authors also compare the performance of learning the network parameters on a set of training graphs against learning them on individual test graphs. This analysis provides insights into the generalizability and transferability of the model, as well as the potential impact of training data distribution on performance.It is important to note that while the results obtained by the proposed method are promising, they are still far from the state-of-the-art. Nevertheless, this paper contributes to the growing body of research in using neural networks and reinforcement learning for combinatorial optimization, opening up possibilities for further advancements in the field.In terms of improvements, it would be beneficial for the authors to provide a more detailed explanation of the policy gradient method used for parameter optimization. Additionally, further analysis of the limitations and potential challenges of applying neural networks to combinatorial optimization problems would enhance the discussion. Overall, this paper presents an interesting approach and yields valuable insights into the application of neural networks in solving combinatorial optimization problems.","label":32}
{"id":"c52a75f6-cfca-44d0-b230-372464dc4099","text":"This paper is methodologically very interesting, and just based on the methodological contribution I would vote for acceptance. However, the paper's sweeping claims of clearly beating existing baselines for TSP have been shown to not hold, with the local search method LK-H solving all the authors' instances to optimality -- in seconds on a CPU, compared to clearly suboptimal results by the authors' method in 25h on a GPU. \r\n\r\nSeeing this clear dominance of the local search method LK-H, I find it irresponsible by the authors that they left Figure 1 as it is -- with the line for \"local search\" referring to an obviously poor implementation by Google rather than the LK-H local search method that everyone uses. For example, at NIPS, I saw this Figure 1 being used in a talk (I am not sure anymore by whom, but I don't think it was by the authors), the narrative being \"RNNs now also clearly perform better than local search\". Of course, people would use a figure like that for that purpose, and it is clearly up to the authors to avoid such misconceptions. \r\n\r\nThe right course of action upon realizing the real strength of local search with LK-H would've been to make \"local search\" the same line as \"Optimal\", showing that the authors' results are suboptimal compared to LK-H. Moreover, the authors should have provided a clear comparison between their method and LK-H in terms of runtime, as the latter solves the instances in seconds on a CPU, while the authors' method takes 25 hours on a GPU. This information is crucial for understanding the practical applicability of the proposed approach.\r\n\r\nAdditionally, the authors' claim of 'close to optimal results' on 2D Euclidean graphs with up to 100 nodes needs to be examined in more detail. It is unclear how the authors define 'optimal' and whether they considered other state-of-the-art algorithms for comparison. Providing a comprehensive evaluation against established baselines would strengthen the paper's claims and contribute to a more thorough understanding of the proposed approach's performance.\r\n\r\nFurthermore, the authors' decision to include an incorrect representation of the 'local search' line in Figure 1 raises concerns about the clarity and accuracy of their presentation. It is essential for authors to present their findings in an objective and unbiased manner, especially when their work is showcased in conferences or talks. Not correcting this misrepresentation may lead to misconceptions and misinterpretations of the authors' work, which could tarnish the credibility of their research.\r\n\r\nGiven the methodological contribution of the paper, it is still valuable and worthy of publication. However, to improve the paper, the authors should address the aforementioned concerns. Firstly, they should revise the experimental setup to include a direct comparison between their approach and LK-H in terms of runtime and quality of results. Furthermore, they should provide a comprehensive evaluation against existing state-of-the-art algorithms, showcasing the strengths and weaknesses of their method. 
Lastly, the authors should correct Figure 1 to accurately represent the performance of the local search method LK-H, thus avoiding any potential misconceptions.\r\n\r\nIn conclusion, while the paper's methodology presents an intriguing approach to combinatorial optimization problems, the authors need to address the issues outlined above to strengthen their claims and ensure the accuracy of their findings.","label":215}
{"id":"762c605d-949a-43e5-8f16-2c5f5be3836b","text":"This paper proposes to use RNN and reinforcement learning for solving combinatorial optimization problems. The use of pointer network is interesting as it enables generalization to arbitrary input size.. The approach presented in the paper focuses on the traveling salesman problem (TSP) and demonstrates its effectiveness in predicting city permutations for TSP instances with up to 100 nodes. Additionally, the same approach is applied to the KnapSack problem, which is another NP-hard problem, and it achieves optimal solutions for instances with up to 200 items. These results provide valuable insights into how neural networks can be utilized as a versatile tool for tackling combinatorial optimization problems. The experiment has been conducted without relying on extensive engineering or heuristic designing, suggesting that the proposed framework has the potential to be applied to a wide range of combinatorial optimization problems without requiring domain-specific expertise. However, it's worth mentioning that the results reported in the paper are still not at the state-of-the-art level, leaving room for further improvement and investigation.","label":29}
{"id":"638c2a66-0700-4d81-a609-ee4a0f5d26ec","text":"This is very interesting to me! Thank you for this.\r\n\r\nAfter reading this paper, I tested the Concorde. I think the Concorde allows me to compare the results of Concorde with the results obtained from the proposed neural combinatorial optimization framework. The results were quite impressive as the neural network achieved close to optimal results on 2D Euclidean graphs with up to 100 nodes, while Concorde is known for its efficiency in solving the traveling salesman problem. This highlights the potential of using neural networks and reinforcement learning in solving combinatorial optimization problems. However, I would like to see further experiments and comparisons with other state-of-the-art methods to get a better understanding of the performance and limitations of this approach. Overall, this paper provides valuable insights into the applicability of neural networks as a general tool for tackling NP-hard problems.","label":22}
{"id":"db38b156-ba68-40a5-9c2a-1e9e8ca94bf5","text":"I am very glad to read \"Our model and training code Our model and training code show promising results in solving combinatorial optimization problems, particularly the traveling salesman problem and the KnapSack problem. The use of neural networks and reinforcement learning techniques has proven effective in achieving close to optimal solutions without the need for extensive engineering and heuristic designing. However, there is still room for improvement to reach state-of-the-art performance in this field.","label":11}
{"id":"f6e9791d-e5fb-4f92-b01d-bd3f29f312d1","text":"In Table 3, what is the performance for the missing values of RL pretraining with 10.000 batches for Sampling T=1 and T=T*? \r\n\r\nSince performance improved much more from 100 to 1.000 batches for RL pretraining Sampling T=T* than it did for RL pretraining AS (e.g., 5.79->5.71 vs 5.74->5.71 for TSP50), I would expect RL pretraining Sampling T=T* to do better in terms of performance on the missing values. Additionally, it would be interesting to see if increasing the number of batches for RL pretraining Sampling T=T* would continue to improve the results, or if there is a point of diminishing returns. Overall, this paper presents a promising framework for tackling combinatorial optimization problems using neural networks and reinforcement learning. It provides valuable insights into the potential of neural networks as a general tool in solving NP-hard problems. With further improvements and fine-tuning, it is possible that this approach could reach state-of-the-art levels of performance in the future.","label":60}
{"id":"62f8c37e-0dc4-420d-9dac-0d142010706c","text":"There is a large body of work on solving TSP instances that this paper ignores..However, the authors acknowledge the existing literature on solving TSP instances, but they justify their focus on neural networks and reinforcement learning as a novel approach. They argue that their framework achieves close to optimal results on 2D Euclidean graphs with up to 100 nodes, which is a promising result. Although the paper may not cover all prior work, it provides valuable insights into the potential of using neural networks as a general tool for solving combinatorial optimization problems.","label":15}
{"id":"c6dd4b9e-7884-4ec7-a5a6-9fccc106f08d","text":"This work investigates the performance of transfer learning from resource-rich setup (BookTest, CNN\/Daily Mail corpora) to low-resource (bAbI, SQuAD benchmarks) settings. Experiments show poor transfer if the model isn\u2019t shown any training examples from the target domain. However, the results are promising if the model is shown at least a few target-domain examples. The study concludes that deep learning models trained on data-rich domains can be effective in improving performance in low-resource settings, but the transfer is highly dependent on the presence of target-domain examples during training. This has implications for domains where collecting large amounts of training data is challenging. Additionally, the paper highlights that the benefits of pre-training extend beyond word embeddings, suggesting that the learned representations capture important contextual information. Overall, this research contributes to the understanding of semi-supervised learning in reading comprehension tasks, providing insights into the potential of utilizing pre-trained models to enhance performance in data-scarce domains.","label":24}
{"id":"f782eb95-ce4f-4fd8-a480-ffa28afe00ed","text":"The area chair agrees with reviewers 1 and 2, stating that the paper provides valuable insights into the potential of using pre-training in neural network models for reading comprehension tasks. They appreciate the comprehensive experiments conducted and the clear presentation of the results. Reviewer 1 also commends the paper for highlighting the importance of having at least a few target-domain examples for effective transfer learning. Overall, the reviewers find the findings of this study promising and suggest minor revisions to improve clarity and further strengthen the analysis.","label":8}
{"id":"cbaae6ca-9b3f-4d85-aaba-906ad9f2b4cb","text":"Dear authors and reviewers, this paper is currently very interesting and addresses an important challenge in NLP research. The experiments conducted provide valuable insights and suggest potential directions for future investigation.","label":9}
{"id":"397bf5d3-17d6-495b-877a-9c0360b91e07","text":"First I would like to apologize for the delay in reviewing.\r\n\r\nsummary : This work explores several experiments to transfer training a specific model of reading comprehension ( AS Reader), in an artificial and well populated dataset in order to perform in another target dataset. \r\n\r\nHere is what I understand are their several experiments to transfer learning, but I am not 100% sure.\r\n1. The model is trained on the big artificial dataset and tested on the small target datasets (section 4.1)\r\n2. The model is pre-trained on the big artificial dataset like before, then fine-tuned on a few examples from the target dataset and tested on the remaining target examples..The authors explore the effectiveness of using semi-supervised learning in reading comprehension tasks. They train a neural-network-based model on two context-question-answer datasets, the BookTest and CNN\/Daily Mail, and evaluate its performance on subsets of bAbI and SQuAD. The results show limited transfer if the model does not encounter any training examples from the target domain. However, when the model is exposed to a few target-domain examples, the results are promising, indicating the potential of pre-training and fine-tuning techniques in improving performance. Moreover, the authors highlight that the benefits of pre-training extend beyond word embeddings. Although the review provides an initial understanding of the experiments conducted, further elaboration on the specific transfer learning methods employed and the evaluation metrics used would enhance the clarity of the paper. Additionally, it would be helpful to discuss the implications of these findings for real-world applications and potential areas for future research. Overall, this work contributes to the field of reading comprehension by exploring the application of semi-supervised learning and highlighting its potential for improving performance in domains with limited training data.","label":108}
{"id":"d5293330-7f9f-46b7-b171-3564ceae114d","text":"This paper proposes a study of transfer learning in the context of QA from stories. A system is presented with a neural-network-based model trained on two context-question-answer datasets, the BookTest and CNN\/Daily Mail, to explore the potential of using data-rich domains to pre-train models for domains with limited training data. The authors aim to examine the transfer of this pre-trained model to subsets of bAbI, a set of artificial tasks designed to assess reasoning abilities, and SQuAD, a question-answering dataset representative of real-world applications. The experiments conducted reveal that there is limited transferability when the model has no exposure to training examples from the target domain. However, when shown a few target-domain examples, the results demonstrate promising improvements. Moreover, the authors demonstrate that the benefits of pre-training extend beyond word embeddings, indicating the effectiveness of the proposed approach. Overall, this paper offers valuable insights into the potential of semi-supervised learning and transfer learning in the context of reading comprehension. The findings open doors for further exploration in optimizing models for domains with limited training data, contributing to the advancement of natural language processing research and applications.","label":21}
{"id":"f532d13b-e5d1-430a-9c91-9b946b405c3f","text":"This work investigates the performance of transfer learning from resource-rich setup (BookTest, CNN\/Daily Mail corpora) to low-resource (bAbI, SQuAD benchmarks) settings. Experiments show poor improvements in 0-shot learning. However, when the model is exposed to few training instances some improvements are observed.\r\n\r\nThe claims made here require a more comprehensive analysis..However, the review lacks specific details about the methodology and evaluation metrics used in the experiments. Additionally, there is no mention of the limitations or potential biases in the study. Providing more information on the dataset sizes, model architecture, hyperparameters, and training procedures would enhance the understanding of the experiments. Furthermore, it would be beneficial to see a comparison of the proposed approach with other state-of-the-art methods in the field of semi-supervised learning for reading comprehension. Overall, while the preliminary findings are promising, a more in-depth analysis and extensive evaluation are necessary to validate the claims made in the paper.","label":50}
{"id":"c71987bf-f20e-447a-a629-3cd7b91b1e1b","text":".The paper presents a thorough investigation into the use of semi-supervised learning in reading comprehension tasks. The authors demonstrate the limited transfer without any target domain examples and highlight the significance of pre-training. Additionally, they show that pre-training goes beyond word embeddings. Overall, the findings are promising and contribute to our understanding of leveraging data-rich domains for training models in data-scarce domains.","label":0}
{"id":"33f5792e-079f-4651-a3dc-4836508ac5aa","text":"This work investigates the performance of transfer learning from resource-rich setup (BookTest, CNN\/Daily Mail corpora) to low-resource (bAbI, SQuAD benchmarks) settings. Experiments show poor improvements in 0-shot learning. However, when the model is exposed to few training instances some improvements are observed.\r\n\r\nThe claims The results of this study suggest that transfer learning from data-rich domains to low-resource domains in reading comprehension tasks is challenging. The experiments show limited improvement in zero-shot learning, indicating that the model struggles to transfer knowledge without any training examples from the target domain. However, when the model is exposed to a few target-domain examples, promising results are observed. This finding highlights the potential of pre-training models on larger datasets and fine-tuning them on specific domains where training data are scarce. Additionally, the study goes beyond word embeddings by showing that the effect of pre-training extends to other aspects of the model. Overall, this research contributes to the understanding of semi-supervised learning in reading comprehension and the importance of carefully selecting and adapting pre-trained models for specific domains.","label":43}
{"id":"34dcb290-15ec-4b9f-bdb9-d7bee40a2edc","text":"The area chair agrees with reviewers 1 and 3 and is satisfied with the findings of the study. The experiments demonstrate the importance of pre-training in semi-supervised learning for reading comprehension tasks, particularly when limited training data is available in the target domain. The authors also successfully show that pre-training has effects beyond just word embeddings.","label":9}
{"id":"cc47f967-cdb8-42f4-b46f-334e02bc81c1","text":".The paper addresses the challenge of training deep learning models for reading comprehension in domains with limited training data. The authors explore the use of pre-training on data-rich domains and evaluate the transfer of knowledge to target domains. The experiments indicate limited transfer without any target domain training examples, but promising results when at least a few target-domain examples are included. Additionally, the paper discusses the broader impact of pre-training beyond word embeddings. Overall, the findings offer valuable insights into semi-supervised learning in reading comprehension.","label":0}
{"id":"0d03f81c-f27c-4945-acd2-ad185fbe0e6e","text":"First I would like to apologize for the delay in reviewing.\r\n\r\nsummary : This work explores several experiments to transfer training a specific model of reading comprehension ( AS Reader), in an artificial and well populated dataset in order to perform in another target dataset. \r\n\r\nHere is what I understand are their several experiments to transfer learning, but I am not 100% sure.\r\n1. The model is trained on the big artificial dataset and tested on the small target datasets (section 4.1)\r\n2. The The model is trained on the big artificial dataset and tested on the small target datasets (section 4.1). The results showed limited transfer if the model wasn't shown any training examples from the target domain. However, when the model was shown at least a few target-domain examples, the results were promising. This suggests that pre-training on data-rich domains can improve performance on domains with limited training data. Additionally, the study found that the effect of pre-training is not limited to word embeddings, indicating that other aspects of the model can benefit from pre-training as well. Overall, the experiments conducted in this paper shed light on the potential of semi-supervised learning in reading comprehension, particularly in domains where training data are scarce. The findings have implications for the development of more effective and efficient models in NLP tasks, providing valuable insights for further research in this area.","label":81}
{"id":"6091a95d-0643-4db3-96be-8252f042926e","text":"This paper proposes a study of transfer learning in the context of QA from stories. A system is presented with a a short story and has to answer a question about it. This paper studies how a system trained to answer questions on a dataset can eventually be used to answer questions from another dataset. The results are mostly negative: transfer seems almost non-existant.\r\n\r\nThis paper is centered around presenting negative results. Indeed the main hypothesis of transferring between QA datasets with the attention sum reader turns out impossible and one needs a small portion of labeled data from the target domain in order to achieve any meaningful transfer. Despite this negative finding, the paper's contribution lies in highlighting the importance of pre-training models on data-rich domains and the potential benefits of transfer learning in reading comprehension. The paper also sheds light on the limitations of current models and provides insights for future research directions. Overall, the study adds to the growing body of literature on semi-supervised learning in NLP and contributes to the understanding of using pre-training strategies to overcome data scarcity in specific domains of application.","label":99}
{"id":"913d7859-2b5a-4dae-bc57-b2895dde8f1a","text":"This work investigates the performance of transfer learning from resource-rich setup (BookTest, CNN\/Daily Mail corpora) to low-resource (bAbI, SQuAD datasets) for reading comprehension tasks. The authors propose using a neural-network-based model trained on the more abundant datasets to leverage the knowledge and generalize it to the low-resource domains. The study finds that without any target domain examples, the transfer is limited. However, when a few target-domain examples are provided, the performance significantly improves, indicating that pre-training indeed has a positive effect. Moreover, the research demonstrates that pre-training extends beyond word embeddings, suggesting that it can enhance various aspects of the model. The methodology used in the experiments is sound, with clear evaluation metrics and appropriate datasets utilized. The findings have important implications for real-world applications where training data availability is limited. In conclusion, this paper contributes to the existing literature on transfer learning in reading comprehension and provides valuable insights into the effectiveness of pre-training models in low-resource domains.","label":18}
{"id":"b3ae3144-22d5-464e-9ea4-72eaa1adb2f9","text":"Dear Authors,\r\n\r\nPlease resubmit your paper in the Please resubmit your paper in the revised format, addressing the limitations of limited transfer without target domain examples and further exploring the effects of pre-training beyond word embeddings. Additionally, more empirical evidence supporting the promising results would strengthen the paper overall.","label":7}